Report 123
7/21/2019 — http://slidepdf.com/reader/full/report-123-56d9d3a504823

1. INTRODUCTION

Automatic Speech Recognition

Automatic speech recognition is the process by which a computer maps an acoustic

speech signal to text. Automatic speech understanding is the process by which a computer

maps an acoustic speech signal to some form of abstract meaning of the speech.

What does speaker dependent / adaptive / independent mean?

A speaker dependent system is developed to operate for a single speaker. These systems are usually easier to develop, cheaper to buy and more accurate, but not as flexible as speaker adaptive or speaker independent systems.

A speaker independent system is developed to operate for any speaker of a particular type (e.g. American English). These systems are the most difficult to develop, the most expensive, and their accuracy is lower than that of speaker dependent systems. However, they are more flexible.

A speaker adaptive system is developed to adapt its operation to the characteristics of new speakers. Its difficulty lies somewhere between speaker independent and speaker dependent systems.

What does continuous speech or isolated-word mean?

An isolated-word system operates on single words at a time, requiring a pause between saying each word. This is the simplest form of recognition to perform because the end points are easier to find and the pronunciation of a word tends not to affect others. Thus, because the occurrences of words are more consistent, they are easier to recognize.

A continuous speech system operates on speech in which words are connected together, i.e. not separated by pauses. Continuous speech is more difficult to handle because of a variety of effects. First, it is difficult to find the start and end points of words. Another problem is "co-articulation". The production of each phoneme is affected by the


production of surrounding phonemes, and similarly the start and end of words are affected by the preceding and following words. The recognition of continuous speech is also affected by the rate of speech (fast speech tends to be harder).

How is speech recognition performed?

A wide variety of techniques are used to perform speech recognition. There are many types of speech recognition, and many levels of speech recognition analysis and understanding.

Typically speech recognition starts with the digital sampling of speech. The next stage is acoustic signal processing. Most techniques include spectral analysis, e.g. LPC analysis (Linear Predictive Coding), MFCC (Mel Frequency Cepstral Coefficients), cochlea modelling and many more.

The next stage is recognition of phonemes, groups of phonemes and words. This stage can be achieved by many processes such as DTW (Dynamic Time Warping), HMM (hidden Markov modelling), NNs (Neural Networks), expert systems and combinations of techniques. HMM-based systems are currently the most commonly used and most successful approach. Most systems utilize some knowledge of the language to aid the recognition process.

Some systems try to "understand" speech. That is, they try to convert the words into a representation of what the speaker intended to mean or achieve by what they said.

This is a simple recognizer that should give you 85%+ recognition accuracy. The accuracy is a function of the words you have in your vocabulary. Long distinct words are easy. Short similar words are hard. You can get 98+% on the digits with this recognizer.

Overview:
Find the beginning and end of the utterance.
Filter the raw signal into frequency bands.
Cut the utterance into a fixed number of segments.
Average data for each band in each segment.



Store this pattern with its name.
Collect a training set of about 3 repetitions of each pattern (word).
Recognize an unknown by comparing its pattern against all patterns in the training set and returning the name of the pattern closest to the unknown. Many variations upon the theme can be made to improve the performance. Try different filtering of the raw signal and different processing methods.
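As an illustration of the matching step in the overview above, here is a minimal pure-Python sketch; the function name `recognize` and the toy band/segment patterns are hypothetical, not part of the original recognizer.

```python
import math

def recognize(unknown, templates):
    """Return the name of the stored pattern closest to the unknown pattern,
    using Euclidean distance between flattened band/segment averages.
    `templates` maps word names to patterns of the same length."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(templates, key=lambda name: dist(unknown, templates[name]))

# Toy 4-value "band energy" patterns standing in for real training data.
templates = {"yes": [0.9, 0.1, 0.8, 0.2], "no": [0.1, 0.9, 0.2, 0.8]}
print(recognize([0.8, 0.2, 0.7, 0.3], templates))  # yes
```

A real recognizer would fill the templates with the band/segment averages computed from several recorded repetitions of each word.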

Automatic speech recognition and speaker verification are among the most challenging problems of modern man-machine interaction. Among their numerous useful applications is a future "checkless" society in which all financial transactions are executed over the telephone and "signed" by voice. Access to confidential data can be made secure by speaker certification. Other applications include voice information and reservation systems covering a wide spectrum of human activities, from travel and study to purchasing and partner matching. In these applications, spoken requests (over the telephone, say) are understood by machines and answered by synchronized voice. Voice control of computers and spacecraft (and machines in general whose operators have limited use of their hands) is an aspiration of long standing. Activation by voice could be particularly beneficial for the severely handicapped who have lost one or several limbs. The surgeon in the middle of an operation, needing the latest medical information, is another instance where only the acoustic channels are still fully available for requesting and receiving the urgently required advice. The sending of "manuscripts" by voice may supplement much present paper pushing or mouse play at the graphics terminals.

The potential applications of speech and speaker recognition are boundless. As early as 1944, speaker identification was used successfully by the allies to trace the movements of German combat units by analyzing speech spectrograms of enemy voice traffic. Remarkably, the human ear is often able to identify a telephone caller on the basis of a simple "hello" or just the clearing of his throat. But the difficulties of recognition by the machine can be staggering. Even if we forego automatic accent classification, and especially if we persuade the banks to live with less than perfection in voice signatures (which they really do not need, considering the large number of unsigned or falsely signed checks that clear the system every day), reliable voice recognition from large pools of potential speakers on the basis of their speech


alone will remain problematic for years to come. And, as widely appreciated by now, the automatic recognition of anything but isolated words from a limited vocabulary spoken by known speakers presents formidable difficulties. Decades of painstaking (and painful) research have shown that purely technical advances will result, at best, in limited improvements - far short of what is child's play for the human mind.

2. Principles of Speaker Recognition

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems.

Speaker recognition methods can also be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc. All technologies of speaker recognition - identification and verification, text-independent and text-dependent - have their own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific.

At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers. We will discuss each module in detail in later sections.


All speaker recognition systems have to serve two distinct phases. The first one is referred to as the enrollment sessions or training phase, while the second one is referred to as the operation sessions or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples. During the testing (operational) phase (see Figure 1), the input speech is matched with the stored reference model(s) and a recognition decision is made.

Speaker recognition is a difficult task and it is still an active research area. Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task has been challenged by the high variance of input speech signals. The principal source of variance comes from the speakers themselves. Speech signals in training and testing sessions can be greatly different due to many facts, such as people's voices changing with time, health conditions (e.g.


the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker

variability, that present a challenge to speaker recognition technology. Examples of these

are acoustical noise and variations in recording environments.

3. Speech Feature Extraction

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end.

The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over long periods of time (on the order of 1/5 seconds or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.

A wide range of possibilities exist for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and it will be used in this project.


Figure 2. An example of speech signal

MFCCs are based on the known variation of the human ear's critical bandwidths with frequency; filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. The process of computing MFCCs is described in more detail next.

4. Mel-frequency cepstrum coefficients processor


A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such sampled signals can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs have been shown to be less susceptible to the mentioned variations than the speech waveforms themselves.

Figure 3. Block diagram of the MFCC processor

4.1 Frame Blocking

In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to ~30 msec windowing and facilitates the fast radix-2 FFT) and M = 100.
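The frame-blocking rule above can be sketched in a few lines; this is a Python illustration (the project itself uses Matlab, and the helper name `frame_block` is hypothetical).

```python
def frame_block(signal, N=256, M=100):
    """Split `signal` into overlapping frames of N samples whose start
    points are M samples apart, so consecutive frames overlap by N - M."""
    frames = []
    start = 0
    while start + N <= len(signal):
        frames.append(signal[start:start + N])
        start += M
    return frames

samples = list(range(1000))
frames = frame_block(samples)
print(len(frames))    # 8 full frames fit inside 1000 samples
print(frames[1][0])   # 100: the second frame begins M samples in
```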


4.2 Windowing

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 <= n <= N - 1, where N is the number of samples in each frame, then the result of windowing is the signal

y(n) = x(n) w(n),  0 <= n <= N - 1

Typically the Hamming window is used, which has the form:

w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)),  0 <= n <= N - 1
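For concreteness, here is a small Python sketch of the windowing step, assuming the standard Hamming form w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)); the project itself does this with Matlab's hamming function.

```python
import math

def hamming(N):
    """Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def window(frame):
    """Taper a frame toward zero at both ends: y(n) = x(n) * w(n)."""
    return [x * w for x, w in zip(frame, hamming(len(frame)))]

w = hamming(256)
print(round(w[0], 2), round(w[255], 2))  # 0.08 0.08 (edges are tapered)
```

Note that the Hamming window tapers to 0.08 rather than exactly zero at the edges, which is its standard behavior.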

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm to implement the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:

X_n = sum over k = 0..N-1 of x_k * exp(-2*pi*j*k*n / N),  n = 0, 1, 2, ..., N - 1

Note that we use j here to denote the imaginary unit, i.e. j = sqrt(-1). In general the X_n's are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 <= n <= N/2 - 1, while negative frequencies -Fs/2 < f < 0 correspond to N/2 + 1 <= n <= N - 1. Here, Fs denotes the sampling frequency. The result obtained after this step is often referred to as the signal's spectrum or periodogram.
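The DFT definition and its frequency interpretation can be checked directly with a naive Python implementation (an FFT computes the same values, only faster; this direct form just mirrors the definition).

```python
import cmath, math

def dft(frame):
    """Naive DFT: X_n = sum_k x_k * exp(-2*pi*j*k*n/N) for n = 0..N-1."""
    N = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * n / N)
                for k, x in enumerate(frame))
            for n in range(N)]

# A pure tone completing 5 cycles in the frame shows up in bin n = 5
# (positive frequency) and its mirror bin n = N - 5 (negative frequency).
N = 64
tone = [math.cos(2 * math.pi * 5 * k / N) for k in range(N)]
power = [abs(X) ** 2 for X in dft(tone)]
print(max(range(N), key=lambda n: power[n]) in (5, N - 5))  # True
```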


4.3 Mel-frequency wrapping

As mentioned above, psychophysical studies have shown that human perception of the frequency contents of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the "mel" scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:

mel(f) = 2595 * log10(1 + f / 700)

One approach to simulating

the subjective spectrum is to use a filter bank, one filter for each desired mel-frequency component (see Figure 4). That filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(w) thus consists of the output power of these filters when S(w) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20.
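Assuming the standard approximation mel(f) = 2595 * log10(1 + f/700) for the mel scale described above, a quick Python check shows the 1 kHz reference point and the compression of high frequencies:

```python
import math

def hz_to_mel(f):
    """Approximate subjective pitch in mels for a frequency f in Hz."""
    return 2595 * math.log10(1 + f / 700.0)

print(round(hz_to_mel(1000)))   # 1000: the 1 kHz reference point
print(round(hz_to_mel(500)))    # roughly linear below 1 kHz
print(round(hz_to_mel(8000)))   # strongly compressed above 1 kHz
```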

Note that this filter bank is applied in the frequency domain; therefore it simply amounts to taking those triangle-shaped windows in Figure 4 on the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins have overlap) in the frequency domain.


Figure 4. An example of a mel-spaced filter bank

4.4 Cepstrum

In this final step, we convert the log mel spectrum back to time. The result is called the mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that are the result of the last step as S_k, k = 1, 2, ..., K, we can calculate the MFCCs, c_n, as

c_n = sum over k = 1..K of (log S_k) * cos(n * (k - 1/2) * pi / K),  n = 1, 2, ..., K


Note that we exclude the first component, c_0, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
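The DCT step, including the exclusion of c_0, can be sketched in Python as follows (the function name and the toy filter-bank outputs are illustrative; the project computes this with Matlab's dct function).

```python
import math

def mfcc_from_mel_power(S, num_coeffs=12):
    """Convert K mel power-spectrum coefficients S[0..K-1] (all > 0) to
    cepstral coefficients via the DCT of their logs:
    c_n = sum_{k=1..K} log(S_k) * cos(n * (k - 1/2) * pi / K),
    keeping n = 1..num_coeffs and dropping c_0 (the mean log energy)."""
    K = len(S)
    logs = [math.log(s) for s in S]
    return [sum(logs[k] * math.cos(n * (k + 0.5) * math.pi / K)
                for k in range(K))
            for n in range(1, num_coeffs + 1)]

mel_power = [1.0 + 0.1 * k for k in range(20)]  # toy filter-bank outputs
c = mfcc_from_mel_power(mel_power)
print(len(c))  # 12 coefficients, with c_0 excluded
```

A useful sanity check: a perfectly flat mel spectrum carries no spectral shape, so every kept coefficient comes out (numerically) zero.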

5. Feature Matching

The problem of speaker recognition belongs to a much broader topic in science and engineering, so-called pattern recognition. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns, and in our case are sequences of acoustic vectors that are extracted from an input speech using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied on extracted features, it can also be referred to as feature matching.

Furthermore, if there exists some set of patterns whose individual classes are already known, then one has a problem in supervised pattern recognition. This is exactly our case, since during the training session we label each input speech with the ID of the speaker (S1 to S8). These patterns comprise the training set and are used to derive a classification algorithm. The remaining patterns are then used to test the classification algorithm; these patterns are collectively referred to as the test set. If the correct classes of the individual patterns in the test set are also known, then one can evaluate the performance of the algorithm.

The state of the art in feature matching techniques used in speaker recognition includes Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). In this project, the VQ approach will be used, due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook.


Figure 5 shows a conceptual diagram to illustrate this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1 while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The resulting codewords (centroids) are shown in Figure 5 by black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called a VQ-distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.

Figure 5. Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based on the location of centroids.
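The recognition rule described above (smallest total VQ distortion wins) can be sketched in Python; the speaker names and toy 2-D codebooks below are purely illustrative, standing in for the real MFCC space.

```python
import math

def vq_distortion(vectors, codebook):
    """Total VQ distortion: for each acoustic vector, the distance to the
    closest codeword in `codebook`, summed over the whole utterance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sum(min(dist(v, c) for c in codebook) for v in vectors)

def identify(vectors, codebooks):
    """Return the speaker whose codebook yields the smallest distortion."""
    return min(codebooks, key=lambda spk: vq_distortion(vectors, codebooks[spk]))

# Toy 2-D acoustic space: speaker1 clusters near the origin, speaker2 away.
codebooks = {"speaker1": [(0, 0), (1, 1)], "speaker2": [(5, 5), (6, 6)]}
print(identify([(0.2, 0.1), (0.9, 1.2)], codebooks))  # speaker1
```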

5.1 Clustering the Training Vectors

After the enrollment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword y_n into y_n(1 + e) and y_n(1 - e), where n varies from 1 to the current size of the codebook, and e is a splitting parameter (we choose e = 0.01).
3. Nearest-Neighbor Search: for each training vector, find the codeword in the current codebook that is closest (in terms of similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook of size M is designed.

Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained. Figure 6 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbor search procedure which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged.

Figure 6. Flow diagram of the LBG algorithm (adapted from Rabiner and Juang, 1993)
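A compact Python sketch of this splitting-and-refining procedure is given below. It is illustrative only (the project's vqlbg is written in Matlab), and this version assumes M is a power of two.

```python
import math

def lbg(vectors, M, eps=0.01, threshold=1e-3):
    """LBG clustering sketch: start from the global centroid, repeatedly
    split each codeword y into y*(1+eps) and y*(1-eps), then alternate
    nearest-neighbor assignment and centroid update until the total
    distortion stops improving; grow until M codewords are reached."""
    def centroid(vs):
        return tuple(sum(v[i] for v in vs) / len(vs) for i in range(len(vs[0])))
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    codebook = [centroid(vectors)]
    while len(codebook) < M:
        # Step 2: split every codeword.
        codebook = [tuple(c * (1 + s) for c in y)
                    for y in codebook for s in (eps, -eps)]
        prev = float("inf")
        while True:
            # Step 3: nearest-neighbor search.
            cells = [[] for _ in codebook]
            total = 0.0
            for v in vectors:
                i = min(range(len(codebook)), key=lambda j: dist(v, codebook[j]))
                cells[i].append(v)
                total += dist(v, codebook[i])
            # Step 4: centroid update (empty cells keep their codeword).
            codebook = [centroid(c) if c else codebook[i]
                        for i, c in enumerate(cells)]
            # Step 5: stop when the distortion no longer improves.
            if prev - total < threshold * max(total, 1.0):
                break
            prev = total
    return codebook

data = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
book = lbg(data, 2)
print(len(book))  # 2: one codeword settles on each cluster
```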

6. Implementation


As stated above, in this project we will experience building and testing an automatic speaker recognition system. In order to implement such a system, one must go through several steps, which were described in detail in previous sections. Note that many of the above tasks are already implemented in Matlab. Furthermore, to ease the development process, we supplied you with two utility functions, melfb and disteu, and two main functions, train and test. Download all of those files into your working folder. The first two files can be treated as a black box, but the latter two need to be thoroughly understood. In fact, your tasks are to write two missing functions, mfcc and vqlbg, which will be called from the given main functions. In order to accomplish that, follow each step in this section carefully and answer all the questions.

Speech Data

Click here to download the ZIP file of the speech database. After unzipping the file correctly, you will find two folders, TRAIN and TEST, each containing 8 files, named S1.WAV, S2.WAV, ..., S8.WAV; each is labeled after the ID of the speaker. These files were recorded in Microsoft WAV format. In Windows systems, you can listen to the recorded sounds by double-clicking the files.

Our goal is to train a voice model (or more specifically, a VQ codebook in the MFCC vector space) for each speaker S1-S8 using the corresponding sound file in the TRAIN folder. After this training step, the system would have knowledge of the voice characteristic of each (known) speaker. Next, in the testing phase, the system will be able to identify the (assumed unknown) speaker of each sound file in the TEST folder.

Question 1: Play each sound file in the TRAIN folder. Can you distinguish the voices of those eight speakers? Now play each sound in the TEST folder in a random order without looking at the file name (pretending that you do not know the speaker) and try to identify the speaker using your knowledge of their voices that you just learned from the TRAIN folder. This is exactly what the computer will do in our system. What is your (human performance) recognition rate? Record this result so that it can later be compared against the computer performance of our system.

Speech Processing


In this phase you are required to write a Matlab function that reads a sound file and turns it into a sequence of MFCCs (acoustic vectors) using the speech processing steps described previously. Many of those tasks are already provided by either standard or our supplied Matlab functions. The Matlab functions that you will need to use are wavread, hamming, fft, dct and melfb (supplied function). Type help function_name at the Matlab prompt for more information about a function.

Question 2: Read a sound file into Matlab. Check it by playing the sound file in Matlab using the function sound. What is the sampling rate? What is the highest frequency that the recorded sound can capture with fidelity? With that sampling rate, how many msecs of actual speech are contained in a block of 256 samples?

Plot the signal to view it in the time domain. It should be obvious that the raw data in the time domain carries a very large amount of data and is difficult to use for analyzing the voice characteristic. So the motivation for this step (speech feature extraction) should be clear now!

Now cut the speech signal (a vector) into frames with overlap (refer to the frame blocking section in the theory part). The result is a matrix where each column is a frame of N samples from the original speech signal. Apply the steps Windowing and FFT to transform the signal into the frequency domain. This process is used in many different applications and is referred to in the literature as the Windowed Fourier Transform (WFT) or Short-Time Fourier Transform (STFT). The result is often called the spectrum or periodogram.

Question 3: After successfully running the preceding process, what is the interpretation of the result? Compute the power spectrum and plot it out using the imagesc command. Note that it is better to view the power spectrum on the log scale. Locate the region in the plot that contains most of the energy. Translate this location into the actual ranges in time (msec) and frequency (in Hz) of the input speech signal.

Question 4: Compute and plot the power spectrum of a speech file using different frame sizes: for example N = 128, 256 and 512. In each case, set the frame increment M to be about N/3. Can you describe and explain the differences among those spectra?


The last step in speech processing is converting the power spectrum into mel-frequency cepstrum coefficients. The supplied function melfb facilitates this task.

Question 5: Type help melfb at the Matlab prompt for more information about this function. Follow the guidelines to plot out the mel-spaced filter bank. What is the behavior of this filter bank? Compare it with the theoretical part.

Finally, put all the pieces together into a single Matlab function, mfcc, which performs the MFCC processing.

Question 6: Compute and plot the spectrum of a speech file before and after the mel-frequency wrapping step. Describe and explain the impact of the melfb program.

Vector Quantization

The result of the last section is that we have transformed speech signals into vectors in an acoustic space. In this section, we will apply the VQ-based pattern recognition technique to build speaker reference models from those vectors in the training phase, and then we can identify any sequence of acoustic vectors uttered by an unknown speaker.

Question 7: To inspect the acoustic space (MFCC vectors) we can pick any two dimensions (say the 5th and the 6th) and plot the data points in a 2D plane. Use acoustic vectors of two different speakers and plot the data points in two different colors. Do the data regions from the two speakers overlap each other? Are they in clusters?

Now write a Matlab function, vqlbg, that trains a VQ codebook using the LBG algorithm described before. Use the supplied utility function disteu to compute the pairwise Euclidean distances between the codewords and training vectors in the iterative process.

Question 8: Plot the data points of the trained VQ codewords using the same two dimensions over the plot from the last question. Compare this with Figure 5.


Simulation and Evaluation

Now is the final part! Use the two supplied programs, train and test (which require the two functions mfcc and vqlbg that you just wrote), to simulate the training and testing procedures in a speaker recognition system, respectively.

Question 9: What recognition rate can our system achieve? Compare this with the human performance. For the cases where the system makes errors, re-listen to the speech files and try to come up with some explanations.

Figure 1: plot of signal s1.wav

Question 10 (optional): You can also test the system with your own speech files. Use the Windows program Sound Recorder to record more voices from yourself and your friends. Each new speaker needs to provide one speech file for training and one for testing. Can the system recognize your voice? Enjoy!



Figure 2.a: Power Spectrum (M=100, N=256)
Figure 2.b: Logarithmic Power Spectrum (M=100, N=256)


Figure 3.a: Power Spectrum (M=43, N=128, frames=767)
Figure 3.b: Power Spectrum (M=85, N=256, frames=387)


Figure 3.c: Power Spectrum (M=171, N=512, frames=191)
Figure 4: Mel-Spaced Filterbank


Figure 5.a: Power Spectrum Unmodified
Figure 5.b: Power Spectrum Modified through the Mel Cepstrum filter


Figure 6: 2D plot of acoustic vectors


:3

Page 26: Report 123

7/21/2019 Report 123

http://slidepdf.com/reader/full/report-123-56d9d3a504823 26/47

Figure 7: 2D plot of acoustic vectors


Figure 8.b: 2D plot of acoustic vectors

8. APPENDIX

INTRODUCTION TO MATLAB

MATLAB (short for Matrix Laboratory) is a special-purpose computer program optimized to perform engineering and scientific calculations. It started life as a program designed to perform matrix mathematics, but over the years it has grown into a flexible computing system capable of solving essentially any technical problem. The MATLAB program implements the MATLAB programming language and provides an extensive library of predefined functions that makes technical programming easier and more efficient.

MATLAB is a huge program with an incredibly rich variety of functions. Even the basic version of MATLAB, without any toolkits, is much richer than other technical programming languages. There are more than 1000 functions in the basic MATLAB product alone, and the toolkits extend this capability with many more functions in various specialties.

Advantages of MATLAB:

MATLAB has many advantages compared with conventional computer languages for technical problem solving. Among them are the following:

Ease of use: MATLAB is an interpreted language, like many versions of BASIC. The program can be used as a scratch pad to evaluate expressions typed at the


command line, or it can be used to execute large prewritten programs. Programs may be easily written and modified with the built-in integrated development environment and debugged with the MATLAB debugger.

Platform independence: MATLAB is supported on many different computer systems, providing a large measure of platform independence.

Predefined functions: MATLAB comes complete with an extensive library of predefined functions that provide tested and prepackaged solutions to many basic technical tasks. Beyond the basic MATLAB language, many special-purpose toolboxes are available to help solve complex problems in specific areas.

Device-independent plotting: Unlike most other computer languages, MATLAB has many integral plotting and imaging commands. The plots and images can be displayed on any graphical output device supported by the computer on which MATLAB is running. This capability makes MATLAB an outstanding tool for visualizing technical data.

Graphical user interface: MATLAB includes tools that allow a programmer to interactively construct a GUI for his or her program.

MATLAB compiler: MATLAB's flexibility and platform independence are achieved by compiling MATLAB programs into device-independent p-code and then interpreting the p-code instructions at run time. A separate MATLAB compiler is also available; it can compile a MATLAB program into an executable that runs faster than the interpreted code.

MATLAB commands used in the source code:

LENGTH: Length of vector.
LENGTH(X) returns the length of vector X. It is equivalent to MAX(SIZE(X)) for non-empty arrays and 0 for empty ones.

FLOOR: Round towards minus infinity.
FLOOR(X) rounds the elements of X to the nearest integers towards minus infinity.


HAMMING: Hamming window.
HAMMING(N) returns the N-point symmetric Hamming window in a column vector.
HAMMING(N,SFLAG) generates the N-point Hamming window using SFLAG window sampling. SFLAG may be either 'symmetric' or 'periodic'. By default, a symmetric window is returned.
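In this project the Hamming window is applied to each speech frame before taking the FFT. A hedged sketch of that framing step, using the frame length N = 256 and frame shift M = 100 that appear in the figure captions (the signal here is a stand-in, not real speech):

```matlab
% Split a signal into overlapping frames and apply a Hamming window.
s = randn(1000, 1);          % stand-in for a speech signal
N = 256;                     % frame length in samples
M = 100;                     % frame shift, so consecutive frames overlap by N-M
w = hamming(N);              % N-point symmetric Hamming window
nframes = 1 + floor((length(s) - N) / M);
frames = zeros(N, nframes);
for i = 1:nframes
    seg = s((i-1)*M + (1:N));    % extract the i-th frame
    frames(:, i) = seg .* w;     % window it to reduce spectral leakage
end
```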

DIAG: Diagonal matrices and diagonals of a matrix.
DIAG(V,K) when V is a vector with N components is a square matrix of order N+ABS(K) with the elements of V on the K-th diagonal. K = 0 is the main diagonal, K > 0 is above the main diagonal and K < 0 is below the main diagonal.
DIAG(V) is the same as DIAG(V,0) and puts V on the main diagonal. DIAG(X,K) when X is a matrix is a column vector formed from the elements of the K-th diagonal of X.
DIAG(X) is the main diagonal of X. DIAG(DIAG(X)) is a diagonal matrix.
Example:
m = 5;
diag(-m:m) + diag(ones(2*m,1),1) + diag(ones(2*m,1),-1)
produces a tridiagonal matrix of order 2*m+1.

FFT: Discrete Fourier transform.
FFT(X) is the discrete Fourier transform (DFT) of vector X. For matrices, the FFT operation is applied to each column. For N-D arrays, the FFT operation operates on the first non-singleton dimension.
FFT(X,N) is the N-point FFT, padded with zeros if X has less than N points and truncated if it has more.
FFT(X,[],DIM) or FFT(X,N,DIM) applies the FFT operation across the dimension DIM.
For a length-N input vector x, the DFT is a length-N vector X, with elements


X(k) = sum over n = 1..N of x(n)*exp(-j*2*pi*(k-1)*(n-1)/N), for 1 <= k <= N.
The inverse DFT (computed by IFFT) is given by
x(n) = (1/N) * sum over k = 1..N of X(k)*exp(j*2*pi*(k-1)*(n-1)/N), for 1 <= n <= N.
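The two formulas are exact inverses of each other, which is easy to confirm numerically — an illustrative check, not part of the original report's code:

```matlab
% Round-trip check: IFFT(FFT(x)) recovers x to machine precision,
% and the one-sided power spectrum uses bins 1..N/2+1.
N = 256;
x = randn(N, 1);                 % arbitrary real test signal
X = fft(x);                      % forward DFT as in the formula above
err = max(abs(ifft(X) - x));     % on the order of 1e-15
P = abs(X(1:N/2+1)).^2;          % one-sided power spectrum
```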

WAVREAD: Read Microsoft WAVE ('.wav') sound file.
Y=WAVREAD(FILE) reads a WAVE file specified by the string FILE, returning the sampled data in Y. The '.wav' extension is appended if no extension is given. Amplitude values are in the range [-1,+1].
[Y,FS,NBITS]=WAVREAD(FILE) returns the sample rate (FS) in Hertz and the number of bits per sample (NBITS) used to encode the data in the file.
[...]=WAVREAD(FILE,N) returns only the first N samples from each channel in the file.
[...]=WAVREAD(FILE,[N1 N2]) returns only samples N1 through N2 from each channel in the file.
SIZ=WAVREAD(FILE,'size') returns the size of the audio data contained in the file in place of the actual audio data, returning the vector SIZ=[samples channels].
[Y,FS,NBITS,OPTS]=WAVREAD(...) returns a structure OPTS of additional information contained in the WAV file. The content of this structure differs from file to file. Typical structure fields include '.fmt' (audio format information) and '.info' (text which may describe subject, title, copyright, etc.).
WAVREAD supports multi-channel data, with up to 32 bits per sample.

DISP: Display array.
DISP(X) displays the array, without printing the array name. In all other ways it is the same as leaving the semicolon off an expression, except that empty arrays don't display. If X is a string, the text is displayed.

AXIS: Control axis scaling and appearance.
AXIS([XMIN XMAX YMIN YMAX]) sets scaling for the x- and y-axes on the current plot.


AXIS([XMIN XMAX YMIN YMAX ZMIN ZMAX]) sets the scaling for the x-, y- and z-axes on the current 3-D plot.
AXIS([XMIN XMAX YMIN YMAX ZMIN ZMAX CMIN CMAX]) sets the scaling for the x-, y-, z-axes and color scaling limits on the current axis (see CAXIS).
V = AXIS returns a row vector containing the scaling for the current plot. If the current view is 2-D, V has four components; if it is 3-D, V has six components.
AXIS AUTO returns the axis scaling to its default, automatic mode where, for each dimension, 'nice' limits are chosen based on the extents of all line, surface, patch, and image children.
AXIS MANUAL freezes the scaling at the current limits, so that if HOLD is turned on, subsequent plots will use the same limits.
AXIS TIGHT sets the axis limits to the range of the data.
AXIS IJ puts MATLAB into its 'matrix' axes mode. The coordinate system origin is at the upper left corner. The i axis is vertical and is numbered from top to bottom. The j axis is horizontal and is numbered from left to right.
AXIS XY puts MATLAB into its default 'Cartesian' axes mode. The coordinate system origin is at the lower left corner. The x axis is horizontal and is numbered from left to right. The y axis is vertical and is numbered from bottom to top.
AXIS EQUAL sets the aspect ratio so that equal tick mark increments on the x-, y- and z-axis are equal in size. This makes SPHERE(25) look like a sphere, instead of an ellipsoid.
AXIS IMAGE is the same as AXIS EQUAL except that the plot box fits tightly around the data.
AXIS SQUARE makes the current axis box square in size.
AXIS NORMAL restores the current axis box to full size and removes any restrictions on the scaling of the units. This undoes the effects of AXIS SQUARE and AXIS EQUAL.
AXIS VIS3D freezes aspect ratio properties to enable rotation of 3-D objects and overrides stretch-to-fill.
AXIS OFF turns off all axis labeling, tick marks and background.
AXIS ON turns axis labeling, tick marks and background back on.


IMAGESC: Scale data and display as image.
IMAGESC(...) is the same as IMAGE(...) except the data is scaled to use the full colormap.
IMAGESC(...,CLIM) where CLIM = [CLOW CHIGH] can specify the scaling.

COLORBAR: Display color bar (color scale).
COLORBAR('vert') appends a vertical color scale to the current axes.
COLORBAR('horiz') appends a horizontal color scale.
COLORBAR(H) places the colorbar in the axes H. The colorbar will be horizontal if the axes width > height (in pixels).
COLORBAR without arguments either adds a new vertical color scale or updates an existing colorbar.
H = COLORBAR(...) returns a handle to the colorbar axes.
COLORBAR(...,'peer',AX) creates a colorbar associated with axes AX instead of the current axes.

GET: Get object properties.
V = GET(H,'PropertyName') returns the value of the specified property for the graphics object with handle H. If H is a vector of handles, then GET will return an M-by-1 cell array of values where M is equal to length(H). If 'PropertyName' is replaced by a 1-by-N or N-by-1 cell array of strings containing property names, then GET will return an M-by-N cell array of values.
GET(H) displays all property names and their current values for the graphics object with handle H.
V = GET(H), where H is a scalar, returns a structure where each field name is the name of a property of H and each field contains the value of that property.
V = GET(0,'Factory')
V = GET(0,'Factory<ObjectType>')
V = GET(0,'Factory<ObjectType><PropertyName>')
returns for all object types the factory values of all properties which have user-settable default values.
V = GET(H,'Default')


V = GET(H,'Default<ObjectType>')
V = GET(H,'Default<ObjectType><PropertyName>')
returns information about default property values (H must be scalar). 'Default' returns a list of all default property values currently set on H. 'Default<ObjectType>' returns only the defaults for properties of <ObjectType> set on H. 'Default<ObjectType><PropertyName>' returns the default value for the specific property, by searching the defaults set on H and its ancestors, until that default is found. If no default value for this property has been set on H or any ancestor of H up through the root, then the factory value for that property is returned.
Defaults can not be queried on a descendant of the object, or on the object itself - for example, a value for 'DefaultAxesColor' can not be queried on an axes or an axes child, but can be queried on a figure or on the root. When using the 'Factory' or 'Default' GET, if PropertyName is omitted then the return value will take the form of a structure in which each field name is a property name and the corresponding value is the value of that property. If PropertyName is specified then a matrix or string value will be returned.

SET: Set object properties.
SET(H,'PropertyName',PropertyValue) sets the value of the specified property for the graphics object with handle H. H can be a vector of handles, in which case SET sets the properties' values for all the objects. SET(H,a), where a is a structure whose field names are object property names, sets the properties named in each field name with the values contained in the structure.
SET(H,pn,pv) sets the named properties specified in the cell array of strings pn to the corresponding values in the cell array pv for all objects specified in H. The cell array pn must be 1-by-N, but the cell array pv can be M-by-N where M is equal to length(H), so that each object will be updated with a different set of values for the list of property names contained in pn.
SET(H,'PropertyName1',PropertyValue1,'PropertyName2',PropertyValue2,...) sets multiple property values with a single statement. Note that it is permissible to use property/value string pairs, structures, and property/value cell array pairs in the same call to SET.
A = SET(H,'PropertyName')


SET(H,'PropertyName')
returns or displays the possible values for the specified property of the object with handle H. The returned array is a cell array of possible value strings, or an empty cell array if the property does not have a finite set of possible string values.
A = SET(H)
SET(H)
returns or displays all property names and their possible values for the object with handle H. The return value is a structure whose field names are the property names of H, and whose values are cell arrays of possible property values or empty cell arrays.
The default value for an object property can be set on any of an object's ancestors by setting the PropertyName formed by concatenating the string 'Default', the object type, and the property name. For example, to set the default color of text objects to red in the current figure window:
set(gcf,'DefaultTextColor','red')
Defaults can not be set on a descendant of the object, or on the object itself - for example, a value for 'DefaultAxesColor' can not be set on an axes or an axes child, but can be set on a figure or on the root.
Three strings have special meaning for PropertyValues:
'default' - use default value (from nearest ancestor)
'factory' - use factory default value
'remove' - remove default value.

ROUND: Round towards nearest integer.
ROUND(X) rounds the elements of X to the nearest integers.

SIZE: Size of array.
D = SIZE(X), for M-by-N matrix X, returns the two-element row vector D = [M, N] containing the number of rows and columns in the matrix. For N-D arrays, SIZE(X) returns a 1-by-N vector of dimension lengths. Trailing singleton dimensions are ignored.
[M,N] = SIZE(X), for matrix X, returns the number of rows and columns in X as separate output variables.
[M1,M2,M3,...,MN] = SIZE(X) returns the sizes of the first N


dimensions of array X. If the number of output arguments N does not equal NDIMS(X), then for:
N > NDIMS(X), SIZE returns ones in the 'extra' variables, i.e., outputs NDIMS(X)+1 through N.
N < NDIMS(X), MN contains the product of the sizes of the remaining dimensions, i.e., dimensions N+1 through NDIMS(X).
M = SIZE(X,DIM) returns the length of the dimension specified by the scalar DIM. For example, SIZE(X,1) returns the number of rows. When SIZE is applied to a Java array, the number of rows returned is the length of the Java array and the number of columns is always 1. When SIZE is applied to a Java array of arrays, the result describes only the top level array in the array of arrays.

SPRINTF: Write formatted data to string.
[S,ERRMSG] = SPRINTF(FORMAT,A,...) formats the data in the real part of matrix A (and in any additional matrix arguments), under control of the specified FORMAT string, and returns it in the MATLAB string variable S. ERRMSG is an optional output argument that returns an error message string if an error occurred, or an empty matrix if an error did not occur. SPRINTF is the same as FPRINTF except that it returns the data in a MATLAB string variable rather than writing it to a file.
FORMAT is a string containing C language conversion specifications. Conversion specifications involve the character %, optional flags, optional width and precision fields, an optional subtype specifier, and conversion characters d, i, o, u, x, X, f, e, E, g, G, c, and s.
The special formats \n, \r, \t, \b, \f can be used to produce linefeed, carriage return, tab, backspace, and formfeed characters respectively. Use \\ to produce a backslash character and %% to produce the percent character.
SPRINTF behaves like ANSI C with certain exceptions and extensions. These include:
ANSI C requires an integer cast of a double argument to correctly use an integer conversion specifier like d. A similar conversion is required when using such a specifier with non-integral MATLAB values. Use FIX, FLOOR, CEIL or ROUND on a double argument to explicitly convert non-integral MATLAB values to integral values


if you plan to use an integer conversion specifier like d. Otherwise, any non-integral MATLAB values will be output using a format in which the integer conversion specifier letter has been replaced. The following non-standard subtype specifiers are supported for conversion characters o, u, x, and X:
t - The underlying C datatype is a float rather than an unsigned integer.
b - The underlying C datatype is a double rather than an unsigned integer.
For example, to print out a double value in hex, use a format like '%bx'.
SPRINTF is 'vectorized' for the case when A is nonscalar. The format string is recycled through the elements of A (columnwise) until all the elements are used up. It is then recycled in a similar manner through any additional matrix arguments.
Examples:
sprintf('%0.5g',(1+sqrt(5))/2)      % 1.618
sprintf('%0.5g',1/eps)              % 4.5036e+15
sprintf('%15.5f',1/eps)             % 4503599627370496.00000
sprintf('%d',round(pi))             % 3
sprintf('%s','hello')               % hello
sprintf('The array is %dx%d.',2,3)  % The array is 2x3.
sprintf('\n') is the line termination character on all platforms.

MELFB: Determine matrix for a mel-spaced filterbank.
Inputs: p - number of filters in the filterbank; n - length of fft; fs - sample rate in Hz
Outputs: x - a (sparse) matrix containing the filterbank amplitudes, with size(x) = [p, 1+floor(n/2)]
Usage: For example, to compute the mel-scale spectrum of a column-vector signal s, with length n and sample rate fs:
f = fft(s);
m = melfb(p, n, fs);
n2 = 1 + floor(n/2);
z = m * abs(f(1:n2)).^2;
z would contain p samples of the desired mel-scale spectrum.
To plot the filterbanks, e.g.:
plot(linspace(0, (12500/2), 129), melfb(20, 256, 12500)'),


title('Mel-spaced filterbank'), xlabel('Frequency (Hz)');

ABS: Absolute value.
ABS(X) is the absolute value of the elements of X. When X is complex, ABS(X) is the complex modulus (magnitude) of the elements of X.

PLOT: Linear plot.
PLOT(X,Y) plots vector Y versus vector X. If X or Y is a matrix, then the vector is plotted versus the rows or columns of the matrix, whichever line up. If X is a scalar and Y is a vector, length(Y) disconnected points are plotted. PLOT(Y) plots the columns of Y versus their index. If Y is complex, PLOT(Y) is equivalent to PLOT(real(Y),imag(Y)). In all other uses of PLOT, the imaginary part is ignored.
Various line types, plot symbols and colors may be obtained with PLOT(X,Y,S) where S is a character string made from one element from any or all of the following 3 columns:
b blue        . point             - solid
g green       o circle            : dotted
r red         x x-mark            -. dashdot
c cyan        + plus              -- dashed
m magenta     * star
y yellow      s square
k black       d diamond
              v triangle (down)
              ^ triangle (up)
              < triangle (left)
              > triangle (right)
              p pentagram
              h hexagram
For example, PLOT(X,Y,'c+:') plots a cyan dotted line with a plus at each data point; PLOT(X,Y,'bd') plots a blue diamond at each data point but does not draw any line.
PLOT(X1,Y1,S1,X2,Y2,S2,X3,Y3,S3,...) combines the plots defined by the (X,Y,S) triples, where the X's and Y's are vectors or matrices and the S's are strings.
For example, PLOT(X,Y,'y-',X,Y,'go') plots the data twice, with a solid yellow line


interpolating green circles at the data points. The PLOT command, if no color is specified, makes automatic use of the colors specified by the axes ColorOrder property. The default ColorOrder is blue for one line, and for multiple lines, PLOT cycles through the first six colors in the table. For monochrome systems, PLOT cycles over the axes LineStyleOrder property.
PLOT returns a column vector of handles to LINE objects, one handle per line.

SUBPLOT: Create axes in tiled positions.
H = SUBPLOT(m,n,p), or SUBPLOT(mnp), breaks the Figure window into an m-by-n matrix of small axes, selects the p-th axes for the current plot, and returns the axis handle. The axes are counted along the top row of the Figure window, then the second row, etc. For example,
SUBPLOT(2,1,1), PLOT(income)
SUBPLOT(2,1,2), PLOT(outgo)
plots income on the top half of the window and outgo on the bottom half.
SUBPLOT(m,n,p), if the axis already exists, makes it current.
SUBPLOT(m,n,p,'replace'), if the axis already exists, deletes it and creates a new axis.
SUBPLOT(m,n,P), where P is a vector, specifies an axes position that covers all the subplot positions listed in P.
SUBPLOT(H), where H is an axis handle, is another way of making an axis current for subsequent plotting commands.
SUBPLOT('position',[left bottom width height]) creates an axis at the specified position in normalized coordinates (in the range from 0.0 to 1.0).
If a SUBPLOT specification causes a new axis to overlap an existing axis, the existing axis is deleted - unless the position of the new and existing axis are identical. For example, the statement SUBPLOT(1,2,1) deletes all existing axes overlapping the left side of the Figure window and creates a new axis on that side - unless there is an axes there with a position that exactly matches the position of the new axes (and 'replace' was not specified), in which case all other overlapping axes will be deleted and the matching axes will become the current axes.
SUBPLOT(111) is an exception to the rules above, and is not identical in behavior to SUBPLOT(1,1,1). For reasons of backwards compatibility, it is a special case of subplot which does not immediately create an axes, but instead sets up


the figure so that the next graphics command executes CLF RESET in the figure (deleting all children of the figure), and creates a new axes in the default position. This syntax does not return a handle, so it is an error to specify a return argument. The delayed CLF RESET is accomplished by setting the figure's NextPlot to 'replace'.

HOLD: Hold current graph.
HOLD ON holds the current plot and all axis properties so that subsequent graphing commands add to the existing graph.
HOLD OFF returns to the default mode, whereby PLOT commands erase the previous plots and reset all axis properties before drawing new plots.
HOLD, by itself, toggles the hold state.
HOLD does not affect axis autoranging properties.
Algorithm note:
HOLD ON sets the NextPlot property of the current figure and axes to 'add'.
HOLD OFF sets the NextPlot property of the current axes to 'replace'.

VQLBG: Vector quantization using the Linde-Buzo-Gray algorithm.
Inputs: d contains the training data vectors (one per column); k is the number of centroids required
Output: r contains the resulting VQ codebook (k columns, one for each centroid)
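The supplied vqlbg is not reproduced here, but the LBG idea it names — split each centroid, then alternately re-assign vectors to their nearest centroid and re-average — can be sketched as follows. This is an assumed, simplified illustration on synthetic data, not the project's implementation:

```matlab
% Simplified LBG sketch: one binary split followed by a few
% nearest-neighbour / centroid-update iterations.
d = [randn(2,50), randn(2,50) + 4];   % training vectors, one per column
r = mean(d, 2);                       % start from the global centroid
eps0 = 0.01;                          % splitting perturbation
r = [r*(1+eps0), r*(1-eps0)];         % split into two centroids
for iter = 1:10
    dist = zeros(size(r,2), size(d,2));
    for j = 1:size(r,2)               % squared distance to each centroid
        dist(j,:) = sum((d - repmat(r(:,j), 1, size(d,2))).^2, 1);
    end
    [~, idx] = min(dist, [], 1);      % nearest centroid for every vector
    for j = 1:size(r,2)               % update centroids (assumes no cell goes empty)
        r(:,j) = mean(d(:, idx == j), 2);
    end
end
```

Repeating the split-and-refine cycle doubles the codebook size each time until k centroids are reached.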

LEGEND: Graph legend.
LEGEND(string1,string2,string3,...) puts a legend on the current plot using the specified strings as labels. LEGEND works on line graphs, bar graphs, pie graphs, ribbon plots, etc. You can label any solid-colored patch or surface object. The fontsize and fontname for the legend strings match the axes fontsize and fontname.
LEGEND(H,string1,string2,string3,...) puts a legend on the plot containing the handles in the vector H, using the specified strings as labels for the corresponding handles.
LEGEND(M), where M is a string matrix or cell array of strings, and LEGEND(H,M), where H is a vector of handles to lines and patches, also work.
LEGEND(AX,...) puts a legend on the axes with handle AX.


LEGEND OFF removes the legend from the current axes.
LEGEND(AX,'off') removes the legend from the axis AX.
LEGEND HIDE makes the legend invisible.
LEGEND(AX,'hide') makes the legend on axis AX invisible.
LEGEND SHOW makes the legend visible.
LEGEND(AX,'show') makes the legend on axis AX visible.
LEGEND BOXOFF sets the appdata property legendboxon to 'off', making the legend background box invisible when the legend is visible.
LEGEND(AX,'boxoff') sets the appdata property legendboxon to 'off' for axis AX, making the legend background box invisible when the legend is visible.
LEGEND BOXON sets the appdata property legendboxon to 'on', making the legend background box visible when the legend is visible.
LEGEND(AX,'boxon') sets the appdata property legendboxon to 'on' for axis AX, making the legend background box visible when the legend is visible.
LEGH = LEGEND returns the handle to the legend on the current axes, or empty if none exists.
LEGEND with no arguments refreshes all the legends in the current figure (if any).
LEGEND(LEGH) refreshes the specified legend.
LEGEND(...,Pos) places the legend in the specified location:
0 = Automatic 'best' placement (least conflict with data)
1 = Upper right-hand corner (default)
2 = Upper left-hand corner
3 = Lower left-hand corner
4 = Lower right-hand corner
-1 = To the right of the plot
To move the legend, press the left mouse button on the legend and drag to the desired location. Double clicking on a label allows you to edit the label.
[LEGH,OBJH,OUTH,OUTM] = LEGEND(...) returns a handle LEGH to the legend axes; a vector OBJH containing handles for the text, lines, and patches in the legend; a vector OUTH of handles to the lines and patches in the plot; and a cell array OUTM containing the text in the legend.


LEGEND will try to install a ResizeFcn on the figure if one hasn't been defined before. This resize function will try to keep the legend the same size.
Examples:
x = 0:.2:12;
plot(x,bessel(1,x),x,bessel(2,x),x,bessel(3,x));
legend('First','Second','Third');
legend('First','Second','Third',-1)
b = bar(rand(10,5),'stacked'); colormap(summer); hold on
x = plot(1:10,5*rand(10,1),'marker','square','markersize',12,...
    'markeredgecolor','y','markerfacecolor',[.6 0 .6],...
    'linestyle','-','color','r','linewidth',2); hold off
legend([b,x],'Carrots','Peas','Peppers','Green Beans',...
    'Cucumbers','Eggplant')

TEST: Speaker Recognition - Testing Stage.
Inputs:
testdir : string name of the directory containing all test sound files
n : number of test files in testdir
code : codebooks of all trained speakers
Note: Sound files in testdir are supposed to be named s1.wav, s2.wav, ..., sn.wav
Example:
>> test('C:\data\amintest\', 8, code);

MFCC: Compute mel-frequency cepstrum coefficients.
Inputs: s contains the signal to analyze; fs is the sampling rate of the signal
Output: r contains the transformed signal

TRAIN: Speaker Recognition - Training Stage.
Inputs:
traindir : string name of the directory containing all training sound files


n : number of training files in traindir
Output:
code : trained VQ codebooks; code{i} for the i-th speaker
Example:
>> code = train('C:\data\amintrain\', 8);

DISTEU: Pairwise Euclidean distances between the columns of two matrices.
Inputs:
x, y : two matrices, each of whose columns is a data vector
Output:
d : element d(i,j) is the Euclidean distance between the column vectors X(:,i) and Y(:,j)
Note: The Euclidean distance D between two vectors X and Y is:
D = sum((x-y).^2).^0.5
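The contract above can be met with a direct double loop. This is an assumed sketch, not the supplied disteu:

```matlab
% Pairwise Euclidean distances between the columns of x and y.
function d = disteu_sketch(x, y)
    nx = size(x, 2);
    ny = size(y, 2);
    d = zeros(nx, ny);
    for i = 1:nx
        for j = 1:ny
            d(i, j) = sqrt(sum((x(:, i) - y(:, j)).^2));
        end
    end
end
% Example: disteu_sketch([0; 0], [3; 4]) returns 5.
```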

ZEROS: Zeros array.
ZEROS(N) is an N-by-N matrix of zeros.
ZEROS(M,N) or ZEROS([M,N]) is an M-by-N matrix of zeros.
ZEROS(M,N,P,...) or ZEROS([M N P ...]) is an M-by-N-by-P-by-... array of zeros.
ZEROS(SIZE(A)) is the same size as A and all zeros.

CEIL: Round towards plus infinity.
CEIL(X) rounds the elements of X to the nearest integers towards infinity.

MIN: Smallest component.
For vectors, MIN(X) is the smallest element in X. For matrices, MIN(X) is a row vector containing the minimum element from each column. For N-D arrays, MIN(X) operates along the first non-singleton dimension. [Y,I] = MIN(X) returns the indices of the minimum values in vector I. If the values along the first non-singleton dimension contain more than one minimal element, the index of the first one is returned. MIN(X,Y) returns an array the same size as X and Y with the smallest


elements taken from X or Y. Either one can be a scalar. [Y,I] = MIN(X,[],DIM) operates along the dimension DIM.
When complex, the magnitude MIN(ABS(X)) is used, and the angle ANGLE(X) is ignored. NaN's are ignored when computing the minimum.

SPARSE: Create sparse matrix.
S = SPARSE(X) converts a sparse or full matrix to sparse form by squeezing out any zero elements.
S = SPARSE(i,j,s,m,n,nzmax) uses the rows of [i,j,s] to generate an m-by-n sparse matrix with space allocated for nzmax nonzeros. The two integer index vectors, i and j, and the real or complex entries vector, s, all have the same length, nnz, which is the number of nonzeros in the resulting sparse matrix S. Any elements of s which have duplicate values of i and j are added together. There are several simplifications of this six-argument call.
S = SPARSE(i,j,s,m,n) uses nzmax = length(s).
S = SPARSE(i,j,s) uses m = max(i) and n = max(j).
S = SPARSE(m,n) abbreviates SPARSE([],[],[],m,n,0). This generates the ultimate sparse matrix, an m-by-n all-zero matrix. The argument s and one of the arguments i or j may be scalars, in which case they are expanded so that the first three arguments all have the same length. For example, this dissects and then reassembles a sparse matrix:
[i,j,s] = find(S);
[m,n] = size(S);
S = sparse(i,j,s,m,n);
So does this, if the last row and column have nonzero entries:
[i,j,s] = find(S);
S = sparse(i,j,s);
All of MATLAB's built-in arithmetic, logical and indexing operations can be applied to sparse matrices, or to mixtures of sparse and full matrices. Operations on sparse matrices return sparse matrices and operations on full matrices return full matrices. In most cases, operations on mixtures of sparse and full matrices return full matrices. The exceptions include situations where the result of a mixed operation is structurally sparse, e.g. A .* S is at least as sparse as S. Some operations, such as S >= 0, generate


'Big Sparse', or 'BS', matrices - matrices with sparse storage organization but few zero elements.


DCT : Discrete cosine transform.
Y = DCT(X) returns the discrete cosine transform of X. The vector Y is the same size as X and contains the discrete cosine transform coefficients. Y = DCT(X,N) pads or truncates the vector X to length N before transforming. If X is a matrix, the DCT operation is applied to each column. This transform can be inverted using IDCT.
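The DCT/IDCT pair described above can be sketched in Python with scipy.fft (an illustrative translation, assuming the orthonormal type-II DCT, which corresponds to MATLAB's DCT):

```python
import numpy as np
from scipy.fft import dct, idct

x = np.array([1.0, 2.0, 3.0, 4.0])

# Forward transform: Y is the same size as x.
y = dct(x, type=2, norm='ortho')

# IDCT inverts the transform, recovering x.
x_back = idct(y, type=2, norm='ortho')

assert y.shape == x.shape
assert np.allclose(x_back, x)
```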

MEAN : Average or mean value.
For vectors, MEAN(X) is the mean value of the elements in X. For matrices, MEAN(X) is a row vector containing the mean value of each column. For N-D arrays, MEAN(X) is the mean value of the elements along the first non-singleton dimension of X. MEAN(X,DIM) takes the mean along the dimension DIM of X.
Example: If X = [0 1 2; 3 4 5]
then mean(X,1) is [1.5 2.5 3.5] and mean(X,2) is [1; 4]
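The same mean-along-a-dimension behavior can be reproduced with NumPy (an illustrative translation; note that NumPy's axis is 0-based where MATLAB's DIM is 1-based):

```python
import numpy as np

X = np.array([[0, 1, 2],
              [3, 4, 5]])

col_means = X.mean(axis=0)  # like MATLAB mean(X,1): mean of each column
row_means = X.mean(axis=1)  # like MATLAB mean(X,2): mean of each row

assert np.allclose(col_means, [1.5, 2.5, 3.5])
assert np.allclose(row_means, [1.0, 4.0])
```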

Disadvantages of MATLAB:
MATLAB has two principal disadvantages. The first is that it is an interpreted language and can therefore execute more slowly than compiled languages. This problem can be mitigated by properly structuring the MATLAB program and by using the MATLAB compiler to compile the final MATLAB program before distribution and general use.
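The effect of "properly structuring" interpreted code can be illustrated in Python/NumPy (a sketch, not from the report): replacing an explicit element-by-element loop with one vectorized call pushes the work into compiled library code.

```python
import numpy as np

# Loop version: the interpreter executes each arithmetic step individually.
def sum_of_squares_loop(a):
    total = 0.0
    for v in a:
        total += v * v
    return total

# Vectorized version: a single call into optimized compiled code.
def sum_of_squares_vec(a):
    return float(np.dot(a, a))

x = np.array([1.0, 2.0, 3.0])
assert sum_of_squares_loop(x) == sum_of_squares_vec(x) == 14.0
```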

The second disadvantage is cost: a full copy of MATLAB is five to ten times more expensive than a conventional C or Fortran compiler.


CONCLUSION

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, etc.

This project dealt with a speaker-independent system, which is developed to operate for any speaker of a particular type (e.g. American English). Such systems are the most difficult to develop and the most expensive, and their accuracy is lower than that of speaker-dependent systems. However, they are more flexible.

By applying the procedure described above, for each speech frame of around 30 msec with overlap, a set of mel-frequency cepstrum coefficients is computed. These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector. Each input utterance is therefore transformed into a sequence of acoustic vectors. This project also discussed how those acoustic vectors can be used to represent and recognize the voice characteristics of the speaker.
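The acoustic-vector computation summarized above (framed signal, short-term power spectrum, mel-scale filterbank, logarithm, cosine transform) can be sketched in Python. This is a minimal illustration under assumed parameters (512-point FFT, 26 triangular filters, 13 coefficients), not the report's exact implementation:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters whose centers are equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, fs, n_fft=512, n_filters=26, n_ceps=13):
    # Short-term power spectrum of the windowed frame.
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Log energy in each mel-scale band, then a cosine transform:
    # the first n_ceps coefficients form the acoustic vector.
    fb = mel_filterbank(n_filters, n_fft, fs)
    log_energy = np.log(fb @ spectrum + 1e-10)
    return dct(log_energy, type=2, norm='ortho')[:n_ceps]

# One ~30 msec frame of a 440 Hz tone at an assumed 8 kHz sampling rate.
fs = 8000
frame = np.sin(2 * np.pi * 440 * np.arange(int(0.03 * fs)) / fs)
vec = mfcc_frame(frame, fs)  # one acoustic vector per frame
```

In a full system this function would be applied to every overlapping frame of the utterance, yielding the sequence of acoustic vectors described above.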

BIBLIOGRAPHY

L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.


L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978.

Y. Linde, A. Buzo and R.M. Gray, "An algorithm for vector quantizer design", IEEE Transactions on Communications, Vol. 28, pp. 84-95, 1980.

S. Furui, "Speaker-independent isolated word recognition using dynamic features of speech spectrum", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, No. 1, pp. 52-59, February 1986.

S. Furui, "An overview of speaker recognition technology", ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 1-9, 1994.

F.K. Soong, A.E. Rosenberg and B.H. Juang, "A vector quantization approach to speaker recognition", AT&T Technical Journal, Vol. 66, No. 2, pp. 14-26, March 1987.

comp.speech Frequently Asked Questions WWW site, http://svr-www.eng.cam.ac.uk/comp.speech/