
Digital Signal Processing Mini-Project: "An Automatic Speaker Recognition System"

Minh N. Do
Audio Visual Communications Laboratory

Swiss Federal Institute of Technology, Lausanne, Switzerland

1 Overview

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.

The goal of this project is to build a simple, yet complete and representative automatic speaker recognition system. Due to the limited space, we will only test our system on a very small (but already non-trivial) speech database. There are 8 female speakers, labeled from S1 to S8. All speakers uttered the same single digit "zero" once in a training session and once in a testing session later on. Those sessions are at least 6 months apart to simulate voice variation over time. Digit vocabularies are used very often in testing speaker recognition because of their applicability to many security applications. For example, users have to speak a PIN (Personal Identification Number) in order to gain access to the laboratory door, or users have to speak their credit card number over the telephone line. By checking the voice characteristics of the input utterance with an automatic speaker recognition system similar to the one we will develop, the system is able to add an extra level of security.

2 Principles of Speaker Recognition

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems.

Speaker recognition methods can also be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc.


All technologies of speaker recognition, identification and verification, text-independent and text-dependent, each has its own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific.

At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers. We will discuss each module in detail in later sections.

(a) Speaker identification: input speech → feature extraction → similarity against each reference model (Speaker #1, ..., Speaker #N) → maximum selection → identification result (speaker ID)

(b) Speaker verification: input speech → feature extraction → similarity against the reference model of the claimed speaker (Speaker #M) → threshold decision → verification result (accept/reject)

Figure 1. Basic structures of speaker recognition systems


All speaker recognition systems have to serve two distinct phases. The first one is referred to as the enrollment sessions or training phase, while the second one is referred to as the operation sessions or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples. During the testing (operational) phase (see Figure 1), the input speech is matched with the stored reference model(s) and a recognition decision is made.

Speaker recognition is a difficult task and it is still an active research area. Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance comes from the speakers themselves. Speech signals in training and testing sessions can be greatly different due to many factors, such as voices changing over time, health conditions (e.g. the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Examples of these are acoustical noise and variations in recording environments (e.g. the speaker uses different telephone handsets).

3 Speech Feature Extraction

3.1 Introduction

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end.

The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over longer periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.


Figure 2. An example of speech signal

A wide range of possibilities exist for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCCs are perhaps the best known and most popular, and they will be used in this project.

MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. The process of computing MFCCs is described in more detail next.

3.2 Mel-frequency cepstrum coefficients processor

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such sampled signals can capture all frequencies up to 5 kHz, which cover most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs have been shown to be less susceptible to the variations mentioned above than the speech waveforms themselves.



Figure 3. Block diagram of the MFCC processor

3.2.1 Frame Blocking

In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to ~30 msec of windowing and facilitates the fast radix-2 FFT) and M = 100.
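
As an illustration, the blocking step might be coded in Matlab along the following lines (a minimal sketch with our own variable names, not part of the supplied code; s is assumed to hold the speech samples as a column vector):

    N = 256;                          % frame length in samples
    M = 100;                          % frame increment (shift) in samples
    numFrames = 1 + floor((length(s) - N) / M);
    frames = zeros(N, numFrames);     % one frame per column
    for i = 1:numFrames
        frames(:, i) = s((i-1)*M + 1 : (i-1)*M + N);
    end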

3.2.2 Windowing

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N-1, where N is the number of samples in each frame, then the result of windowing is the signal

    y_l(n) = x_l(n) w(n),    0 ≤ n ≤ N-1

where x_l(n) is the l-th frame of the speech signal.

Typically the Hamming window is used, which has the form:

    w(n) = 0.54 - 0.46 cos(2πn / (N-1)),    0 ≤ n ≤ N-1
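
In Matlab the same window is available through the hamming function, so the windowing step can continue the frame-blocking sketch above (illustrative only):

    w = hamming(N);                                       % N-point Hamming window, column vector
    windowed = frames .* repmat(w, 1, size(frames, 2));   % taper every frame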

3.2.3 Fast Fourier Transform (FFT)

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm to implement the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:



    X_n = Σ_{k=0}^{N-1} x_k e^(-2πjkn/N),    n = 0, 1, 2, ..., N-1

Note that we use j here to denote the imaginary unit, i.e. j = √(-1). In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < F_s/2 correspond to the values 1 ≤ n ≤ N/2 - 1, while negative frequencies -F_s/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N - 1. Here, F_s denotes the sampling frequency.

The result obtained after this step is often referred to as the signal's spectrum or periodogram.
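
Continuing the Matlab sketch, the transform of every frame and the corresponding power spectrum could be obtained as follows (illustrative only):

    spec = fft(windowed);                   % N-point DFT of each frame (fft works column-wise)
    powspec = abs(spec(1:N/2+1, :)).^2;     % power spectrum, non-negative frequencies only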

3.2.4 Mel-frequency Wrapping

As mentioned above, psychophysical studies have shown that human perception of the frequency contents of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:

    mel(f) = 2595 * log10(1 + f/700)

One approach to simulating the subjective spectrum is to use a filter bank, one filter for each desired mel-frequency component (see Figure 4). That filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20.

Note that this filter bank is applied in the frequency domain; therefore it simply amounts to applying the triangular windows of Figure 4 to the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins have overlap) in the frequency domain.
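
As a quick sanity check on the formula, and a hint at how the wrapping is carried out, consider the following Matlab fragment (variable names are our own; the actual filter weights come from the supplied melfb function, whose interface should be checked with help melfb):

    mel = @(f) 2595 * log10(1 + f/700);   % Hz-to-mel conversion from the formula above
    mel(1000)                             % reference point: approximately 1000 mels
    % Because the filters act in the frequency domain, wrapping is just a matrix
    % product: if fb holds the triangular filter weights (one filter per row, one
    % FFT bin per column), then  melspec = fb * powspec  gives K mel spectrum
    % values per frame.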


Figure 4. An example of mel-spaced filterbank

3.2.5 Cepstrum

In this final step, we convert the log mel spectrum back to time. The result is called the mel-frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients resulting from the last step by S_k, k = 1, 2, ..., K, we can calculate the MFCCs, c_n, as

    c_n = Σ_{k=1}^{K} (log S_k) cos[ n (k - 1/2) π / K ],    n = 1, 2, ..., K

Note that we exclude the first component, c_0, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
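
In Matlab, assuming melspec is the K x numFrames matrix of mel power spectrum coefficients from the previous step, this final step could look like the sketch below (Matlab's dct uses a slightly different normalization than the formula above, but it preserves the same information):

    c = dct(log(melspec));    % DCT of the log mel spectrum, column by column
    c(1, :) = [];             % discard the first, mean-like coefficient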

3.3 Summary

By applying the procedure described above, for each speech frame of around 30 msec (with overlap) a set of mel-frequency cepstrum coefficients is computed.



These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector. Therefore each input utterance is transformed into a sequence of acoustic vectors. In the next section we will see how those acoustic vectors can be used to represent and recognize the voice characteristics of the speaker.

4 Feature Matching

4.1 Introduction

The problem of speaker recognition belongs to a much broader topic in science and engineering, so-called pattern recognition. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns, and in our case are the sequences of acoustic vectors that are extracted from an input speech signal using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied to extracted features, it can also be referred to as feature matching.

Furthermore, if there exists some set of patterns whose individual classes are already known, then one has a problem in supervised pattern recognition. This is exactly our case, since during the training session we label each input speech with the ID of the speaker (S1 to S8). These patterns comprise the training set and are used to derive a classification algorithm. The remaining patterns are then used to test the classification algorithm; these patterns are collectively referred to as the test set. If the correct classes of the individual patterns in the test set are also known, then one can evaluate the performance of the algorithm.

The state of the art in feature matching techniques used in speaker recognition includes Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). In this project, the VQ approach will be used, due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook.

Figure 5 shows a conceptual diagram to illustrate this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1 while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The resulting codewords (centroids) are shown in Figure 5 by black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called the VQ-distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.
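
In Matlab this decision rule takes only a few lines. The sketch below is illustrative: it assumes codebooks is a cell array with one trained codebook per registered speaker (codewords as columns), v is the matrix of acoustic vectors of the unknown utterance (one vector per column), and the supplied disteu utility returns pairwise distances with one row per vector of its first argument.

    best = 0;  dmin = inf;
    for k = 1:length(codebooks)
        d = disteu(v, codebooks{k});     % distances between every vector and every codeword
        dist = sum(min(d, [], 2));       % total VQ distortion for this speaker's codebook
        if dist < dmin
            dmin = dist;  best = k;      % keep the speaker with the smallest total distortion
        end
    end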



Figure 5. Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based on the location of centroids. (Adapted from Song et al., 1987)

4.2 Clustering the Training Vectors

After the enrollment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).

2. Double the size of the codebook by splitting each current codeword y_n according to the rule

    y_n^+ = y_n (1 + ε)
    y_n^- = y_n (1 - ε)

where n varies from 1 to the current size of the codebook, and ε is a splitting parameter (we choose ε = 0.01).

3. Nearest-Neighbor Search: for each training vector, find the codeword in the current codebook that is closest (in terms of the similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).

4. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.


5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.

6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.

Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts first by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained.

Figure 6 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbor search procedure which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged.


Figure 6. Flow diagram of the LBG algorithm (Adapted from Rabiner and Juang, 1993)


5 Implementation

As stated above, in this project we will experience the building and testing of an automatic speaker recognition system. In order to implement such a system, one must go through the several steps which were described in detail in the previous sections. Note that many of the above tasks are already implemented in Matlab. Furthermore, to ease the development process, we supply you with two utility functions: melfb and disteu; and two main functions: train and test. Download all of those files into your working folder. The first two files can be treated as black boxes, but the latter two need to be thoroughly understood. In fact, your tasks are to write two missing functions: mfcc and vqlbg, which will be called from the given main functions. In order to accomplish that, follow each step in this section carefully and answer all the questions.

5.1 Speech Data

Click here to download the ZIP file of the speech database. After unzipping the file correctly, you will find two folders, TRAIN and TEST, each containing 8 files named S1.WAV, S2.WAV, ..., S8.WAV; each is labeled with the ID of the speaker. These files were recorded in Microsoft WAV format. On Windows systems, you can listen to the recorded sounds by double-clicking the files.

Our goal is to train a voice model (or, more specifically, a VQ codebook in the MFCC vector space) for each speaker S1 - S8 using the corresponding sound file in the TRAIN folder. After this training step, the system will have knowledge of the voice characteristics of each (known) speaker. Next, in the testing phase, the system will be able to identify the (assumed unknown) speaker of each sound file in the TEST folder.

Question 1: Play each sound file in the TRAIN folder. Can you distinguish the voices of those eight speakers? Now play each sound in the TEST folder in a random order without looking at the file name (pretending that you do not know the speaker) and try to identify the speaker using your knowledge of their voices that you just learned from the TRAIN folder. This is exactly what the computer will do in our system. What is your (human) recognition rate? Record this result so that it can later be compared against the computer performance of our system.

5.2 Speech Processing

In this phase you are required to write a Matlab function that reads a sound file and turns it into a sequence of MFCCs (acoustic vectors) using the speech processing steps described previously. Many of those tasks are already provided by either standard Matlab functions or the ones we supplied. The Matlab functions that you will need to use are: wavread, hamming, fft, dct and melfb (supplied function). Type help function_name at the Matlab prompt for more information about a function.


Question 2: Read a sound file into Matlab. Check it by playing the sound file in Matlab using the function sound. What is the sampling rate? What is the highest frequency that the recorded sound can capture with fidelity? With that sampling rate, how many msec of actual speech are contained in a block of 256 samples?
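
As a hint on the arithmetic involved (the sampling rate used in the comment below is purely illustrative, not necessarily that of the supplied files):

    [s, fs] = wavread('S1.WAV');   % fs is the sampling rate stored in the file
    fs / 2                         % highest frequency captured with fidelity (Nyquist)
    256 / fs * 1000                % duration of a 256-sample block in msec
    % e.g. if fs were 12500 Hz, these would be 6250 Hz and 20.48 msec respectively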

Plot the signal to view it in the time domain. It should be obvious that the raw data in the time domain contains a very large amount of data and is difficult to use for analyzing the voice characteristics. So the motivation for this step (speech feature extraction) should be clear now!

Now cut the speech signal (a vector) into frames with overlap (refer to the frame blocking section in the theory part). The result is a matrix where each column is a frame of N samples from the original speech signal. Apply the Windowing and FFT steps to transform the signal into the frequency domain. This process is used in many different applications and is referred to in the literature as the Windowed Fourier Transform (WFT) or Short-Time Fourier Transform (STFT). The result is often called the spectrum or periodogram.

Question 3: After successfully running the preceding process, what is the interpretation of the result? Compute the power spectrum and plot it out using the imagesc command. Note that it is better to view the power spectrum on a log scale. Locate the region in the plot that contains most of the energy. Translate this location into the actual ranges in time (msec) and frequency (in Hz) of the input speech signal.

Question 4: Compute and plot the power spectrum of a speech file using different frame sizes: for example N = 128, 256 and 512. In each case, set the frame increment M to be about N/3. Can you describe and explain the differences among those spectra?

The last step in speech processing is converting the power spectrum into mel-frequency cepstrum coefficients. The supplied function melfb facilitates this task.

Question 5: Type help melfb at the Matlab prompt for more information about this function. Follow the guidelines to plot out the mel-spaced filter bank. What is the behavior of this filter bank? Compare it with the theoretical part.

Finally, put all the pieces together into a single Matlab function, mfcc, which performs the MFCC processing.
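
One possible skeleton for mfcc, stringing the earlier fragments together, is sketched below. Treat it as a starting point only: the calling convention assumed for melfb (number of filters, FFT length, sampling rate) is our guess and should be verified with help melfb.

    function c = mfcc(s, fs)
    % MFCC  Sketch: convert speech vector s (sampling rate fs) into a matrix of
    % mel-frequency cepstral vectors, one column per frame.
    s = s(:);                                     % make sure the signal is a column vector
    N = 256;  M = 100;  K = 20;                   % frame length, frame shift, number of filters
    fb = melfb(K, N, fs);                         % assumed interface of the supplied filter bank
    numFrames = 1 + floor((length(s) - N) / M);
    c = zeros(K - 1, numFrames);
    for i = 1:numFrames
        frame = s((i-1)*M + 1 : (i-1)*M + N) .* hamming(N);   % frame blocking + windowing
        ps = abs(fft(frame)).^2;                              % power spectrum of the frame
        ms = fb * ps(1:size(fb, 2));                          % mel spectrum (triangular filters)
        cc = dct(log(ms));                                    % DCT of the log mel spectrum
        c(:, i) = cc(2:end);                                  % drop the first coefficient
    end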

Question 6: Compute and plot the spectrum of a speech file before and after the mel-frequency wrapping step. Describe and explain the impact of the melfb program.

5.3 Vector Quantization

The result of the last section is that we transform speech signals into vectors in an acoustic space. In this section, we will apply the VQ-based pattern recognition technique to build speaker reference models from those vectors in the training phase, and then to identify any sequence of acoustic vectors uttered by an unknown speaker.

Question 7: To inspect the acoustic space (MFCC vectors) we can pick any two dimensions (say the 5th and the 6th) and plot the data points in a 2D plane. Use the acoustic vectors of two different speakers and plot the data points in two different colors. Do the data regions of the two speakers overlap each other? Are they in clusters?
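
One way to produce such a plot, assuming c1 and c2 are the MFCC matrices of the two speakers with one acoustic vector per column:

    plot(c1(5, :), c1(6, :), 'bo');  hold on       % speaker 1 in dimensions 5 and 6
    plot(c2(5, :), c2(6, :), 'r^');  hold off      % speaker 2 in the same plane
    xlabel('5th MFCC dimension');  ylabel('6th MFCC dimension');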

Now write a Matlab function, vqlbg, that trains a VQ codebook using the LBG algorithm described before. Use the supplied utility function disteu to compute the pairwise Euclidean distances between the codewords and training vectors in the iterative process.
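
A condensed sketch of what vqlbg might look like is given below. It assumes M is a power of two, that the training vectors are stored one per column, and that disteu(x, y) returns pairwise distances with one row per column of x; for brevity, the distortion-threshold test of step 5 is replaced by a fixed number of refinement passes.

    function code = vqlbg(d, M)
    % VQLBG  Sketch: train a codebook of M codewords from training vectors d (one per column).
    e = 0.01;                                 % splitting parameter epsilon
    code = mean(d, 2);                        % step 1: 1-vector codebook = global centroid
    while size(code, 2) < M
        code = [code*(1+e), code*(1-e)];      % step 2: split every codeword
        for pass = 1:10                       % steps 3-5: nearest-neighbor / centroid passes
            z = disteu(d, code);              % distances: training vectors vs. codewords
            [~, idx] = min(z, [], 2);         % nearest codeword for each training vector
            for k = 1:size(code, 2)
                if any(idx == k)
                    code(:, k) = mean(d(:, idx == k), 2);   % step 4: centroid update
                end
            end
        end
    end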

Question 8: Plot the data points of the trained VQ codewords, using the same two dimensions, over the plot from the last question. Compare this with Figure 5.

5.4 Simulation and Evaluation

Now comes the final part! Use the two supplied programs, train and test (which require the two functions mfcc and vqlbg that you just wrote), to simulate the training and testing procedures of the speaker recognition system, respectively.

Question 9: What recognition rate can our system achieve? Compare this with the human performance. For the cases where the system makes errors, re-listen to the speech files and try to come up with some explanations.

Question 10 (optional): You can also test the system with your own speech files. Use the Windows program Sound Recorder to record more voices from yourself and your friends. Each new speaker needs to provide one speech file for training and one for testing. Can the system recognize your voice? Enjoy!


REFERENCES

[1] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.

[2] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978.

[3] S.B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-28, No. 4, August 1980.

[4] Y. Linde, A. Buzo and R. Gray, "An algorithm for vector quantizer design", IEEE Transactions on Communications, Vol. 28, pp. 84-95, 1980.

[5] S. Furui, "Speaker independent isolated word recognition using dynamic features of speech spectrum", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, No. 1, pp. 52-59, February 1986.

[6] S. Furui, "An overview of speaker recognition technology", ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 1-9, 1994.

[7] F.K. Song, A.E. Rosenberg and B.H. Juang, "A vector quantisation approach to speaker recognition", AT&T Technical Journal, Vol. 66, No. 2, pp. 14-26, March 1987.

[8] comp.speech Frequently Asked Questions WWW site, http://svr-www.eng.cam.ac.uk/comp.speech/