
Voiceprint Systems CS691 2012 Fall Batch

Voiceprint System for Biometric Authentication

Geetha Bothe, Rahul Raj, Ravi Ray, Mahesh Sooryambylu, Sreeram Vancheeswaran
Computer Science Department, Pace University

Abstract

Voice-based authentication is one of the forms of biometric authentication. This approach verifies an individual speaker against his/her stored voice pattern. In this project we use a specified phrase common to all users, which is segmented and analyzed using speech processing tools to compare a given voice sample against the corresponding stored sample. The voice samples are converted into a spectrogram, which is then segmented using energy thresholds and compared against a stored voice sample. This paper describes the implementation details of the techniques used: Fast Fourier transformation, Mel frequency filter bank calculation, cepstral normalization, utterance segmentation and dynamic time warping.

1. Introduction

The concept of authentication goes hand in hand with most security-related problems. Authentication means verifying that a user is who he or she claims to be. We can authenticate an identity in three ways: by something the user knows (such as a password or personal identification number), something the user has (a security token or smart card), or something the user is (a physical characteristic, such as a fingerprint, called a biometric). Biometric authentication has been widely regarded as the most foolproof, or at least the hardest to forge or spoof. Since the early 1980s, systems of identification and authentication based on physical characteristics have been available to enterprise IT. These biometric systems were slow, intrusive and expensive, but because they were mainly used for guarding mainframe access or restricting physical entry to relatively few users, they proved workable in some high-security situations. Twenty years later, computers are much faster and cheaper than ever. This, plus new, inexpensive hardware, has renewed interest in biometrics.

Voice recognition is one of the methods of performing biometric authentication. The idea is to verify the individual speaker against a stored voice pattern, not to understand what is being said. Voiceprint systems are biometric authentication systems built on voice recognition.

Voiceprint systems can base user authentication on four possible types of passphrases:

o A user-specified phrase, like the user's name
o A specified phrase common to all users
o A random phrase that the computer displays on the screen
o A random phrase that can vary at the user's discretion.

2. Literature review

As part of the literature review, we searched for relevant studies already done in this area and found the following to be the best matches.

1) Naresh Trilok's M.S. Thesis, 2004 - This thesis works toward establishing the uniqueness of the human voice for security applications via spectral data captured manually from a wave editor tool. We are focusing on achieving the same using energy and zero-crossing calculation methods, automated using plug-ins developed on top of a wave editor tool.

2) Roy Morris, Summer Project, 2012 - This is one of the recent studies on voice-based authentication systems. In this project, utterances in a voice sample are segmented using energy values calculated manually from spectral data. In our project, we use both energy and zero-crossing methods to segment the utterance in an automated way using wave editor plug-ins.

3. Technical Overview

This study will focus primarily on using the specified, specially-designed passphrase, "My Name Is", which is common to all users. The main reasons for using a common authentication phrase are as follows:

It simplifies the segmentation problem. Earlier work indicated that features extracted from the individual phonetic units of an utterance increased authentication performance (Trilok, 2004), and this requires segmenting the utterance into phonetic units. Segmenting one known utterance into its phonetic units is much easier than segmenting many unknown utterances into their phonetic units.


It allows for the careful selection of the common phrase to optimize the variety of phonetic units for their authentication value. The "My Name Is" phrase contains seven phonetic units: three nasal sounds, the two [m]'s and one [n]; three vowel sounds, [aI], [eI], and [I]; and one fricative [z]. The nasal and vowel sounds characterize the user's nasal and vocal tracts, respectively, and the fricative characterizes the user's teeth and front portion of the mouth.

It facilitates testing for imposters since the common utterance spoken by non-authentic users can be employed as imposter utterances.

It permits the measurement of true voice authentication biometric performance. For example, in the area of keystroke biometrics for password authentication, the manufacturers of the password "hardening" systems typically claim performance levels of 99% or higher. However, Killourhy & Maxion (2009) used a common password to show that the best actual keystroke biometric performance for password authentication is only about 90%.

It avoids potential experimental flaws. The combination of using the same authentication utterance for all users, and one that consists of frequently used words that are easy to pronounce, avoids many of the experimental flaws that Maxion (2011) describes in measuring the performance of biometric systems.

A common authentication phrase uttered by a person is built from distinct voice components called phonemes. A phoneme is the smallest contrastive unit in the sound system of a language. Each phoneme has a pitch, cadence, and inflection; these three give each one of us a unique voice sound. The frequency values for a voice sample vary with the individual's vocal tract, including tract length and the ratio of larynx to sinuses, as well as cadence, pitch, tone, frequency range, and duration of voice. Similarity in voices comes from cultural and regional influences in the form of accents. The frequency response generated from an individual's utterance of the common phrase is called a spectrogram. The spectrogram comprises frequency and amplitude plotted over time. Frequency is the number of times per second a wave repeats itself (cycles); amplitude measures the amount of air pressure variation. Once the individual phonemes or words from the common phrase uttered by an individual are identified (segmented from the entire sample), the frequency and amplitude values for each such segment are distinct per individual.

Thus, spectrographic analysis will help us determine the individual phonemes in a phrase as well as distinctly identify the speaker based on these values. Speech processing can be regarded as a special case of digital signal processing (signals are processed in a digital representation) applied to the speech signal. Speech processing tools like Audacity and Speech Analyzer do exactly this: such a tool records the speech sample and plots the spectrogram. Thus, the raw speech data converted to a spectrogram is the starting point for further processing. Hence, our project requires a speech processing tool that can perform spectral analysis on voice samples and provide energy values in a grid of time versus frequency.

Fig-1 the Spectrogram generated in Audacity

Fig-1 is the spectrogram (time in seconds vs. frequency in hertz) generated when a user uttered the phrase 'My Name Is'. Here, the three words 'My', 'Name' and 'Is' can be easily identified by the dark regions; the darkness corresponds to higher energy levels in the speech sample. Apart from plotting the spectrogram, speech processing tools have various added features that are useful for voice-based authentication systems. For instance, Audacity allows the spectrogram data to be exported to the file system.


Utterance Segmentation:- Because the biometric system will operate on the initial "My Name Is" portion of the utterance, that portion must be separated (isolated) from the background noise at the beginning of the recording and from the person's name at the end. The speech portion of the signal will first be extracted from the background noise by thresholding the signal's energy function. Then, the [z] sound in the word "is" will be located to mark the end of the common portion of the utterance, and the signal's energy function will again be used to detect the end of the person's name. Each utterance will therefore be segmented into the common portion, "My Name Is", and the person's-name portion of the utterance.

Feature Extraction:- Features (data measurements) will be designed and code written to extract them from each sample utterance. The output of the feature extractor will be a fixed-length vector of measurements appropriate for input to the Pace University biometric authentication system. Several feature sets will be explored, and all features will be normalized over the varying lengths of the speech utterances. An understanding of phonetics is important to select information that is characteristic to each user as input to the biometric system.

Features will then be extracted from the numeric values of the spectral analysis. For example, one feature set could consist of the means and variances of each of the 13 frequency bands over the entire utterance, for a total of 26 features per utterance. Additional features will be extracted from each of the seven sound regions of the utterance. For example, each utterance could be divided into its 7 speech sounds and the energy in the 13 frequency bands averaged within each of the 7 sounds, for a total of 91 features. The first cepstral component might be omitted because it represents the energy of the signal and is probably not speaker specific.
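As a concrete illustration of the 26-feature set described above (a sketch of our own, not the project's exact code), the following computes the mean and variance of each Mel band over a two-dimensional energy array laid out as frames by bands:

// Illustrative sketch: 26 features (mean and variance of each of the 13 Mel bands
// over all frames). energies[frame][band] is assumed to be the feature vector
// array produced by Modules 1 and 2 described later.
double[] meanVarianceFeatures(double[][] energies) {
    int bands = 13;
    int frames = energies.length;
    double[] features = new double[2 * bands];
    for (int b = 0; b < bands; b++) {
        double sum = 0.0;
        for (int f = 0; f < frames; f++) sum += energies[f][b];
        double mean = sum / frames;
        double sqDev = 0.0;
        for (int f = 0; f < frames; f++) {
            double d = energies[f][b] - mean;
            sqDev += d * d;
        }
        features[b] = mean;                    // features 1-13: band means
        features[bands + b] = sqDev / frames;  // features 14-26: band variances
    }
    return features;
}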

4. Tools/Libraries

The following tools/libraries were used for implementing the project.

Audacity
Audacity is open source audio editing software which is used for the voice recording and processing in this project. Audacity is available on all major platforms, including Windows, Mac and Linux. Audacity was chosen for this project because it supports sound editing and audio spectrum analysis for sound processing.

Sound samples can be imported into Audacity using the menu File →Import →Audio or can be recorded using the provided audio control buttons. Audacity displays the sound in a two dimensional waveform.

Fig-2 Time Waveform

The spectrogram view can be loaded from the context menu Audio Track → Spectrum.

Fig-3 Time Spectrogram

The sound samples are recorded in a single channel (mono) using Audacity. Using Audacity it is possible to convert a stereo sound to mono using the menu Tracks → Stereo Track to Mono. The sampling rate used in this project is 44,100 samples per second. It is also possible to convert sound recorded at other sampling rates to 44,100 using the Audacity menu item Tracks → Resample.

Spectral Analysis using JTransforms
JTransforms, benchmarked as the fastest Java FFT library, was identified as the speech processing library. JTransforms is the first open source, multithreaded FFT library written in pure Java. It supports representing a voice sample as arrays of frequency and amplitude pairs for every time frame, which is achieved through discrete Fourier transformations.
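As an illustration of how JTransforms can be applied to a single frame (a sketch under our own assumptions, not the project's exact code; the class is DoubleFFT_1D, found in the org.jtransforms.fft package in recent releases and edu.emory.mathcs.jtransforms.fft in older ones):

import org.jtransforms.fft.DoubleFFT_1D;

// Illustrative sketch: transform one frame of samples and derive per-bin magnitudes.
// realForward() works in place and packs the Re/Im pairs into the same array.
double[] frameMagnitudes(double[] frame) {            // e.g. 1024 samples
    double[] buffer = frame.clone();
    new DoubleFFT_1D(buffer.length).realForward(buffer);
    double[] magnitude = new double[buffer.length / 2];
    magnitude[0] = Math.abs(buffer[0]);                // DC component
    for (int k = 1; k < buffer.length / 2; k++) {
        double re = buffer[2 * k];
        double im = buffer[2 * k + 1];
        magnitude[k] = Math.sqrt(re * re + im * im);   // amplitude per frequency bin
    }
    return magnitude;
}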

Fig-4 Spectrogram

Fig-4 represents the spectrogram plotted for the same voice sample as in Fig-2. The spectrogram is constructed using the energy values obtained for the 13 Mel frequency bands through discrete Fourier transformation of the voice sample using the JTransforms library.

5. Implementation

The implementation can be logically grouped into four modules, as follows.

Fig-5 Pictorial representation of the different modules in the implementation

Module 1 – Sound Processing and Spectrogram Generation:- This procedure involves the steps shown diagrammatically in Fig-8. The steps, numbered 1-6, are explained below. Step-1 shows a voice sample in wav format (mono, 44,100 samples per second) read as input for processing. A waveform visually represents an audio signal and is made of samples, each a discrete value representing the audio at one point in time. Digital audio is sampled at discrete points; the samples can be seen by zooming in on the waveform representation in Audacity. The sample rate is measured in Hz and represents the number of digital samples captured per second in order to represent the waveform.

Fig-6 Samples from a digital audio waveform

Step-2 represents the input stream collected into frames containing 1024 samples. Each subsequent frame slides the stream forward by 512 bytes, thereby creating an overlap between two adjacent frames, as shown in the diagram below.

Fig -7: Sound frames created with an overlap of 512 bytes
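The framing step can be sketched as follows (an illustration only; it uses a hop of 512 samples, whereas the text above states the slide in bytes):

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: cut the sample stream into 1024-sample frames with a
// 512-sample hop, so adjacent frames overlap by half.
List<double[]> splitIntoFrames(double[] samples, int frameSize, int hop) {  // e.g. 1024, 512
    List<double[]> frames = new ArrayList<double[]>();
    for (int start = 0; start + frameSize <= samples.length; start += hop) {
        double[] frame = new double[frameSize];
        System.arraycopy(samples, start, frame, 0, frameSize);
        frames.add(frame);
    }
    return frames;
}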


Fig-8 Pictorial representation of Spectrogram generation


Step-3 shows the samples that are part of a frame. Since there are 1024 samples per frame and 44,100 samples per second, the frame rate is 44100/1024 ≈ 43.06 frames per second, so one frame is approximately 23 milliseconds long (1000 milliseconds / 43.06).

Step-4 shows the component frequencies of a frame after applying the Fourier transformation to a frame of samples. After the samples are converted into a Fourier series, the signal can be viewed in the frequency domain as opposed to the time domain. This view of the wave in terms of the fundamental frequencies that are present is called the signal's spectrum. The spectrum allows sound modification by amplification or attenuation of specific frequency bands. A spectrum can be plotted by mapping spectral amplitudes to grey levels (or colorizing); higher amplitudes show as darker spectral regions.

Step-5 represents the complete spectral data available for processing.

Step-6 shows the spectrum constructed from these values.

Module 2 – Building the Mel-Frequency Filter Bank

This module builds Mel-Frequency Filter Banks using the frequency components produced in Module 1. Mel-Frequency Cepstral coefficients capture robust characteristics of the spectral envelope of audio signals.

For each of the frames captured from the sound clip,

1. Calculate the fixed number of Mel frequency filter banks according to the Mel scale.
2. Filter the input spectrum through the bank of Mel filters.
3. Calculate the sum of the contents of each band.
4. Take the logarithm of each sum.
5. Compute the DCT of the logarithm values.
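Steps 4 and 5 can be illustrated with a direct DCT-II loop applied to the logarithms of the band sums (a sketch of the standard computation, not necessarily the exact implementation used in the project):

// Illustrative sketch: log of each Mel band sum followed by a direct DCT-II,
// giving the cepstral coefficients for one frame.
double[] cepstralCoefficients(double[] bandSums) {     // 13 Mel band sums
    int n = bandSums.length;
    double[] logSums = new double[n];
    for (int i = 0; i < n; i++) logSums[i] = Math.log(bandSums[i]);

    double[] cepstrum = new double[n];
    for (int k = 0; k < n; k++) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += logSums[i] * Math.cos(Math.PI * k * (i + 0.5) / n);
        cepstrum[k] = sum;
    }
    return cepstrum;
}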

Mel Frequency Filter Bank Calculation: - Each Mel band will be created with the shape of a triangle. The triangles overlap to cover the whole range specified in the input spectrum. The center of a triangle will be the last edge of the preceding triangle and the first edge of the succeeding triangle. This is represented in the picture shown below.

Fig-9 Mel-frequency triangles

The distance at the base from the center to the left edge is different from the distance at the base from the center to the right edge. The center frequencies are based on the Mel frequency scale, a non-linear scale that models non-linear human hearing. Mel filtering has more filters available for the lower frequencies and thus emphasizes the lower frequencies, as shown in Fig-9.

The Mel frequency is calculated from the linear frequency as follows:

melFreq(freq) = 2595.0 * log10(1.0 + freq / 700.0)

The linear frequency is calculated from the Mel frequency as follows:

linFreq(mel) = 700.0 * (pow(10.0, mel / 2595.0) - 1.0)

Mel Band Calculation: -

1. Calculate the Mel frequency of the lowest frequency.
2. Calculate the Mel frequency of the maximum frequency.
3. Calculate the width of a band = (max - min) / number of bands.
4. For each of the required number of bands:
   i. Calculate the center of the triangle = index * width.
   ii. Convert the center of the triangle to linear frequency.
   iii. Assign the center of the current triangle as the last edge of the previous triangle.
   iv. Assign the center of the current triangle as the first edge of the succeeding triangle.
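A sketch of this band layout follows (our illustration; variable names are ours). The centres are spaced evenly on the Mel scale using the melFreq/linFreq conversions above, and each centre doubles as an edge of its neighbours:

// Illustrative sketch: centre frequencies of the triangular Mel bands. Each centre
// is also the upper edge of the previous band and the lower edge of the next one.
double melFreq(double freq) { return 2595.0 * Math.log10(1.0 + freq / 700.0); }
double linFreq(double mel)  { return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0); }

double[] melBandCenters(double minFreq, double maxFreq, int numBands) {  // e.g. numBands = 13
    double melMin = melFreq(minFreq);
    double width  = (melFreq(maxFreq) - melMin) / numBands;   // band width on the Mel scale
    double[] centers = new double[numBands];
    for (int index = 1; index <= numBands; index++) {
        // with a minimum of 0 Hz this reduces to index * width, as in step 4.i above
        centers[index - 1] = linFreq(melMin + index * width);
    }
    return centers;
}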


Mel Frequency Calculation: -

1. For each of the Mel frequency bands:
   i. Calculate the height of the center, assuming a constant area for the triangle.
   ii. Calculate the slope of the left edge and right edge of the triangle.
   iii. Calculate weights for the selected frequencies based on the identified height.
   iv. Collect the sum of the weighted frequency values.
2. Return the collection.
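A sketch of the weighting step (our own illustration; the project's exact height and area convention is described only in outline above): every spectrum bin that falls inside a triangle is weighted by the triangle's height at that frequency, and the weighted values are summed.

// Illustrative sketch: weight and sum the spectrum bins inside one triangular Mel
// filter. binFreq[k] is the linear frequency of spectrum bin k; the triangle runs
// from leftEdge through center to rightEdge with peak height "height".
double melBandEnergy(double[] spectrum, double[] binFreq,
                     double leftEdge, double center, double rightEdge, double height) {
    double sum = 0.0;
    for (int k = 0; k < spectrum.length; k++) {
        double f = binFreq[k];
        double weight;
        if (f > leftEdge && f <= center) {
            weight = height * (f - leftEdge) / (center - leftEdge);    // rising slope
        } else if (f > center && f < rightEdge) {
            weight = height * (rightEdge - f) / (rightEdge - center);  // falling slope
        } else {
            weight = 0.0;                                              // outside the triangle
        }
        sum += weight * spectrum[k];
    }
    return sum;
}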

With the above calculations, for every frame, the energy values for each of the 13 Mel frequency bands can be identified, as pictorially represented below.

Fig-10 Dissected view of a Fourier-transformed frame

Fig-10 shows multiple waves generated after applying the Discrete Fourier Transformation to the input wave. The axes represent time, amplitude and frequency. Each frame of the input wave is input to the Mel filter. A frame of data contains samples collected from each wave at similar indices and represents energy values collected across multiple frequencies. The red lines indicate the energy samples mapped onto the Mel frequency filters calculated across the frequency range. The Mel frequency filters generate 13 frequency samples per frame, thus producing a two-dimensional array of frequency values, which we refer to as the two-dimensional feature vector array in the rest of this document. The two-dimensional array has the following format:

Timeframe   Mel1       Mel2       ...   Mel13
t1          6.70E-04   9.58E-04   ...   9.84E-04
t2          6.75E-04   9.64E-04   ...   9.90E-04
t3          6.79E-04   9.70E-04   ...   9.96E-04
...
tn          6.83E-04   9.75E-04   ...   0.001000954

Cepstral Mean Normalization:- Since some users speak louder than others, we have to normalize the volume (loudness) of the utterances. The mean energy in a band is normalized by the total energy of the utterance. The energy values across the thirteen Mel frequency bands for all the frames obtained above are normalized accordingly, where N is the total number of frames in an utterance and x[n] is the energy value of a particular Mel frequency band n.
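The normalization formula itself is not reproduced here; one plausible reading of the description above, sketched purely as an assumption on our part, is to divide every band energy by the total energy of the utterance:

// Assumed reading of the normalization described above (an illustration, not the
// project's verified formula): divide every band energy by the total energy summed
// over all N frames and all 13 bands of the utterance.
double[][] normalizeEnergies(double[][] energies) {        // energies[frame][band]
    double total = 0.0;
    for (double[] frame : energies)
        for (double e : frame) total += e;
    double[][] normalized = new double[energies.length][];
    for (int f = 0; f < energies.length; f++) {
        normalized[f] = new double[energies[f].length];
        for (int b = 0; b < energies[f].length; b++)
            normalized[f][b] = energies[f][b] / total;      // removes loudness differences
    }
    return normalized;
}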

Module 3 - Utterance Segmentation:- In this module the common phrase "My Name Is" is segmented out of the voice sample for further processing, i.e., the frame numbers (indices) of the start and end frames of "My Name Is" are computed. This is accomplished through a two-step process.

Step-1 - The start of "My Name Is" is found by thresholding energy values using the algorithm of L.R. Rabiner and M.R. Sambur [17]. Energy is defined as the sum of magnitudes over 10 ms of sound centered on the interval, expressed by the formula:

E(n) = ∑ (i = -50 to 50) |s(n + i)|

1) Module 2 generates the energy data for each of the 13 Mel frequency bands for all the frames in the voice sample. This energy data is normalized using cepstral normalization and converted into the feature vector array discussed towards the end of Module 2.

2) Calculate the energy thresholds (Energy_Upper and Energy_Lower) using the formulae below, where EMAX represents the maximum energy and EMIN represents the silence (minimum) energy:

Calc1 = 0.03 * (EMAX - EMIN) + EMIN
Calc2 = 4 * EMIN
Energy_Lower = min(Calc1, Calc2)
Energy_Upper = 7.25 * Energy_Lower

3) Incrementally search the sample for energy greater than EMIN and save its time-stamp, say "s".

4) Incrementally resume the search from "s":
   if energy greater than EMAX is found,
       assign the time-stamp of the first occurrence to "start";
   else if energy falling below EMIN is found,
       break.
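Steps 2-4 can be sketched as follows (an illustration over a per-frame energy array; names are ours, and the EMIN/EMAX comparisons of steps 3-4 are read here as the Energy_Lower/Energy_Upper thresholds, the interpretation suggested by Fig-12 and by Rabiner & Sambur [17]):

// Illustrative sketch of steps 2-4. energy[f] is a per-frame energy value derived
// from the normalized feature vectors.
int findStartFrame(double[] energy) {
    double eMax = Double.NEGATIVE_INFINITY, eMin = Double.POSITIVE_INFINITY;
    for (double e : energy) { eMax = Math.max(eMax, e); eMin = Math.min(eMin, e); }

    // Step 2: energy thresholds
    double calc1 = 0.03 * (eMax - eMin) + eMin;
    double calc2 = 4.0 * eMin;
    double energyLower = Math.min(calc1, calc2);
    double energyUpper = 7.25 * energyLower;

    // Step 3: first frame rising above the lower (silence) threshold
    int s = 0;
    while (s < energy.length && energy[s] <= energyLower) s++;

    // Step 4: from "s", the first frame above the upper threshold becomes "start";
    // give up if the energy falls back below the lower threshold first.
    for (int f = s; f < energy.length; f++) {
        if (energy[f] > energyUpper) return f;
        if (energy[f] < energyLower) break;
    }
    return -1;    // start not found
}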

Step-2 - The end of the utterance "My Name Is" is calculated by locating the [z] sound at the end of "is". Locating the [z] sound is done by thresholding the ratio of the energy in the high frequency bands (sum the energy in the bands over about 5 kHz) to that in the low frequency bands (sum those under 5 kHz).

1) Create a single-dimensional array consisting of the above ratio calculated for every frame in the voice sample, say the "ampRatio" array.
2) Incrementally search ampRatio from "start" (identified in Step-1) for a ratio value greater than Energy_Upper (calculated in Step-1) and save its time-stamp as "end".
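In the same illustrative style (names and the band split are our assumptions; the text above compares the ratio against Energy_Upper, which the sketch follows):

// Illustrative sketch of Step-2: ratio of high-band to low-band energy per frame,
// then the incremental search for the end of the common phrase. splitBand is the
// index of the first Mel band counted as "high" (roughly the bands above 5 kHz).
int findEndFrame(double[][] energies, int start, int splitBand, double energyUpper) {
    double[] ampRatio = new double[energies.length];
    for (int f = 0; f < energies.length; f++) {
        double low = 0.0, high = 0.0;
        for (int b = 0; b < energies[f].length; b++) {
            if (b < splitBand) low += energies[f][b];
            else               high += energies[f][b];
        }
        ampRatio[f] = (low > 0.0) ? high / low : 0.0;   // large for a fricative like [z]
    }
    for (int f = start; f < ampRatio.length; f++) {
        if (ampRatio[f] > energyUpper) return f;        // "end" of "My Name Is"
    }
    return -1;
}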

This algorithm addresses the following potential problems in segmenting the utterances.

o Noisy computer room with background noise
o Weak fricatives: /f, th, h/
o Weak plosive bursts: /p, t, k/
o Final nasals
o Voiced fricatives becoming devoiced
o Trailing off of sounds (e.g., binary, three)
o Simple, efficient processing
o Avoid hardware costs

Testing of the segmentation output is done using two methods. In method 1, an Excel utility was developed to plot the amplitude vs. time values of the whole voice sample in a graph, which was then used to mark the start and end positions returned by the utterance segmentation module and verify whether they were correct. Fig-11 shows one such example.

Method 2 was to play the segmented portion of the voice sample and verify audibly that the segmentation was done properly.

Fig-11 Amplitude vs. time plotted in the Excel utility

Module 4 - Dynamic Time Warping Algorithm (DTW):

Dynamic Time Warping (DTW) is an algorithm for measuring similarity between two sequences which may vary in time or speed. The sequences are "warped" non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension. Thus, it aligns the corresponding energy values of two voice samples with respect to time, based on the similarity of these values across the corresponding frequency bands in both samples.

Consider an input voice sample (an utterance of the common phrase by a person) and its comparison with the stored voice sample of the same person for the same phrase. In the stored sample this person uttered the phrase 'My Name Is' as follows (the line below indicates the utterance of the speech sample with respect to time frames):

[m] [m] [m] [I] [I] [I] [n] [n] [eI] [eI] [eI] [eI] [eI] [eI] [m] [i] [i] [i] [z]
t0  t1  t2  t3  t4  t5  t6  t7  t8   t9   t10  t11  t12  t13  t14 t15 t16 t17 t18

But in the input voice, the same person utters the same phrase in a slightly different way, as follows (the line below indicates the utterance of the input voice with respect to time frames):

[m] [m] [m] [m] [I] [I] [n] [n] [eI] [eI] [eI] [eI] [eI] [eI] [eI] [m] [i] [i] [i] [z]
t0  t1  t2  t3  t4  t5  t6  t7  t8   t9   t10  t11  t12  t13  t14  t15 t16 t17 t18 t19

If we perform the comparison of the stored sample and the input voice sample by matching energy values over time, the comparison will fail (even though both are uttered by the same person), because the energy values will be different for the corresponding time frames across different voice samples. The DTW algorithm is therefore used to compare the two voice samples.

The DTW Algorithm:

Assume “Sample Template Array” contains the energy data for the stored voice sample and “Input Value Array” contains the energy data for the input voice sample.

// Invocation (sampleTemplate and inputValues hold one 13-element feature vector per frame):
//     List<String> warpPath = dtwDistance(sampleTemplate, inputValues);
// Requires java.util.List and java.util.ArrayList; getDiff is the code snippet given below.

List<String> dtwDistance(double[][] s, double[][] t) {
    int n = s.length;
    int m = t.length;
    double[][] dtw = new double[n + 1][m + 1];
    List<String> path = new ArrayList<String>();

    for (int i = 1; i <= m; i++) dtw[0][i] = Double.POSITIVE_INFINITY;
    for (int i = 1; i <= n; i++) dtw[i][0] = Double.POSITIVE_INFINITY;
    dtw[0][0] = 0.0;

    // Fill the DTW matrix: each cell holds the frame-to-frame distance plus the
    // smallest of the three adjacent predecessor cells.
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            double cost = getDiff(s[i - 1], t[j - 1]);        // Euclidean frame distance
            dtw[i][j] = cost + Math.min(dtw[i - 1][j],        // insertion
                               Math.min(dtw[i][j - 1],        // deletion
                                        dtw[i - 1][j - 1]));  // match
        }
    }
    double difference = dtw[n][m];   // overall difference between the two samples

    // Trace the warp path back from the end of the matrix, stepping each time to the
    // adjacent cell (i-1, j), (i, j-1) or (i-1, j-1) with the smallest accumulated value.
    int i = n - 1;
    int j = m - 1;
    while (i > 1 && j > 1) {
        path.add("\"" + i + " " + j + "\",");
        if (dtw[i - 1][j] < dtw[i][j - 1] && dtw[i - 1][j] < dtw[i - 1][j - 1]) {
            i = i - 1;
        } else if (dtw[i][j - 1] < dtw[i - 1][j - 1] && dtw[i][j - 1] < dtw[i - 1][j]) {
            j = j - 1;
        } else {
            i = i - 1;
            j = j - 1;
        }
    }
    return path;
}

Implementation details of the algorithm:

The stored voice sample and the input voice sample are converted into feature vectors, i.e., two-dimensional arrays consisting of an array of energy values per time frame. This is achieved by executing Module 1 and Module 2 on these voice samples.

For the stored voice sample, the frame numbers (indices of the feature vector array) of the seven phonemes are marked manually. Module 3 is executed on the input voice sample to get the frame numbers (indices of the feature vector array) of the utterance "My Name Is": when the input voice sample runs through Module 3, the beginning of the phrase (the location of [m]) as well as the end of the phrase (the location of [z]) are identified.

Comparing the stored voice sample with the input voice sample is achieved via the pair-wise comparison of the two dimensional feature vectors belonging to each of the voice samples. The algorithm is divided into two parts - DTW Matrix generation and DTW Warp path generation.

DTW Matrix generation: In this part, the DTW matrix is generated with the x-axis representing the frames of the input voice sample and the y-axis representing the frames of the stored voice sample. Each feature vector (each column of the 2-D array mentioned above) represents a frame in the wav file and contains thirteen energy values across the corresponding Mel frequency bands. The difference between two such single-dimensional arrays, one belonging to a frame of the input sample and the other to a frame of the stored voice sample, is calculated as the Euclidean distance between two arrays of the same length (the length being the number of Mel frequency bands).

Code snippet:-


double getDiff(double[] vector1, double[] vector2) {
    double sqSum = 0.0;
    for (int x = 0; x < vector1.length; x++)
        sqSum += Math.pow(vector1[x] - vector2[x], 2.0);
    return Math.sqrt(sqSum);
}

In the above snippet, the square of the difference between the two energy values (one belonging to the input file and the other to the stored voice sample) is computed for each frequency band. These values are summed over the thirteen frequency bands, and the square root of the sum is taken as the difference between the corresponding frames of the two wav files. The difference between each frame of the input file and each frame of the stored voice sample is stored in the DTW matrix.

DTW Warp path generation: In this part of the algorithm, we compute the warp path from the above-mentioned DTW matrix. The warp path starts at (0, 0); in this pair, the first zero represents the zeroth frame of the stored voice sample and the second zero represents the zeroth frame of the input voice. The path then moves up by one row and one column, finds the minimum of the adjacent cells, and takes the cell with the minimum value (of difference) in the matrix as the next element of the warp path. In the context of the DTW algorithm, the adjacent cells of cell (i, j) are, by definition, cells (i-1, j-1), (i-1, j) and (i, j-1). Thus, in each step of the warp path generation, the path is extended by visiting the smallest of the values among the adjacent cells of the current cell in the matrix.

Thus, in the DTW module, with the manually marked 7 values of the stored voice sample, we can identify the corresponding entries (corresponding frame numbers) in the input voice file using the warp path output.
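The final mapping step can be sketched as follows (our illustration; it assumes the warp path has been parsed into (stored frame, input frame) integer pairs like those listed in Section 6):

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: given the warp path as (storedFrame, inputFrame) pairs and the
// manually marked frame range of one phoneme in the stored sample, collect the
// corresponding frames of the input sample.
List<Integer> mapPhonemeToInput(int[][] warpPath, int storedFirst, int storedLast) {
    List<Integer> inputFrames = new ArrayList<Integer>();
    for (int[] pair : warpPath) {          // pair[0] = stored frame, pair[1] = input frame
        if (pair[0] >= storedFirst && pair[0] <= storedLast) {
            inputFrames.add(pair[1]);
        }
    }
    return inputFrames;
}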

6. Feature Vectors

The following are the feature vectors extracted from a voice sample.

Mel frequency bands:- The following are the Mel frequency bands identified using the algorithm explained in Module 2.

Band    Lower edge (Hz)   Center (Hz)   Upper edge (Hz)
Mel1    43.06641          258.3984      516.7969
Mel2    258.3984          516.7969      861.3281
Mel3    516.7969          861.3281      1291.992
Mel4    861.3281          1291.992      1808.789
Mel5    1291.992          1808.789      2497.852
Mel6    1808.789          2497.852      3402.246
Mel7    2497.852          3402.246      4565.039
Mel8    3402.246          4565.039      5986.23
Mel9    4565.039          5986.23       7838.086
Mel10   5986.23           7838.086      10206.74
Mel11   7838.086          10206.74      13221.39
Mel12   10206.74          13221.39      17097.36
Mel13   13221.39          17097.36      22006.93

Means of the energy in each of the 13 frequency bands over the entire utterance:- The means (averages) of the energy in each of the 13 frequency bands over the entire utterance are the average of the energies in the 1st band over all the time frames in the utterance, the average of the energies in the 2nd band over all the time frames in the utterance, up to the average of the energies in the 13th band over all the time frames in the utterance. The following represents the same for one of the voice samples.

0.001311530983234178,0.001873642881061818,0.0019241499370171089,0.0019855898578579434,0.0020285235923588386,0.0020259633982432487,0.0019617447242892927,0.0018566023310857796,0.001741814040890228,0.0015585439712053949,0.0012942419943928502,9.38535824396694E-4,4.970333828792828E-4

Variance of the energy in each of the 13 frequency bands over the entire utterance:- The variances of the energy in each of the 13 frequency bands over the entire utterance are the variance of the energies in the 1st band over all the time frames in the utterance, the variance of the energies in the 2nd band over all the time frames in the utterance, up to the variance of the energies in the 13th band over all the time frames in the utterance. The following represents the same for one of the voice samples.

1.1366486634175697E-6,2.286794989336336E-6,2.3047033489611723E-6,2.2974703169422097E-6,2.248687780596088E-6,2.150692492465514E-6,2.002962553425179E-6,1.7151889383725914E-6,1.2385835444677442E-6,8.34532286199922E-7,5.278041392084494E-7,2.896128871009206E-7,9.039157265220525E-8

Means of Energy values for each of the phonemes across the 13 frequency Bands:- This represents the average of energy values for the frames representing a particular phoneme, across the 13 frequency bands. The following represents the same for one of the voice samples.

[m]0.0013972420279314533,0.0019839626541069424,0.0019917590618294886,0.0019652068201489467,0.0018763200207494596,0.001712654322650207,0.0014739191769165653,0.001287032667793283,0.0012921734778294312,0.0012476567319822146,0.001115619702150432,8.597318161915143E-4,4.731870524317264E-4

[I]5.781191318994022E-4,8.292674717395435E-4,8.62553133222416E-4,9.131546016873072E-4,9.698349442348719E-4,0.0010120143935706902,0.001017800648944875,9.715152999262889E-4,8.764347506869316E-4,7.406915128134425E-4,5.836284190783692E-4,4.1300964976330825E-4,2.2004241741992116E-4

[n]0.001121303878344767,0.0015905091897366825,0.001595812989688217,0.0015856473963210631,0.0015458332574145325,0.0014743386922617418,0.0013828248726850778,0.0012601990834280687,0.0010679417311732769,7.763529523037199E-4,4.3478918073690926E-4,1.531282553329901E-4,1.726926861328644E-5

[a]0.001239777436049448,0.001774818534505508,0.001834190418327238,0.0019162579785578081,0.0019904481511436776,0.0020127944671453476,0.0019504118224886173,0.0018418864670861853,0.0017458112977919399,0.0015694378286064416,0.0013003372256974327,9.344269196361214E-4,4.89230166263696E-4

[m]1.7128214037841856E-5,1.1457207781984634E-5,2.278976222100837E-5,7.348890251033494E-5,1.3830813772290466E-4,2.1192448193083264E-4,2.8197367149024853E-4,3.393211490497537E-4,4.054302527442985E-4,4.937672857099137E-4,5.698699335443966E-4,5.444622426356042E-4,3.444606790879088E-4

[I]0.0018664589756396425,0.0026732142001958032,0.0027574370751781555,0.0028532023613564625,0.0029200318197828766,0.0029289220707573887,0.002858263521490533,0.0026862090728245074,0.0023948871582478485,0.0019840672624308604,0.0014951751438568716,9.836473153712183E-4,4.840702714917795E-4

[z]5.789435514080292E-4,8.074036581859748E-4,7.72092134211925E-4,7.026136154060468E-4,5.760944347754057E-4,3.589499336297782E-4,2.620999563269816E-5,3.9226967242426004E-4,7.965042720202917E-4,0.0010475939418523488,0.0010685104697961663,8.625149724393531E-4,4.8007674640999725E-4

Manual marking of phonemes:- This represents the manual marking of the frame numbers (indices of the two-dimensional array) which represent each of the phonemes. The following is the manual marking of phonemes in voice sample 1 (the stored voice sample).

[m]  180 - 182
[I]  183 - 185
[n]  186 - 187
[a]  188 - 193
[m]  194
[i]  195 - 197
[z]  198

Utterance segmentation:- This represents the segmentation of the utterance portion "My Name Is" from the voice sample, i.e., calculating the frame numbers (indices of the two-dimensional array) for the beginning and end of the phrase. The following are the frame numbers for the segmentation in voice sample 2 (the input voice sample).

Start = 386
End = 405


Fig-12 Max of Normalized Cepstral values plotted against time, in which the “start” is identified using the Energy_Upper threshold

Fig-13 Ratio of sum of energy in higher bands Vs sum of energy in lower bands plotted against time.

Warp path calculation:- This represents the warp path calculated between the input voice sample and the stored voice sample. The following represents the warp path for the comparison between voice sample 2 (the input voice sample) and voice sample 1 (the stored voice sample) mentioned in the above two feature vectors, respectively.

[(0,0),(0,1),(1,2),(2,3),(3,4),(4,5),(5,5),(6,5),(7,6),(7,7),(8,8),(9,9),(10,10),(11,11),(12,12),(13,13),(14,14),(15,15),(16,16),(16,17),(17,18)]

Fig-14 is a pictorial representation of the above warp path.

Fig-14 Warped Path

Automatic marking of the input sample:- The following are the markings of all the frames of "My Name Is" in the input voice sample against the stored voice sample:

[(180,386),(180,387),(181,388),(182,389),(183,390),(184,391),(185,391),(186,391),(187,392),(187,393),(188,394),(189,395),(190,396),(191,397),(192,398),(193,399),(194,400),(195,401),(196,402),(196,403),(197,404)]

7. References

[1] Maxion, Roy A. (2011). Making Experiments Dependable. The Next Wave / NSA Magazine, Vol. 19, No. 1, pp. 13-22, March 2012, National Security Agency, Ft. Meade, Maryland. Reprinted from Dependable and Historic Computing, LNCS 6875, pp. 344-357, Springer, Berlin, 2011.
[2] Maxion, Roy A. and Killourhy, Kevin S. (2010). Keystroke Biometrics with Number-Pad Input. In IEEE/IFIP International Conference on Dependable Systems & Networks (DSN-10), pp. 201-210, Chicago, Illinois, 28 June to 01 July 2010. IEEE Computer Society Press, Los Alamitos, California, 2010.
[3] Killourhy, Kevin S. and Maxion, Roy A. (2009). Comparing Anomaly-Detection Algorithms for Keystroke Dynamics. In International Conference on Dependable Systems & Networks (DSN-09), pp. 125-134, Estoril, Lisbon, Portugal, 29 June to 02 July 2009. IEEE Computer Society Press, Los Alamitos, California, 2009.
[4] Cha, S. & Srihari, S.N. (2000). Writer Identification: Statistical Analysis and Dichotomizer. Proc. SPR and SSPR 2000, LNCS - Advances in Pattern Recognition, v. 1876, pp. 123-132.
[5] Yoon, S., Choi, S-S., Cha, S-H., Lee, Y., & Tappert, C.C. (2005). On the Individuality of the Iris Biometric. Int. J. Graphics, Vision & Image Processing, 5(5), pp. 63-70.
[6] Zack, R.S., Tappert, C.C. & Cha, S.-H. (2010). Performance of a Long-Text-Input Keystroke Biometric Authentication System Using an Improved k-Nearest-Neighbor Classification Method. Proc. IEEE 4th Int. Conf. Biometrics: Theory, Applications, and Systems (BTAS 2010), Washington, D.C.
[7] Tappert, C.C., Cha, S.-H., Villani, M., & Zack, R.S. (2010). A Keystroke Biometric System for Long-Text Input. Invited paper, Int. J. Information Security and Privacy (IJISP), Vol. 4, No. 1, 2010, pp. 32-60.
[8] Stewart, J.C., Monaco, J.V., Cha, S., and Tappert, C.C. (2011). An Investigation of Keystroke and Stylometry Traits. Proc. Int. Joint Conf. Biometrics (IJCB 2011), Washington, D.C., October 2011.
[9] Roy Morris, Summer Project, 2012.
[10] Naresh Trilok, M.S. Thesis, 2004.
[11] NSF Grant Proposal, 2012.
[12] Stifelman, Lisa J., MIT Media Laboratory. A Discourse Analysis Approach to Structured Speech.
[13] Gold, Kevin ([email protected]) and Scassellati, Brian ([email protected]). Audio Speech Segmentation Without Language-Specific Knowledge.
[14] Wang, Meng, Bitzer, Donald, McAllister, David, Rodman, Robert, and Taylor, Jamie. An Algorithm for V/UV/S Segmentation of Speech.
[15] Waheed, Khurram, Weaver, Kim, and Salam, Fathi M. A Robust Algorithm for Detecting Speech Segments Using an Entropic Contrast.
[16] Shami, Mohammad T. and Kamel, Mohamed S. Segment-Based Approach to the Recognition of Emotions in Speech.
[17] Rabiner, L.R. and Sambur, M.R. An Algorithm for Determining the Endpoints of Isolated Utterances.
[18] Lokhande, Nitin N., Nehe, Navnath S., and Vikhe, Pratap S. Voice Activity Detection Algorithm for Speech Recognition Applications.
[19] Maier-Hein, Lena. Speech Recognition Using Surface Electromyography.
[20] https://sites.google.com/site/piotrwendykier/software/jtransforms
[21] http://www.computerworld.com
[22] http://www.web2.utc.edu
[23] http://www.wikipedia.org/
[24] http://audacity.sourceforge.net/
[25] http://csis.pace.edu/~ctappert/it691-12fall/default.htm
[26] http://research.microsoft.com/pubs/78423/darpa93.pdf
