nlp (fall 2013): audio processing, zero crossing rate, dynamic time warping, spoken word recognition
TRANSCRIPT
Natural Language Processing
Audio Processing, Zero Crossing Rate,
Dynamic Time Warping, Spoken Word
Recognition
Vladimir Kulyukin
www.vkedco.blogspot.com
Outline
Audio Processing
Zero Crossing Rate
Dynamic Time Warping
Spoken Word Recognition
Audio Processing
Samples
Samples are successive snapshots of a signal
Audio files are samples of sound waves
Microphones convert acoustic signals into analog electrical signals, and then analog-to-digital converters transform the analog signals into digital samples
Digital Audio Signal
(figure: a digital audio signal plotted as sound pressure over time)
Amplitude
Amplitude (in audio processing) is a measure of sound pressure
Amplitude is measured at a specific rate
Amplitude measurements result in digital samples
Some samples have positive values; some have negative values
Digital Approximation Accuracy
Any digitization of analog signals carries some inaccuracy
Approximation accuracy depends on two factors: 1) sampling rate and 2) resolution
In audio processing, sampling is the reduction of a continuous signal to a discrete signal
Sampling rate is the number of samples per unit of time
Resolution is the size of a sample (e.g., the number of bits)
Sampling Rate & Resolution
Sampling rate is measured in Hertz (Hz), i.e., in samples per second
For example, if audio is sampled 44100 times per second, its sampling rate is 44100 Hz
Some typical resolutions are 8 bits, 16 bits, and 32 bits
Nyquist-Shannon Sampling Theorem
This theorem states that perfect reconstruction of a signal is possible if the sampling frequency is greater than twice the maximum frequency of the signal being sampled
For example, if a signal has a maximum frequency of 50 Hz, it can, in theory, be reconstructed without aliasing (the effect that makes different signals indistinguishable in their samples) if sampled at a rate above 100 Hz
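A numeric illustration of aliasing (a hypothetical check, not from the slides): at a 100 Hz sampling rate, a 60 Hz sine violates the Nyquist condition (100 < 2 × 60), and its samples coincide, up to sign, with those of a 40 Hz sine, so the two are indistinguishable.

```java
public class AliasingDemo {
    // Sample a sine wave of frequency freqHz at sampleRateHz, taking n samples.
    public static double[] sample(double freqHz, double sampleRateHz, int n) {
        double[] s = new double[n];
        for (int i = 0; i < n; i++) {
            s[i] = Math.sin(2 * Math.PI * freqHz * i / sampleRateHz);
        }
        return s;
    }

    public static void main(String[] args) {
        // A 60 Hz sine sampled at 100 Hz aliases onto a 40 Hz sine
        // (identical sample values up to a sign flip).
        double[] a = sample(60, 100, 8);
        double[] b = sample(40, 100, 8);
        for (int i = 0; i < a.length; i++) {
            System.out.printf("%.6f vs %.6f%n", a[i], -b[i]);
        }
    }
}
```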
Audio File Formats
WAVE (WAV) is often associated with Windows but is now implemented on other platforms as well
AIFF is common on Mac OS
AU is common on Unix/Linux
These are similar formats that vary in how they represent data, pack samples (e.g., little-endian vs. big-endian), etc.
A Java example of how to manipulate WAV files can be downloaded from WavFileManip.java
Zero Crossing Rate
What is Zero Crossing Rate (ZCR)?
Zero Crossing Rate (ZCR) is a measure of the number of times, in a given sample window, that the amplitude crosses the horizontal line at 0
ZCR can be used to detect silence vs. non-silence, voiced vs. unvoiced speech, speaker identity, etc.
ZCR is essentially the count of successive samples that change algebraic sign
ZCR Source
public class ZeroCrossingRate {
    public static double computeZCR01(double[] signals, double normalizer) {
        long numZC = 0;
        for (int i = 1; i < signals.length; i++) {
            if ((signals[i] >= 0 && signals[i-1] < 0) ||
                (signals[i] < 0 && signals[i-1] >= 0)) {
                numZC++;
            }
        }
        return numZC / normalizer;
    }
}
source code is in ZeroCrossingRate.java
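A quick usage sketch with made-up signal values (the helper restates the logic of computeZCR01 so the example is self-contained): an alternating signal crosses zero at nearly every step, while a constant-sign signal never does.

```java
public class ZcrDemo {
    // Same logic as ZeroCrossingRate.computeZCR01: count sign changes
    // between successive samples, then divide by the normalizer.
    public static double zcr(double[] signals, double normalizer) {
        long numZC = 0;
        for (int i = 1; i < signals.length; i++) {
            if ((signals[i] >= 0 && signals[i - 1] < 0) ||
                (signals[i] < 0 && signals[i - 1] >= 0)) {
                numZC++;
            }
        }
        return numZC / normalizer;
    }

    public static void main(String[] args) {
        double[] alternating = {1.0, -1.0, 1.0, -1.0}; // 3 sign changes
        double[] steady = {0.5, 0.7, 0.6, 0.8};        // no sign changes
        System.out.println(zcr(alternating, alternating.length)); // 0.75
        System.out.println(zcr(steady, steady.length));           // 0.0
    }
}
```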
ZCR in Voiced vs. Unvoiced Speech
Voiced speech is produced when vowels are spoken
Voiced speech is characterized by constant-frequency tones of some duration
Unvoiced speech is produced when consonants are spoken
Unvoiced speech is non-periodic and noise-like because air passes through a narrow constriction of the vocal tract
ZCR in Voiced vs. Unvoiced Speech
Phonetic theory states that voiced speech has a smooth airflow through the vocal tract, whereas unvoiced speech has a turbulent airflow that produces noise
Thus, voiced speech should have a low ZCR, whereas unvoiced speech should have a high ZCR
Amplitude of Voiced vs. Unvoiced Speech
Amplitude of unvoiced speech is low
Amplitude of voiced speech is high
Given a digital sample, we can use average amplitude as a measure of the sample's energy
This can be used to classify samples as vowels or consonants
ZCR & Amplitude of Voiced & Unvoiced Speech
          ZCR    Amplitude
Voiced    LOW    HIGH
Unvoiced  HIGH   LOW
Detection of Silence & Non-Silence
silence_buffer = [];
non_silence_buffer = [];
buffer = [];
while ( there are still frames left ) {
    Read a specific number of frames into buffer;
    Compute ZCR and average amplitude of buffer;
    if ( ZCR and average amplitude are below specific thresholds ) {
        add the buffer to silence_buffer;
    } else {
        add the buffer to non_silence_buffer;
    }
}
source code is in WavFileManip.detectSilence()
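The pseudocode above might be sketched in Java as follows. This is an illustrative sketch, not the deck's implementation (which is in WavFileManip.detectSilence()): the frame source is abstracted as a list of buffers, and the thresholds are caller-supplied guesses.

```java
import java.util.ArrayList;
import java.util.List;

public class SilenceDetector {
    // Classify fixed-size frames as silence vs. non-silence using
    // ZCR and average absolute amplitude; returns the silence frames.
    public static List<double[]> detectSilence(List<double[]> frames,
                                               double zcrThreshold,
                                               double ampThreshold) {
        List<double[]> silence = new ArrayList<>();
        List<double[]> nonSilence = new ArrayList<>();
        for (double[] buffer : frames) {
            if (zcr(buffer) < zcrThreshold && avgAbsAmplitude(buffer) < ampThreshold) {
                silence.add(buffer);
            } else {
                nonSilence.add(buffer);
            }
        }
        return silence;
    }

    // Fraction of successive samples that change algebraic sign.
    static double zcr(double[] s) {
        int n = 0;
        for (int i = 1; i < s.length; i++) {
            if ((s[i] >= 0) != (s[i - 1] >= 0)) n++;
        }
        return (double) n / s.length;
    }

    // Average absolute amplitude as a crude energy measure.
    static double avgAbsAmplitude(double[] s) {
        double sum = 0;
        for (double v : s) sum += Math.abs(v);
        return sum / s.length;
    }
}
```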
Dynamic Time Warping
source code is in DTW.java
Introduction
Dynamic Time Warping (DTW) is a method for finding an optimal alignment between two time-dependent sequences
DTW aligns (“warps”) two sequences in a non-linear way to match each other
DTW has been successfully used in automatic speech recognition (ASR), bioinformatics (genetic sequence matching), and video analysis
Basic Definitions
There are two sequences:
X = (x1, …, xN) and Y = (y1, …, yM)
There is a feature space F such that:
xi ∈ F and yj ∈ F, where 1 ≤ i ≤ N, 1 ≤ j ≤ M
There is a local cost measure mapping 2-tuples of features to non-negative reals:
c : F × F → ℝ≥0
Sample Sequences
(figure: two sample time-dependent sequences)
Sample Alignment
(figure: a non-linear alignment between the two sequences)
Cost Matrix DTW(N, M)
(figure: an N × M cost matrix, with X = X[1:N] indexed by i = 1, …, N along one axis and Y = Y[1:M] indexed by j = 1, …, M along the other)
dtw(i, j) is the cost of warping X[1:i] with Y[1:j]
X and Y are sequences X[1:N] and Y[1:M]
Warping Path
P = (p1, …, pL), where pl = (nl, ml) ∈ [1, N] × [1, M] and l ∈ [1, L], is a warping path if
1) p1 = (1, 1) and pL = (N, M)
2) n1 ≤ n2 ≤ … ≤ nL and m1 ≤ m2 ≤ … ≤ mL
3) pl+1 − pl ∈ {(1, 0), (0, 1), (1, 1)}, 1 ≤ l ≤ L − 1
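The three constraints can be checked mechanically. Here is an illustrative Java predicate (not from the slides; the path is given as an L × 2 array of (n, m) pairs):

```java
public class WarpingPath {
    // path[l] = {n, m}; returns true iff path satisfies the three
    // warping-path constraints for sequences of lengths N and M.
    public static boolean isValid(int[][] path, int N, int M) {
        int L = path.length;
        if (L == 0) return false;
        // Constraint 1: boundary conditions.
        if (path[0][0] != 1 || path[0][1] != 1) return false;
        if (path[L - 1][0] != N || path[L - 1][1] != M) return false;
        for (int l = 0; l < L - 1; l++) {
            int dn = path[l + 1][0] - path[l][0];
            int dm = path[l + 1][1] - path[l][1];
            // Constraint 3 (which implies constraint 2, monotonicity):
            // each step must be (1, 0), (0, 1), or (1, 1).
            boolean ok = (dn == 1 && dm == 0) || (dn == 0 && dm == 1)
                      || (dn == 1 && dm == 1);
            if (!ok) return false;
        }
        return true;
    }
}
```

For example, the deck's valid path ((1,1), (1,2), (2,3), (2,4), (3,5), (4,5)) for N = 4, M = 5 passes all three checks.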
Valid Warping Path
(figure: the path P drawn on a 4 × 5 grid)
P = (p1, p2, p3, p4, p5, p6), where
p1 = (1, 1), p2 = (1, 2), p3 = (2, 3), p4 = (2, 4), p5 = (3, 5), p6 = (4, 5)
Invalid Warping Path
(figure: a path on a 4 × 5 grid whose first point is not (1, 1))
p1 ≠ (1, 1), so constraint 1 is not satisfied
Invalid Warping Path
(figure: a path on a 4 × 5 grid that steps backward in the first coordinate)
p3 = (3, 3) and p4 = (2, 4); 3 > 2, so constraint 2 is not satisfied
Invalid Warping Path
(figure: a path on a 4 × 5 grid with a step larger than one cell)
p2 = (2, 2), p3 = (3, 4), p3 − p2 = (3, 4) − (2, 2) = (1, 2) ∉ {(1, 0), (0, 1), (1, 1)}, so constraint 3 is not satisfied
Total Cost of a Warping Path
If P = (p1, …, pL) is a warping path between sequences X and Y, then its total cost is
cP(X, Y) = Σ_{j=1}^{L} c(x_{n_j}, y_{m_j})
Example
(figure: the path P drawn on a grid with X = X[1:4] on the horizontal axis and Y = Y[1:5] on the vertical axis)
Assume that P = (p1, p2, p3, p4, p5, p6), where p1 = (1, 1), p2 = (1, 2), p3 = (2, 3), p4 = (2, 4), p5 = (3, 5), p6 = (4, 5), is a warping path b/w X[1:4] and Y[1:5].
Then the total cost of P is
c(x1, y1) + c(x1, y2) + c(x2, y3) + c(x2, y4) + c(x3, y5) + c(x4, y5).
The notation c(xi, yj) can be simplified to read c(i, j) or c(X[i], Y[j]).
DTW(X, Y) – Cost of an Optimal Warping Path
DTW(X, Y) = min{ cP(X, Y) | P is a warping path }
Remarks on DTW(X, Y)
There may be several warping paths with the same total cost DTW(X, Y)
DTW(X, Y) is symmetric whenever the local
cost measure is symmetric
DTW(X, Y) does not necessarily satisfy the
triangle inequality (the sum of the lengths of
two sides is greater than the length of the
remaining side)
DTW Equations: Base Cases
(figure: the N × M cost matrix, with X indexed by i = 1, …, N and Y indexed by j = 1, …, M; the first row and first column are highlighted)
Initial condition: dtw(1, 1) = c(1, 1)
1st Row: dtw(i, 1) = dtw(i − 1, 1) + c(i, 1)
1st Column: dtw(1, j) = dtw(1, j − 1) + c(1, j)
DTW Equations: Recursion
(figure: the cost matrix, with cell (i, j) and its three predecessors (i − 1, j), (i − 1, j − 1), (i, j − 1) highlighted)
Inner Cell: dtw(i, j) = min{dtw(i − 1, j), dtw(i − 1, j − 1), dtw(i, j − 1)} + c(i, j)
Interpretation: the cost of warping X[1:i] with Y[1:j] is the cost of warping X[i] with Y[j] plus the minimum of the following three costs: 1) the cost of warping X[1:i−1] with Y[1:j]; 2) the cost of warping X[1:i−1] with Y[1:j−1]; 3) the cost of warping X[1:i] with Y[1:j−1]
Example
Let the sequences be:
X = (a, b, g), Y = (a, b, b, g), Z = (a, g, g)
Let the feature space F = {a, b, g}.
Let the local cost measure be defined as follows:
c(x, y) = 0 if x = y; 1 if x ≠ y
Let us compute dtw(X, Y), dtw(Y, Z), and dtw(X, Z).
Work it out on paper.
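The DTW recurrences can also be turned into Java almost directly. A minimal sketch using char sequences and the 0/1 cost measure from this example (the deck's full implementation is in DTW.java):

```java
public class SimpleDtw {
    // 0/1 local cost measure from the example: 0 if symbols match, else 1.
    static int cost(char x, char y) { return x == y ? 0 : 1; }

    // Fill the N x M cost matrix using the base cases and the recursion.
    public static int dtw(char[] X, char[] Y) {
        int N = X.length, M = Y.length;
        int[][] d = new int[N + 1][M + 1];          // 1-based indexing
        d[1][1] = cost(X[0], Y[0]);                 // initial condition
        for (int i = 2; i <= N; i++)                // first row
            d[i][1] = d[i - 1][1] + cost(X[i - 1], Y[0]);
        for (int j = 2; j <= M; j++)                // first column
            d[1][j] = d[1][j - 1] + cost(X[0], Y[j - 1]);
        for (int i = 2; i <= N; i++)                // inner cells
            for (int j = 2; j <= M; j++)
                d[i][j] = Math.min(d[i - 1][j], Math.min(d[i - 1][j - 1], d[i][j - 1]))
                        + cost(X[i - 1], Y[j - 1]);
        return d[N][M];
    }

    public static void main(String[] args) {
        char[] X = {'a', 'b', 'g'}, Y = {'a', 'b', 'b', 'g'}, Z = {'a', 'g', 'g'};
        System.out.println(dtw(X, Y)); // 0
        System.out.println(dtw(Y, Z)); // 2
        System.out.println(dtw(X, Z)); // 1
    }
}
```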
DTW(X, Y)
Example: DTW(X,Y)
(figure: the 3 × 4 cost matrix, with X = (a, b, g) indexing the columns i = 1, 2, 3 and Y = (a, b, b, g) indexing the rows j = 1, 2, 3, 4; the cells are filled step by step as follows)
dtw(1,1) = c(a, a) = 0
dtw(2,1) = c(2,1) + dtw(1,1) = c(b, a) + dtw(1,1) = 1 + 0 = 1
dtw(3,1) = c(3,1) + dtw(2,1) = c(g, a) + dtw(2,1) = 1 + 1 = 2
dtw(1,2) = c(1,2) + dtw(1,1) = c(a, b) + dtw(1,1) = 1 + 0 = 1
dtw(1,3) = c(1,3) + dtw(1,2) = c(a, b) + dtw(1,2) = 1 + 1 = 2
dtw(1,4) = c(1,4) + dtw(1,3) = c(a, g) + dtw(1,3) = 1 + 2 = 3
dtw(2,2) = c(2,2) + min{dtw(1,2), dtw(1,1), dtw(2,1)} = c(b, b) + min{1, 0, 1} = 0 + 0 = 0
dtw(3,2) = c(3,2) + min{dtw(2,2), dtw(2,1), dtw(3,1)} = c(g, b) + min{0, 1, 2} = 1 + 0 = 1
dtw(2,3) = c(2,3) + min{dtw(1,3), dtw(1,2), dtw(2,2)} = c(b, b) + min{2, 1, 0} = 0 + 0 = 0
dtw(3,3) = c(3,3) + min{dtw(2,3), dtw(2,2), dtw(3,2)} = c(g, b) + min{0, 0, 1} = 1 + 0 = 1
dtw(2,4) = c(2,4) + min{dtw(1,4), dtw(1,3), dtw(2,3)} = c(b, g) + min{3, 2, 0} = 1 + 0 = 1
dtw(3,4) = c(3,4) + min{dtw(2,4), dtw(2,3), dtw(3,3)} = c(g, g) + min{1, 0, 1} = 0 + 0 = 0
The completed cost matrix (row j = 4 on top, row j = 1 at the bottom):
j=4 | 3  1  0
j=3 | 2  0  1
j=2 | 1  0  1
j=1 | 0  1  2
      a  b  g
So DTW(X, Y) = 0. The Optimal Warping Path (OWP) P can be found by chasing pointers (the red arrows in the slides): P = ((1,1), (2,2), (2,3), (3,4)).
DTW(Y, Z)
(figure: the 4 × 3 cost matrix, with Y = (a, b, b, g) indexing the columns i = 1, 2, 3, 4 and Z = (a, g, g) indexing the rows j = 1, 2, 3; the cells are filled step by step as follows)
dtw(1,1) = c(a, a) = 0
dtw(2,1) = c(b, a) + dtw(1,1) = 1 + 0 = 1
dtw(3,1) = c(b, a) + dtw(2,1) = 1 + 1 = 2
dtw(4,1) = c(g, a) + dtw(3,1) = 1 + 2 = 3
dtw(1,2) = c(a, g) + dtw(1,1) = 1 + 0 = 1
dtw(1,3) = c(a, g) + dtw(1,2) = 1 + 1 = 2
dtw(2,2) = c(b, g) + min{dtw(1,2), dtw(1,1), dtw(2,1)} = 1 + min{1, 0, 1} = 1 + 0 = 1
dtw(3,2) = c(b, g) + min{dtw(2,2), dtw(2,1), dtw(3,1)} = 1 + min{1, 1, 2} = 1 + 1 = 2
dtw(4,2) = c(g, g) + min{dtw(3,2), dtw(3,1), dtw(4,1)} = 0 + min{2, 2, 3} = 0 + 2 = 2
dtw(2,3) = c(b, g) + min{dtw(1,3), dtw(1,2), dtw(2,2)} = 1 + min{2, 1, 1} = 1 + 1 = 2
dtw(3,3) = c(b, g) + min{dtw(2,3), dtw(2,2), dtw(3,2)} = 1 + min{2, 1, 2} = 1 + 1 = 2
dtw(4,3) = c(g, g) + min{dtw(3,3), dtw(3,2), dtw(4,2)} = 0 + min{2, 2, 2} = 0 + 2 = 2
The completed cost matrix:
j=3 | 2  2  2  2
j=2 | 1  1  2  2
j=1 | 0  1  2  3
      a  b  b  g
DTW(Y, Z) = 2. The Optimal Warping Path (OWP) P can be found by chasing pointers (the red arrows in the slides): P = ((1,1), (2,2), (3,2), (4,3)).
DTW(X, Z)
(figure: the 3 × 3 cost matrix, with X = (a, b, g) indexing the columns i = 1, 2, 3 and Z = (a, g, g) indexing the rows j = 1, 2, 3; the cells are filled step by step as follows)
dtw(1,1) = c(a, a) = 0
dtw(2,1) = c(b, a) + dtw(1,1) = 1 + 0 = 1
dtw(3,1) = c(g, a) + dtw(2,1) = 1 + 1 = 2
dtw(1,2) = c(a, g) + dtw(1,1) = 1 + 0 = 1
dtw(1,3) = c(a, g) + dtw(1,2) = 1 + 1 = 2
dtw(2,2) = c(b, g) + min{dtw(1,2), dtw(1,1), dtw(2,1)} = 1 + min{1, 0, 1} = 1 + 0 = 1
dtw(3,2) = c(g, g) + min{dtw(2,2), dtw(2,1), dtw(3,1)} = 0 + min{1, 1, 2} = 0 + 1 = 1
dtw(2,3) = c(b, g) + min{dtw(1,3), dtw(1,2), dtw(2,2)} = 1 + min{2, 1, 1} = 1 + 1 = 2
dtw(3,3) = c(g, g) + min{dtw(2,3), dtw(2,2), dtw(3,2)} = 0 + min{2, 1, 1} = 0 + 1 = 1
The completed cost matrix:
j=3 | 2  2  1
j=2 | 1  1  1
j=1 | 0  1  2
      a  b  g
DTW(X, Z) = 1. The Optimal Warping Path (OWP) P can be found by chasing pointers (the red arrows in the slides): P = ((1,1), (2,2), (3,3)).
Possible Optimizations of DTW
The computation of DTW can be optimized so that only the
cells within a specific window are considered
Possible Optimizations of DTW
You may have realized by now that if we care
only about the total cost of warping sequence X
with sequence Y, we do not need to compute
the entire N x M cost matrix – we need only two
columns
The storage savings are huge, but the running
time remains the same – O(N x M)
We can also normalize the DTW cost by N x M
to keep it low
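The two-column idea might be sketched as follows (an illustrative helper, not the deck's DTW.java; double-valued sequences with |x − y| as the local cost): keep only the previous and current columns and swap them as i advances.

```java
public class TwoColumnDtw {
    // DTW over double sequences with |x - y| as the local cost,
    // using O(M) storage: prev holds column i-1, curr holds column i.
    public static double dtw(double[] X, double[] Y) {
        int N = X.length, M = Y.length;
        double[] prev = new double[M], curr = new double[M];
        prev[0] = Math.abs(X[0] - Y[0]);                  // initial condition
        for (int j = 1; j < M; j++)                       // first column (i = 1)
            prev[j] = prev[j - 1] + Math.abs(X[0] - Y[j]);
        for (int i = 1; i < N; i++) {
            curr[0] = prev[0] + Math.abs(X[i] - Y[0]);    // first-row entry
            for (int j = 1; j < M; j++) {
                double best = Math.min(prev[j], Math.min(prev[j - 1], curr[j - 1]));
                curr[j] = best + Math.abs(X[i] - Y[j]);
            }
            double[] tmp = prev; prev = curr; curr = tmp; // swap the two columns
        }
        return prev[M - 1];
    }
}
```

Running time is still O(N × M), but only 2M cells are stored instead of N × M.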
Spoken Word Recognition
source code is in WavAudioDictionary.java
General Outline
Given a directory of audio files with spoken words, process
each file into a table that maps specific words (or phrases)
to digital signal vectors
These signal vectors can be pre-processed to eliminate
silences
An input audio file is taken and digitized into a digital signal vector
The input vector is then compared with the vectors in the table by computing DTW scores b/w the input vector and each table vector
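The outline above might be sketched as follows (a hypothetical in-memory dictionary; the real pipeline, including WAV decoding, is in WavAudioDictionary.java). DTW here uses |x − y| as the local cost, and the word whose stored vector yields the lowest DTW score wins.

```java
import java.util.HashMap;
import java.util.Map;

public class WordRecognizer {
    private final Map<String, double[]> table = new HashMap<>();

    // Map a spoken word to its (pre-processed) digital signal vector.
    public void addWord(String word, double[] signal) {
        table.put(word, signal);
    }

    // Return the table word whose vector has the lowest DTW score
    // against the input vector.
    public String recognize(double[] input) {
        String best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : table.entrySet()) {
            double score = dtw(input, e.getValue());
            if (score < bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    // Standard full-matrix DTW with |x - y| as the local cost.
    static double dtw(double[] X, double[] Y) {
        int N = X.length, M = Y.length;
        double[][] d = new double[N][M];
        d[0][0] = Math.abs(X[0] - Y[0]);
        for (int i = 1; i < N; i++) d[i][0] = d[i - 1][0] + Math.abs(X[i] - Y[0]);
        for (int j = 1; j < M; j++) d[0][j] = d[0][j - 1] + Math.abs(X[0] - Y[j]);
        for (int i = 1; i < N; i++)
            for (int j = 1; j < M; j++)
                d[i][j] = Math.min(d[i - 1][j], Math.min(d[i - 1][j - 1], d[i][j - 1]))
                        + Math.abs(X[i] - Y[j]);
        return d[N - 1][M - 1];
    }
}
```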
Optimizations
If we use DTW to compute the similarity b/w the
digital audio input vector and the vectors in the table,
it is vital to keep the vectors as short as possible w/o
sacrificing precision
Possible optimizations: decreasing the sampling rate, merging samples into super-features (e.g., Haar coefficients), and parallelizing similarity computations
References
M. Muller. Information Retrieval for Music and
Motion, Ch.04. Springer, ISBN 978-3-540-74047-6
Bachu, R. G., et al. “Separation of Voiced and
Unvoiced using Zero Crossing Rate and Energy of the
Speech Signal." American Society for Engineering
Education (ASEE) Zone Conference Proceedings. 2008.