TRANSCRIPT
CS 188: Artificial Intelligence
Spring 2009
Lecture 23: Speech Recognition
4/14/2009
John DeNero – UC Berkeley
Slides adapted from Dan Klein
Announcements
Written 3 due on Thursday in lecture
Please don't be late; no slip days allowed
Project 4 posted
Due next Wednesday 4/22 at 11:59pm
Use up to two slip days
Course contest update
You can qualify for the tournament starting tonight
First person to submit an agent will make me happy
Today
Particle filter recap
Speech recognition using HMMs
Particle Filtering Review
An approximate technique for filtering: P(Xt | e1, …, et)
Idea: always keep N guesses (samples) of the value of Xt
Initial samples, or particles, are drawn from the prior P(X1)
[Figure: grid of sample particles and the belief estimates they induce]
Three operations:
1) Elapse time: draw a sample for Xt+1 from each particle using P(Xt+1 | xt)
2) Observe: weight all particles by the likelihood of the evidence et (this is how new evidence is incorporated)
3) Resample: sample new particles in proportion to those weights
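The three operations fit in a few lines of Python. A minimal sketch of my own (not course code); transition_sample and emission_prob are hypothetical stand-ins for the model's P(Xt+1 | xt) sampler and P(et | xt) table:

    import random

    def particle_filter_step(particles, evidence, transition_sample, emission_prob):
        # 1) Elapse time: push each particle through the transition model
        particles = [transition_sample(x) for x in particles]
        # 2) Observe: weight each particle by the evidence likelihood P(et | xt)
        weights = [emission_prob(evidence, x) for x in particles]
        # 3) Resample: draw N new particles in proportion to the weights,
        #    independently and with replacement
        return random.choices(particles, weights=weights, k=len(particles))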
Resampling Step Details
Each particle is already weighted by the evidence likelihood: P(et | xt)
We randomly choose particles in proportion to those weights:
The probability of choosing a particle is proportional to its weight
Each new particle is chosen independently, with replacement
The probability of selecting a set of new particles is the product of the probabilities for each one
[Figure: rain/sun transition diagram with self-loop probabilities 0.7 and cross-transition probabilities 0.3]

X      E             P
rain   umbrella      0.9
rain   no umbrella   0.1
sun    umbrella      0.2
sun    no umbrella   0.8
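A small worked example of the selection probabilities (hypothetical particles, using the emission table above):

    # Three hypothetical particles, evidence = umbrella
    weights = {"p1": 0.9, "p2": 0.9, "p3": 0.2}        # P(umbrella | x) for each
    total = sum(weights.values())                      # 2.0
    select = {p: w / total for p, w in weights.items()}
    # select == {"p1": 0.45, "p2": 0.45, "p3": 0.1}
    # Draws are independent and with replacement, so the probability of
    # resampling the particular ordered set (p1, p2, p3) is the product:
    prob = select["p1"] * select["p2"] * select["p3"]  # 0.45 * 0.45 * 0.1 ≈ 0.0203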
SLAM
SLAM = Simultaneous Localization And Mapping
We do not know the map or our location
Our belief state is over maps and positions!
Main techniques: Kalman filtering (Gaussian HMMs) and particles
[DEMOS]
DP-SLAM, Ron Parr
Hidden Markov Models
An HMM is defined by:
Initial distribution: P(X1)
Transitions: P(Xt | Xt-1)
Emissions: P(Et | Xt)
Query: most likely sequence: x*1:T = argmax over x1:T of P(x1:T | e1:T)
[Figure: HMM as a Bayes net: X1 → X2 → X3 → X4 → X5, each Xi emitting Ei]
State Path Trellis
State trellis: graph of states and transitions over time
Each arc represents some transition xt-1 → xt
Each arc has weight P(xt | xt-1) P(et | xt)
Each path is a sequence of states
The product of weights on a path is that sequence's probability
Can think of the Forward (and now Viterbi) algorithms as computing sums of all paths (best paths) in this graph
[Figure: trellis with sun/rain states at each time step]
Viterbi Algorithm
mt(xt) = max over x1:t-1 of P(x1:t-1, xt, e1:t)
       = P(et | xt) · max over xt-1 of P(xt | xt-1) mt-1(xt-1)
[Figure: trellis with sun/rain states at each time step, tracing best paths]
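A compact Viterbi sketch for the rain/sun HMM (my own illustration, not code from the course; the uniform initial distribution is an assumption):

    def viterbi(states, init, trans, emit, evidence):
        # m[x] = probability of the best state sequence ending in state x
        m = {x: init[x] * emit[x][evidence[0]] for x in states}
        back = []                                  # backpointers per time step
        for e in evidence[1:]:
            prev, m, ptr = m, {}, {}
            for x in states:
                best = max(states, key=lambda xp: prev[xp] * trans[xp][x])
                m[x] = prev[best] * trans[best][x] * emit[x][e]
                ptr[x] = best
            back.append(ptr)
        # Follow backpointers from the best final state
        last = max(states, key=lambda x: m[x])
        path = [last]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return list(reversed(path))

    # Transition/emission tables from the resampling slide; uniform init assumed
    states = ["rain", "sun"]
    init = {"rain": 0.5, "sun": 0.5}
    trans = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
    emit = {"rain": {"umbrella": 0.9, "no umbrella": 0.1},
            "sun": {"umbrella": 0.2, "no umbrella": 0.8}}
    print(viterbi(states, init, trans, emit, ["umbrella", "umbrella", "no umbrella"]))
    # -> ['rain', 'rain', 'sun']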
Example
Digitizing Speech
Speech in an Hour
Speech input is an acoustic wave form
[Figure: waveform of "speech lab", segmented into phones s p ee ch l a b, with a zoomed view of the "l" to "a" transition]
Graphs from Simon Arnfield's web tutorial on speech, Sheffield:
http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/
Spectral Analysis
Frequency gives pitch; amplitude gives volume
Sampling at ~8 kHz for phone, ~16 kHz for mic (kHz = 1000 cycles/sec)
Fourier transform of wave displayed as a spectrogram
Darkness indicates energy at each frequency
[Figure: spectrogram of "speech lab", labeled s p ee ch l a b]
Adding 100 Hz + 1000 Hz Waves
[Figure: waveform of the summed 100 Hz and 1000 Hz sine waves over 0 to 0.05 s; amplitude ranges from about -0.97 to 0.99]
Spectrum
Frequency components (100 and 1000 Hz) on x-axis
[Figure: spectrum of the summed wave, with peaks at 100 Hz and 1000 Hz; x-axis frequency in Hz, y-axis amplitude]
Part of [ae] from “lab”
Note the complex wave repeating nine times in the figure
Plus a smaller wave that repeats 4 times for every large pattern
The large wave has a frequency of 250 Hz (9 repetitions in .036 seconds)
The small wave is roughly 4 times this, or roughly 1000 Hz
Two tiny waves ride on top of each peak of the 1000 Hz wave
Back to Spectra
The spectrum represents these frequency components
Computed by the Fourier transform, an algorithm which separates out each frequency component of a wave
The x-axis shows frequency; the y-axis shows magnitude (in decibels, a log measure of amplitude)
Peaks at 930 Hz, 1860 Hz, and 3020 Hz
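To make the Fourier transform concrete, here is a small NumPy sketch (my illustration, not from the lecture) that rebuilds the 100 Hz + 1000 Hz wave from the earlier slide and finds its two spectral peaks:

    import numpy as np

    sr = 16000                                  # 16 kHz, the mic rate above
    t = np.arange(0, 0.05, 1 / sr)              # 0.05 s of samples
    wave = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)

    spectrum = np.abs(np.fft.rfft(wave))        # magnitude per frequency bin
    freqs = np.fft.rfftfreq(len(wave), 1 / sr)  # bin centers in Hz
    peaks = freqs[np.argsort(spectrum)[-2:]]    # the two strongest bins
    print(sorted(peaks))                        # [100.0, 1000.0]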
Resonances of the vocal tract
The human vocal tract behaves as an open tube
Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube
Constraint: the pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end
[Figure: tube closed at the glottal end and open at the lip end, length 17.5 cm]
Figure from W. Barry Speech Science slides
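As a sanity check (standard tube physics, not stated on the slide): the lowest resonance of a tube closed at one end is f = c / 4L:

    c = 350.0          # approximate speed of sound in warm air, m/s
    L = 0.175          # vocal tract length from the slide, 17.5 cm
    f1 = c / (4 * L)   # = 500 Hz: the lowest resonance (first formant region)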
From Mark Liberman's website
Acoustic Feature Sequence
Time slices are translated into acoustic feature vectors (~39 real numbers per slice)
These are the observations; now we need the hidden states X
[Figure: spectrogram sliced into time frames, each frame mapped to a feature vector … e12 e13 e14 e15 e16 …]
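One common way to get ~39 numbers per slice is 13 MFCCs plus their first and second differences. A hedged sketch using librosa (the lecture doesn't name a tool, and the filename is hypothetical):

    import numpy as np
    import librosa

    y, sr = librosa.load("speech_lab.wav", sr=16000)    # hypothetical file
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 cepstral coefficients
    d1 = librosa.feature.delta(mfcc)                    # first differences
    d2 = librosa.feature.delta(mfcc, order=2)           # second differences
    features = np.vstack([mfcc, d1, d2])                # 39 real numbers per slice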
State Space
P(E|X) encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)
P(X | X') encodes how sounds can be strung together
We will have one state for each sound in each word
From some state x, we can only:
Stay in the same state (e.g. speaking slowly)
Move to the next position in the word
At the end of the word, move to the start of the next word
We build a little state graph for each word and chain them together to form our state space X (sketched below)
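A toy sketch of the per-word state graph (my own illustration; the phone names are examples):

    def word_transitions(phones):
        # Map each state (position in the word) to its allowed successors:
        # stay put (slow speech) or advance to the next position.
        arcs = {}
        for i in range(len(phones)):
            succ = [i]                     # stay in the same state
            if i + 1 < len(phones):
                succ.append(i + 1)         # move to the next position
            arcs[i] = succ
        return arcs
    # Chaining word graphs end-to-start (omitted here) yields the full space X.

    print(word_transitions(["l", "ae", "b"]))  # {0: [0, 1], 1: [1, 2], 2: [2]}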
HMMs for Speech
Markov Process with Bigrams
Figure from Huang et al, page 618
Decoding
While there are some practical issues, finding the words given the acoustics is an HMM inference problem
We want to know which state sequence x1:T is most likely given the evidence e1:T:
x*1:T = argmax over x1:T of P(x1:T | e1:T) = argmax over x1:T of P(x1:T, e1:T)
From the sequence x, we can simply read off the words
End of Part II!
Now we're done with our unit on probabilistic reasoning
Last part of class: machine learning