Speech in Multimedia
![Page 1: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/1.jpg)
Speech in Multimedia
Hao Jiang
Computer Science Department
Boston College
Oct. 9, 2007
![Page 2: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/2.jpg)
Outline
Introduction
Topics in speech processing
– Speech coding
– Speech recognition
– Speech synthesis
– Speaker verification/recognition
Conclusion
![Page 3: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/3.jpg)
Introduction
Speech is our basic communication tool.
We have long hoped to communicate with machines using speech.
C3PO and R2D2
![Page 4: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/4.jpg)
Speech Production Model
[Figure, two panels: anatomical structure; mechanical model]
![Page 5: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/5.jpg)
Characteristics of Digital Speech
[Figure, two panels: waveform of a speech signal; spectrogram of the same speech (axes: time, frequency)]
![Page 6: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/6.jpg)
Voiced and Unvoiced Speech
[Figure: speech waveform with silence, unvoiced, and voiced segments labeled]
![Page 7: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/7.jpg)
Short-time Parameters
[Figure, two panels: short-time power; waveform envelope]
![Page 8: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/8.jpg)
[Figure, two panels: zero-crossing rate; pitch period]
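The short-time parameters named on these slides can be sketched in a few lines. This is a minimal illustration, not the method used to produce the slide figures: the frame length, the toy 100 Hz test signal, and the autocorrelation-based pitch estimate are all assumptions made for the example.

```python
import math

def frames(signal, frame_len, hop):
    """Split a sequence into frames (a trailing partial frame is dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) that maximizes the autocorrelation within [min_lag, max_lag]."""
    def autocorr(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
    return max(range(min_lag, max_lag + 1), key=autocorr)

# Toy signal (assumed): 200 samples of silence, then a 100 Hz sine sampled at 8 kHz.
fs = 8000
signal = [0.0] * 200 + [0.3 * math.sin(2 * math.pi * 100 * n / fs) for n in range(800)]

for i, f in enumerate(frames(signal, frame_len=200, hop=200)):
    print(i, short_time_power(f), zero_crossing_rate(f))
```

Low power marks silence, a high zero-crossing rate with low power is typical of unvoiced speech, and a clear autocorrelation peak indicates a voiced frame and its pitch period.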
![Page 9: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/9.jpg)
Speech Coding
As with images, speech can be compressed so that it is smaller and easier to store and transmit.
General compression methods such as DPCM can also be used.
More compression can be achieved by taking advantage of the speech production model.
There are two classes of speech coders:
– Waveform coder
– Vocoder
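As a rough illustration of the DPCM idea mentioned above, the sketch below quantizes the difference between each sample and the previous reconstructed sample. The step size and the sample values are assumptions chosen for the example; practical coders use better predictors and adaptive step sizes.

```python
def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the decoder's prediction."""
    codes, prediction = [], 0
    for x in samples:
        code = round((x - prediction) / step)   # quantized prediction error
        codes.append(code)
        prediction += code * step               # track what the decoder will rebuild
    return codes

def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the dequantized differences."""
    out, prediction = [], 0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

samples = [0, 3, 7, 12, 14, 13, 9, 4]           # made-up sample values
codes = dpcm_encode(samples, step=2)
decoded = dpcm_decode(codes, step=2)
print(codes, decoded)
```

Because the encoder predicts from the reconstructed value rather than the original, quantization error does not accumulate: each decoded sample stays within half a step of the input.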
![Page 10: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/10.jpg)
LPC Speech Coder
[Block diagram: speech → speech buffer (frame n, frame n+1) → speech analysis (pitch, voiced/unvoiced decision, vocal tract parameters, energy parameter) → quantizer → code generation → code stream]
![Page 11: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/11.jpg)
LPC and Vocal Tract
Mathematically, speech can be modeled by the following generation model:
$$x(n) = \sum_{p=1}^{k} a_p\, x(n-p) + e(n)$$
$\{a_1, a_2, \ldots, a_k\}$ are called the Linear Prediction Coefficients (LPC); they model the shape of the vocal tract.
$e(n)$ is the excitation that generates the speech.
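One common way to estimate the coefficients in this model is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which estimator is used, so the sketch below is an assumption, and the test signal is a synthetic second-order process rather than real speech.

```python
import random

def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x(n) * x(n - lag), for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for a_1..a_order.

    Returns (a, err): x(n) is predicted as sum_p a[p-1] * x(n-p),
    and err is the energy of the prediction residual.
    """
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err
        prev = a[:]
        a[i] = k
        for j in range(i):
            a[j] = prev[j] - k * prev[i - 1 - j]
        err *= (1.0 - k * k)
    return a, err

# Synthetic signal with known coefficients a_1 = 1.3, a_2 = -0.6 (illustrative values).
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(5000):
    x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))

a, err = levinson_durbin(autocorrelation(x, 2), order=2)
print(a, err)
```

The estimated coefficients should land close to the values used to generate the signal, which is the sense in which LPC recovers the filter behind the waveform.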
![Page 12: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/12.jpg)
Decoding and Speech Synthesis
[Block diagram: pitch period → impulse train generator → glottal pulse generator (voiced branch); random noise generator (unvoiced branch); U/V switch selects the excitation; gain; vocal tract model → radiation model → speech]
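The decoder structure can be sketched as an excitation source driving an all-pole filter. This is a simplified illustration: the glottal pulse and radiation models are omitted, and the coefficients, gains, and pitch period are made-up values rather than parameters from a real coder.

```python
import random

def excitation(n_samples, voiced, pitch_period, rng):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(e, a, gain):
    """Vocal tract filter: x(n) = sum_p a[p-1] * x(n-p) + gain * e(n)."""
    x = []
    for n, en in enumerate(e):
        past = sum(a[p] * x[n - 1 - p] for p in range(len(a)) if n - 1 - p >= 0)
        x.append(past + gain * en)
    return x

rng = random.Random(0)
a = [1.3, -0.6]   # illustrative, stable coefficients (not from a real voice)
voiced_frame = all_pole_filter(excitation(180, True, 60, rng), a, gain=0.5)
unvoiced_frame = all_pole_filter(excitation(180, False, 60, rng), a, gain=0.1)
print(voiced_frame[:3], unvoiced_frame[:3])
```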
![Page 13: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/13.jpg)
An Example for Synthesizing Speech
Blending region
Glottal Pulse
Pass through the vocal tract filter with gain control
Pass through the radiation filter
![Page 14: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/14.jpg)
LPC10 (FS1015)
LPC10 was the DoD speech coding standard for voice communication at 2.4 kbps.
LPC10 works on speech sampled at 8 kHz, using a 22.5 ms frame and 10 LPC coefficients.
[Audio examples: original speech; LPC-decoded speech]
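The numbers above fix the frame size and the bit budget per frame. The short calculation below works them out; the comparison against 16-bit PCM is an assumption added for scale and is not stated on the slide.

```python
sample_rate_hz = 8000
frame_ms = 22.5
bit_rate_bps = 2400

samples_per_frame = sample_rate_hz * frame_ms / 1000   # samples in one 22.5 ms frame
bits_per_frame = bit_rate_bps * frame_ms / 1000        # bits available to code one frame

# Assumed reference: uncompressed 16-bit PCM at the same sampling rate.
pcm_bit_rate_bps = sample_rate_hz * 16
compression_ratio = pcm_bit_rate_bps / bit_rate_bps

print(samples_per_frame, bits_per_frame, compression_ratio)
```

Each frame of 180 samples must therefore be described in 54 bits, which is why the coder transmits model parameters (pitch, voicing, gain, LPC coefficients) rather than waveform samples.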
![Page 15: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/15.jpg)
Mixed Excitation LP
For real speech, the excitation is usually not a pure pulse train or pure noise but a mixture of the two.
The new 2.4 kbps standard (MELP) addresses this problem.
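[Block diagram: pulses → bandpass filter → weight w; noise → bandpass filter → weight 1-w; the two branches are summed (+); gain; vocal tract model → radiation model → speech]
[Audio examples: original speech; MELP-decoded speech]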
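The weighted mixing of pulses and noise can be sketched as follows. This is a simplification: it applies one weight w to the whole signal and leaves out the bandpass filters shown in the diagram, and the pitch period and weights are made-up values.

```python
import random

def mixed_excitation(n_samples, pitch_period, w, rng):
    """Return w * pulse train + (1 - w) * white noise, with 0 <= w <= 1."""
    out = []
    for n in range(n_samples):
        pulse = 1.0 if n % pitch_period == 0 else 0.0
        noise = rng.gauss(0.0, 1.0)
        out.append(w * pulse + (1.0 - w) * noise)
    return out

rng = random.Random(0)
mostly_voiced = mixed_excitation(180, pitch_period=60, w=0.8, rng=rng)
mostly_noise = mixed_excitation(180, pitch_period=60, w=0.2, rng=rng)
print(mostly_voiced[:3], mostly_noise[:3])
```

Setting w to 1 or 0 recovers the pure pulse or pure noise excitation of the simpler LPC decoder.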
![Page 16: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/16.jpg)
Hybrid Speech Codecs
For higher bit rates, hybrid speech codecs have an advantage over vocoders.
FS1016: CELP (Code Excited Linear Prediction).
G.723.1: a dual-rate codec (5.3 kbps and 6.3 kbps) for multimedia communication over the Internet.
G.729: a CELP-based codec at 8 kbps.
[Block diagram, analysis by synthesis: speech → "perceptual" comparison → model parameter generation → speech synthesis → back to the comparison; output: code]
[Audio examples: sound at 5.3 kbps; sound at 6.3 kbps; sound at 8 kbps]
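Analysis by synthesis can be sketched as a codebook search: every candidate excitation is synthesized through the vocal tract filter, and the candidate whose output is closest to the input frame is kept. This toy version is an assumption-laden illustration; it uses a plain squared error in place of the "perceptual" comparison, and the codebook and filter coefficient are invented.

```python
def synthesize(excitation, a):
    """All-pole synthesis: x(n) = sum_p a[p-1] * x(n-p) + e(n)."""
    x = []
    for n, en in enumerate(excitation):
        past = sum(a[p] * x[n - 1 - p] for p in range(len(a)) if n - 1 - p >= 0)
        x.append(past + en)
    return x

def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the codeword whose synthesis best matches target."""
    best = None
    for index, codeword in enumerate(codebook):
        y = synthesize(codeword, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy   # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

a = [0.9]                                                   # hypothetical filter
codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, -1, 0], [1, 1, 1, 1]]
target = [2.0 * v for v in synthesize(codebook[2], a)]      # frame built from entry 2
index, gain, error = search_codebook(target, codebook, a)
print(index, gain, error)
```

Only the chosen index and gain need to be transmitted, since the decoder holds the same codebook and filter.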
![Page 17: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/17.jpg)
Speech Recognition
Speech recognition is the foundation of human-computer interaction using speech.
Speech recognition in different contexts:
– Speaker dependent or speaker independent.
– Discrete words or continuous speech.
– Small vocabulary or large vocabulary.
– Quiet environment or noisy environment.
[Block diagram: speech → parameter analyzer → comparison and decision algorithm (using reference patterns and a language model) → words]
![Page 18: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/18.jpg)
How does Speech Recognition Work?
Words: grey whales
Phonemes: g r ey w ey l z
Each phoneme has different characteristics (for example, the power distribution).
![Page 19: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/19.jpg)
Speech Recognition
g g r ey ey ey ey w ey ey l l z
How do we “match” the word when there are time and other variations?
![Page 20: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/20.jpg)
Hidden Markov Model
[Diagram: HMM with states S1, S2, S3, transition probabilities such as P12, and each state emitting symbols from {a, b, c, …}]
![Page 21: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/21.jpg)
Dynamic Programming in Decoding
[Figure: trellis of states versus time]
We can find the path through the states that corresponds to the most probable phoneme sequence for generating the observed feature sequence (one feature extracted from each speech frame).
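The dynamic-programming search described here is usually called the Viterbi algorithm. The sketch below runs it on a toy left-to-right model; the three phoneme states, the symbols a/b/c, and all probabilities are invented for illustration and are not taken from the slides.

```python
import math

def lg(p):
    """Log probability, with log(0) mapped to -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most probable state path for an observation sequence (log domain)."""
    score = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    back = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda q: score[q] + log_trans[q][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):          # follow back-pointers to the start
        path.append(pointers[path[-1]])
    return path[::-1]

states = ["g", "r", "ey"]
start = {"g": 1.0, "r": 0.0, "ey": 0.0}
trans = {"g": {"g": 0.5, "r": 0.5, "ey": 0.0},
         "r": {"g": 0.0, "r": 0.5, "ey": 0.5},
         "ey": {"g": 0.0, "r": 0.0, "ey": 1.0}}
emit = {"g": {"a": 0.8, "b": 0.1, "c": 0.1},
        "r": {"a": 0.1, "b": 0.8, "c": 0.1},
        "ey": {"a": 0.1, "b": 0.1, "c": 0.8}}

log_start = {s: lg(p) for s, p in start.items()}
log_trans = {s: {t: lg(p) for t, p in row.items()} for s, row in trans.items()}
log_emit = {s: {o: lg(p) for o, p in row.items()} for s, row in emit.items()}

path = viterbi(["a", "a", "b", "c", "c", "c"], states, log_start, log_trans, log_emit)
print(path)
```

Repeated states in the path absorb the time variation: a phoneme that lasts several frames simply stays in its state for several steps.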
![Page 22: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/22.jpg)
HMM for a Unigram Language Model
[Diagram: start state s0 branches with probabilities p1, p2, p3 to the word models HMM1 (word1), HMM2 (word2), HMM3 (wordn)]
![Page 23: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/23.jpg)
Speech Synthesis
Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.).
Speech synthesis has been widely used for text-to-speech systems and different telephone services.
The easiest and most often used speech synthesis method is waveform concatenation.
Increase the pitch without changing the speed
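Waveform concatenation, mentioned above, can be sketched as joining recorded units with a short cross-fade at each joint. The linear fade and the overlap length are assumptions for this example; the slides do not specify how units are blended.

```python
def concatenate(units, overlap):
    """Join waveform units, linearly cross-fading `overlap` samples at each joint.

    Assumes every unit is longer than `overlap`.
    """
    out = list(units[0])
    for unit in units[1:]:
        start = len(out) - overlap
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)          # fade-in weight for the new unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out

# Two made-up "units": a constant 1.0 segment followed by a constant 0.0 segment.
joined = concatenate([[1.0] * 4, [0.0] * 4], overlap=2)
print(joined)
```

The cross-fade avoids the click that an abrupt jump between units would produce.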
![Page 24: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/24.jpg)
Speaker Recognition
Identifying or verifying the identity of a speaker is an application in which computers outperform humans.
Vocal tract parameters can be used as features for speaker recognition.
[Figure, two panels: LPC covariance feature shown as a 10 × 10 matrix for speaker one and for speaker two]
![Page 25: Speech in Multimedia](https://reader036.vdocuments.site/reader036/viewer/2022062500/568158d6550346895dc61e6e/html5/thumbnails/25.jpg)
Applications
Speech recognition
– Call routing
– Directory assistance
– Operator services
– Document input
Speaker recognition
– Personalized service
– Fraud control
Text-to-speech synthesis
– Speech interface
– Document correction
– Voice commands
Speech coding
– Wireless telephone
– Voice over Internet