Speech Recognition LIACS Media Lab Leiden University
Seminar
Speech Recognition
a Short Overview
E.M. Bakker
LIACS Media Lab
Leiden University
Introduction: What is Speech Recognition?

Speech Signal → Speech Recognition → Words (“How are you?”)

Goal: Automatically extract the string of words spoken from the speech signal.

• Other interesting areas:
– Who is the talker (speaker recognition, identification)
– Speech output (speech synthesis)
– What the words mean (speech understanding, semantics)
Recognition Architectures: A Communication Theoretic Approach

Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel

Observable: Message → Words → Sounds → Features

Bayesian formulation for speech recognition:
• P(W|A) = P(A|W) P(W) / P(A)
Objective: minimize the word error rate
Approach: maximize P(W|A) during training
Components:
• P(A|W) : acoustic model (hidden Markov models, mixtures)
• P(W) : language model (statistical, finite state networks, etc.)
The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
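As a toy illustration of this decision rule, the sketch below scores two hypothetical word sequences with made-up acoustic likelihoods and bigram probabilities; all words and numbers are invented for the example, and P(A) is ignored since it is constant over hypotheses:

```python
# Toy Bayesian decoding sketch: choose W maximizing P(A|W) * P(W).
# All probabilities below are made up for illustration.
acoustic = {                     # hypothetical acoustic likelihoods P(A|W)
    ("how", "are", "you"): 1e-4,
    ("how", "art", "thou"): 2e-4,
}
bigram = {                       # hypothetical bigram probabilities
    ("<s>", "how"): 0.1, ("how", "are"): 0.3, ("are", "you"): 0.4,
    ("how", "art"): 1e-5, ("art", "thou"): 1e-4,
}

def lm_prob(words):
    """P(W) under a bigram model: product of P(w_i | w_{i-1})."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= bigram.get((prev, w), 1e-9)  # tiny floor for unseen bigrams
        prev = w
    return p

def decode(hypotheses):
    """argmax over W of P(A|W) P(W); the denominator P(A) cancels."""
    return max(hypotheses, key=lambda w: acoustic[w] * lm_prob(w))

best = decode(list(acoustic))
print(" ".join(best))
```

Even though “how art thou” has the higher acoustic score here, the language model prior pulls the decision to the more probable word sequence.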
Recognition Architectures: Incorporating Multiple Knowledge Sources

Input Speech → Acoustic Front-end → Acoustic Models P(A|W) → Search → Recognized Utterance, with the Language Model P(W) feeding the Search.

• The signal is converted to a sequence of feature vectors based on spectral and temporal measurements.
• Acoustic models represent sub-word units, such as phonemes, as a finite-state machine in which states model spectral structure and transitions model temporal structure.
• Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence.
• The language model predicts the next set of words, and controls which models are hypothesized.
Acoustic Modeling: Feature Extraction

Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Energy + Mel-Spaced Cepstrum; a Time Derivative yields Delta Energy + Delta Cepstrum, and a second Time Derivative yields Delta-Delta Energy + Delta-Delta Cepstrum.

• Incorporate knowledge of the nature of speech sounds in measurement of the features.
• Utilize rudimentary models of human perception.
• Measure features 100 times per second.
• Use a 25 msec window for frequency domain analysis.
• Include absolute energy and 12 spectral measurements.
• Time derivatives to model spectral change.
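A minimal sketch of the framing step these numbers imply, assuming 16 kHz audio, a 25 ms analysis window, and a 10 ms frame shift (100 frames per second); random noise stands in for speech, and a Hamming window plus FFT gives a per-frame magnitude spectrum:

```python
import numpy as np

# Framing sketch: 16 kHz sampling (assumed), 25 ms window, 10 ms shift.
fs = 16000
win_len = int(0.025 * fs)   # 400 samples per window
hop = int(0.010 * fs)       # 160 samples between frames -> ~100 frames/s

signal = np.random.randn(fs)          # one second of fake "speech"
window = np.hamming(win_len)

frames = []
for start in range(0, len(signal) - win_len + 1, hop):
    frame = signal[start:start + win_len] * window
    spectrum = np.abs(np.fft.rfft(frame))   # magnitude spectrum per frame
    frames.append(spectrum)

frames = np.array(frames)
print(frames.shape)   # roughly 100 frames of spectral vectors per second
```

Cepstral analysis, mel weighting, and the time derivatives described above would then be applied to each row of this array.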
Acoustic Modeling: Hidden Markov Models

• Acoustic models encode the temporal evolution of the features (spectrum).
• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
• Phonetic model topologies are simple left-to-right structures.
• Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
• Sharing model parameters is a common strategy to reduce complexity.
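A left-to-right topology like this can be made concrete with the forward algorithm on a toy 3-state model; discrete emissions are used here instead of Gaussian mixtures to keep the sketch short, and all probabilities are illustrative:

```python
import numpy as np

# Toy 3-state left-to-right HMM with self-loops; illustrative values only.
A = np.array([      # transitions: stay in a state or move one state right
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
])
B = np.array([      # emission probabilities for 2 discrete symbols
    [0.9, 0.1],
    [0.5, 0.5],
    [0.1, 0.9],
])
pi = np.array([1.0, 0.0, 0.0])   # always start in the leftmost state

def forward(obs):
    """P(observation sequence | model) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate and weight by emission
    return alpha.sum()

print(forward([0, 0, 1, 1]))   # likelihood of a short observation sequence
```

In a real recognizer, each emission row would be a Gaussian mixture over feature vectors rather than a discrete distribution.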
Acoustic Modeling: Parameter Estimation

• Closed-loop data-driven modeling supervised only from a word-level transcription.
• The expectation-maximization (EM) algorithm is used to improve the parameter estimates.
• Computationally efficient training algorithms (Forward-Backward) have been crucial.
• Batch-mode parameter updates are typically preferred.
• Decision trees are used to optimize parameter-sharing, system complexity, and the use of additional linguistic knowledge.

Typical training recipe:
• Initialization
• Single Gaussian Estimation
• 2-Way Split
• Mixture Distribution Reestimation
• 4-Way Split
• Reestimation
• ...
Language Modeling Is a Lot Like Wheel of Fortune
Language Modeling: N-Grams, the Good, the Bad, and the Ugly

Unigrams (SWB):
• Most Common: “I”, “and”, “the”, “you”, “a”
• Rank-100: “she”, “an”, “going”
• Least Common: “Abraham”, “Alastair”, “Acura”

Bigrams (SWB):
• Most Common: “you know”, “yeah SENT!”, “!SENT um-hum”, “I think”
• Rank-100: “do it”, “that we”, “don’t think”
• Least Common: “raw fish”, “moisture content”, “Reagan Bush”

Trigrams (SWB):
• Most Common: “!SENT um-hum SENT!”, “a lot of”, “I don’t know”
• Rank-100: “it was a”, “you know that”
• Least Common: “you have parents”, “you seen Brooklyn”
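Statistics such as these come from simple counting over a corpus. A minimal sketch with a made-up three-sentence corpus and sentence-boundary tokens:

```python
from collections import Counter

# Tiny made-up corpus; real systems count over millions of words.
corpus = [
    "i think you know",
    "you know i think",
    "i think i know",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

# Maximum-likelihood bigram probability P(w2 | w1) = C(w1 w2) / C(w1).
def p_bigram(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigrams.most_common(2))
print(p_bigram("i", "think"))
```

Real language models add smoothing so that unseen n-grams (the “ugly” tail) do not get zero probability.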
Language Modeling: Integration of Natural Language

• Natural language constraints can be easily incorporated.
• Lack of punctuation and search-space size pose problems.
• Speech recognition typically produces a word-level time-aligned annotation.
• Time alignments for other levels of information are also available.
Implementation Issues: Search Is Resource Intensive

• Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
• Large speech databases are required (several hundred hours of speech).
• Tying, smoothing, and interpolation are required.

Memory footprint (megabytes): Feature Extraction (1M), Acoustic Modeling (10M), Language Modeling (30M), Search (150M).

Percentage of CPU: Feature Extraction 1%, Language Modeling 15%, Search 25%, Acoustic Modeling 59%.
Implementation Issues: Dynamic Programming-Based Search

• Dynamic programming is used to find the most probable path through the network.
• Beam search is used to control resources.
• Search is time-synchronous and left-to-right.
• Arbitrary amounts of silence must be permitted between each word.
• Words are hypothesized many times with different start/stop times, which significantly increases search complexity.
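A sketch of time-synchronous Viterbi decoding with beam pruning on a toy 3-state model; the transition and emission values are invented for the example, and scores are kept as log-probabilities as real decoders do:

```python
import math

# Toy 3-state left-to-right model in the log domain; values illustrative.
logA = [[math.log(p) if p > 0 else -math.inf for p in row] for row in
        [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]]
logB = [[math.log(p) for p in row] for row in
        [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]]

def viterbi_beam(obs, beam=10.0):
    """Best state path; prune hypotheses > `beam` below the frame best."""
    scores = [logB[0][obs[0]], -math.inf, -math.inf]  # start in state 0
    back = []
    for o in obs[1:]:
        new, ptr = [-math.inf] * 3, [0] * 3
        for j in range(3):
            for i in range(3):
                s = scores[i] + logA[i][j] + logB[j][o]
                if s > new[j]:
                    new[j], ptr[j] = s, i
        best = max(new)
        # Beam pruning: drop paths far below the best frame score.
        scores = [s if s >= best - beam else -math.inf for s in new]
        back.append(ptr)
    # Trace back the best path from the best final state.
    state = max(range(3), key=lambda j: scores[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

print(viterbi_beam([0, 0, 1, 1, 1]))
```

A full decoder applies the same recurrence over networks of word and phone models rather than a single HMM, which is where the resource demands arise.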
Implementation Issues: Cross-Word Decoding Is Expensive

• Cross-word decoding: since word boundaries don’t occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
• Cross-word decoding significantly increases memory requirements.
Applications: Conversational Speech

• Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
• WER (Word Error Rate) has decreased from 100% to 30% in six years.

Typical phenomena: laughter, singing, unintelligible speech, spoonerisms, background speech, missing pauses, restarts, vocalized noise, coinages.
Applications: Audio Indexing of Broadcast News

Broadcast news offers some unique challenges:
• Lexicon: important information in infrequently occurring words
• Acoustic Modeling: variations in channel, particularly within the same segment (“in the studio” vs. “on location”)
• Language Model: must adapt (“Bush,” “Clinton,” “Bush,” “McCain,” “???”)
• Language: multilingual systems? language-independent acoustic modeling?
Applications: Real-Time Translation

• From President Clinton’s State of the Union address (January 27, 2000):

“These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today’s fastest supercomputers.”

• Imagine a world where:
– You book a travel reservation from your cellular phone while driving in your car, without ever talking to a human (database query)
– You converse with someone in a foreign country and neither speaker speaks a common language (universal translator)
– You place a call to your bank to inquire about your bank account and never have to remember a password (transparent telephony)
– You can ask questions by voice and your Internet browser returns answers to your questions (intelligent query)

• Human Language Engineering: a sophisticated integration of many speech and language related technologies... a science for the next millennium.
Speech Recognition
Erwin M. Bakker, Leiden University
THE SPEECH RECOGNITION PROBLEM

• Boundaries between words or phonemes are hard to detect
• Large variations in speaking rates
• In fluent speech, words and word endings are less pronounced
• Great deal of inter- as well as intra-speaker variability
• Quality of the speech signal varies
• Task-inherent syntactic-semantic constraints should be exploited
STATISTICAL METHODS IN SPEECH RECOGNITION
• The Bayesian Approach
• Acoustic Models
• Language Models
Acoustic Models (HMM)

Some typical HMM topologies used for acoustic modeling in large vocabulary speech recognition: (a) typical triphone, (b) short pause, (c) silence. The shaded states denote the start and stop states for each model.
SEARCH ALGORITHMS
• The Complexity of Search.
• Typical Search Algorithms
– Viterbi Search
– Stack Decoders
– Multi-Pass Search
– Forward-Backward Search
Hierarchical representation of the search space.
Simple overview of the stack decoding algorithm.
Complexity of Search
• lexicon: contains all the words in the system’s vocabulary along with their pronunciations (often there are multiple pronunciations per word)
• acoustic models: HMMs that represent the basic sound units the system is capable of recognizing
• language model: determines the possible word sequences allowed by the system (encodes knowledge of the syntax and semantics of the language)
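These three knowledge sources can be sketched as toy data structures; the words, pronunciations, and allowed word pairs below are illustrative only, not taken from any real system:

```python
# Toy knowledge sources; all entries invented for illustration.
lexicon = {                                # word -> list of pronunciations
    "the": [["dh", "ah"], ["dh", "iy"]],   # multiple pronunciations per word
    "cat": [["k", "ae", "t"]],
    "sat": [["s", "ae", "t"]],
}
acoustic_models = {"dh", "ah", "iy", "k", "ae", "t", "s"}  # one HMM per phone
language_model = {("the", "cat"), ("cat", "sat")}          # allowed bigrams

def phone_sequence(words):
    """Expand a word sequence (first pronunciation of each word) into the
    phone HMMs the decoder must score, after checking the LM allows it."""
    assert all(pair in language_model for pair in zip(words, words[1:]))
    phones = [p for w in words for p in lexicon[w][0]]
    assert all(p in acoustic_models for p in phones)
    return phones

print(phone_sequence(["the", "cat", "sat"]))
```

The complexity of search stems from expanding every allowed word sequence, under every pronunciation, into such phone-level model sequences.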
References
• Neeraj Deshmukh, Aravind Ganapathiraju, and Joseph Picone, “Hierarchical Search for Large Vocabulary Conversational Speech Recognition,” IEEE Signal Processing Magazine, September 1999.
• H. Ney and S. Ortmanns, “Dynamic Programming Search for Continuous Speech Recognition,” IEEE Signal Processing Magazine, September 1999.
• V. Zue, “Talking with Your Computer,” Scientific American, August 1999.
Relative complexity of the search problem for large vocabulary conversational speech recognition.
A TIME-SYNCHRONOUS VITERBI-BASED DECODER
• Complexity of Search
– Network Decoding
– N-Gram Decoding
– Cross-Word Acoustic Models
• Search Space Organization
– Lexical Trees
– Language Model Lookahead
– Acoustic Evaluation
Network decoding using word-internal context-dependent models:
• The word network providing linguistic constraints
• The pronunciation lexicon for the words involved
• The network expanded using the corresponding word-internal triphones derived from the pronunciations of the words
A TIME-SYNCHRONOUS VITERBI-BASED DECODER
• Search Space Reduction
– Pruning:
  setting pruning beams based on the hypothesis score;
  limiting the total number of model instances active at a given time;
  setting an upper bound on the number of words allowed to end at a given frame
– Path Merging
– Word Graph Compaction
A TIME-SYNCHRONOUS VITERBI-BASED DECODER
• System Architecture
• PERFORMANCE ANALYSIS
– A substitution error refers to the case where the decoder misrecognizes a word in the reference sequence as another in the hypothesis.
– A deletion error occurs when there is no word recognized corresponding to a word in the reference transcription.
– An insertion error corresponds to the case where the hypothesis contains an extra word that has no counterpart in the reference.
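These three error types are obtained from a dynamic-programming alignment of the hypothesis against the reference; word error rate (WER) is the total number of edits divided by the number of reference words. A minimal sketch:

```python
# Standard edit-distance alignment at the word level.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edits turning the first i ref words into first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution
            d[i][j] = min(sub,
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[-1][-1] / len(ref)

print(word_error_rate("how are you today", "how you do today"))
```

Production scoring tools additionally report the substitution, deletion, and insertion counts separately by tracing back through the alignment table.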
A TIME-SYNCHRONOUS VITERBI-BASED DECODER
• Scalability: Can the algorithm scale gracefully from small constrained tasks to large unconstrained tasks?
• Recognition accuracy: How accurate is the best word sequence found by the system?
• Word graph accuracy: Can the system generate alternate choices that contain the correct word sequence? How large must this list of choices be?
• Memory: What memory is required to achieve optimal performance? How does performance vary with the amount of memory required?
• Run-time: How many seconds of CPU time per second of speech are required (xRT) to achieve optimal performance? How does run-time vary with performance (run-time should decrease significantly as error rates increase)?
A TIME-SYNCHRONOUS VITERBI-BASED DECODER
• Alphadigits
• Switchboard
• Beam Pruning
• MAPMI Pruning
Comparisons performed on a 333 MHz Pentium II processor with 512 MB RAM.
Introduction: Speech in the Information Age

• Speech & text were revolutionary because of information access
• New media and connectivity yield information overload
• Can speech technology help?

Over time, sources of information have grown from speech and text to film, video, multimedia, voice mail, radio, television, conferences, the web, and on-line resources. Access to information has evolved accordingly: listen and remember, read books, computer typing, careful spoken or written input, and finally conversational language.
Conclusion and Future Directions: Trends

We need new technology to help with information overload.
• Speech information sources are everywhere
– Voice mail messages
– Professional talk
– Lectures, broadcasts
• Speech sources of information will increase
– As devices shrink
– As mobility increases
– New uses: annotation, documentation

Speech as Access: What are the words? Speech as Source: What does it mean? Information as Partner: Here’s what you need.
Conclusion and Future Directions: Applications on the Horizon

Beginnings of speech as a source of information:
• ISLIP http://www.mediasite.net/info/frames.htm
• Virage http://www.virage.com

Speech technology in education and training:
• Cliff Stoll, High Tech Heretic: why it doesn’t belong in the classroom
– Good schools need no computers
– Bad schools won’t be improved by them
• Beulah Arnott: also true of indoor plumbing
• BravoBrava: Co-evolving technology and people can
– Dramatically reduce the cost of delivery of content
– Increase its timeliness, quality and appropriateness
– Target needs of individual and/or group
– Reading Pal demo
OVERLAP IN THE CEPSTRAL SPACE (ALPHADIGITS)
The following plots demonstrate overlap of recognition features in the cepstral space. These plots consist of the vowels "aa" (as in "lock") and "iy" (as in "beat") excised from tokens in the OGI Alphadigit speech corpus.
In these plots, the first two cepstral coefficients are shown (c[1] and c[2]; energy, which is c[0], is not shown). Comparisons are provided as a function of the vowel spoken and the gender of the speaker:
• Vowel Comparison: a comparison of male "aa" to male "iy"
• Vowel Comparison: a comparison of female "aa" to female "iy"
• Vowel Comparison: a combined plot of the above conditions
• Gender Comparisons: a comparison of males and females for the vowels "aa" and "iy"
• Combined Comparisons: a comparison of "aa" to "iy" for both genders
The Alphadigits vowel data used to generate these plots is available for classification experiments.
OVERLAP IN THE CEPSTRAL SPACE (SWB)
The following plots demonstrate overlap of recognition features in the cepstral space. These plots consist of the vowels "aa" (as in "lock") and "iy" (as in "beat") excised from tokens in the SWITCHBOARD conversational speech corpus.
In these plots, the first two cepstral coefficients are shown (c[1] and c[2]; energy, which is c[0], is not shown). Comparisons are provided as a function of the vowel spoken and the gender of the speaker:
• Vowel Comparison: a comparison of male "aa" to male "iy"
• Vowel Comparison: a comparison of female "aa" to female "iy"
• Vowel Comparison: a combined plot of the above conditions
• Gender Comparisons: a comparison of males and females for the vowels "aa" and "iy"
• Combined Comparisons: a comparison of "aa" to "iy" for both genders
The Switchboard vowel data used to generate these plots is available for classification experiments.