

    Speech Recognition


    Definition

    Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words.

    The recognised words can be an end in themselves, as for applications such as command and control, data entry, and document preparation.

    They can also serve as the input to further linguistic processing in order to achieve speech understanding.


    Speech Processing

    Signal processing: Convert the audio wave into a sequence of feature vectors

    Speech recognition: Decode the sequence of feature vectors into a sequence of words

    Semantic interpretation: Determine the meaning of the recognized words

    Dialog management: Correct errors and help get the task done

    Response generation: What words to use to maximize user understanding

    Speech synthesis (Text to Speech): Generate synthetic speech from a marked-up word string
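
    Taken together, these stages form a processing chain. Below is a minimal, hypothetical sketch of that chain in Python; the function names and stubbed return values are illustrative assumptions, not part of the original slides.

    import numpy as np

    def signal_processing(audio):
        # Convert the audio wave into a sequence of feature vectors
        # (stub: one 13-dimensional vector per 10 ms frame at 16 kHz).
        return np.zeros((len(audio) // 160, 13))

    def speech_recognition(features):
        # Decode the feature vectors into a sequence of words (stub).
        return ["open", "curtains"]

    def semantic_interpretation(words):
        # Determine the meaning of the recognised words (stub).
        return {"action": "open", "object": "curtains"}

    def dialog_management(meaning):
        # Correct errors and help get the task done (stub: pass through).
        return meaning

    def response_generation(meaning):
        # Choose words that maximise user understanding.
        return "Opening the curtains."

    audio = np.zeros(16000)   # one second of silent stand-in audio
    meaning = dialog_management(
        semantic_interpretation(speech_recognition(signal_processing(audio))))
    print(response_generation(meaning))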


    Dialog Management

    Goal: determine what to accomplish in response to user utterances, e.g.:

    Answer user question

    Solicit further information

    Confirm/clarify user utterance

    Notify invalid query

    Notify invalid query and suggest alternative

    Interface between user/language processing components and the system knowledge base


    What you can do with Speech Recognition

    Transcription

    dictation, information retrieval

    Command and control: data entry, device control, navigation, call routing

    Information access

    airline schedules, stock quotes, directory assistance

    Problem solving

    travel planning, logistics


    Transcription and Dictation

    Transcription is transforming a stream of human speech into computer-readable form

    Medical reports, court proceedings, notes

    Indexing (e.g., broadcasts)

    Dictation is the interactive composition of text

    Reports, correspondence, etc.


    Speech recognition and understanding

    Sphinx system

    speaker-independent

    continuous speech

    large vocabulary

    ATIS system

    air travel information retrieval

    context management


    Speech Recognition and Call Centres

    Automate services, lower payroll

    Shorten time on hold

    Shorten agent and client call time

    Reduce fraud

    Improve customer service


    Applications related to Speech Recognition

    Speech Recognition

    Figure out what a person is saying.

    Speaker Verification

    Authenticate that a person is who she/he claims to be.

    Limited speech patterns

    Speaker Identification

    Assigns an identity to the voice of an unknown person.

    Arbitrary speech patterns


    Many Kinds of Speech Recognition Systems

    Speech recognition systems can be characterised by many parameters.

    An isolated-word (discrete) speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not.


    Spontaneous vs. Scripted

    Spontaneous speech contains disfluencies, periods of pause and restart, and is much more difficult to recognise than speech read from a script.


    Enrolment

    Some systems require speaker enrolment: a user must provide samples of his or her speech before using the system. Other systems are said to be speaker-independent, in that no enrolment is necessary.


    Large vs. Small Vocabularies

    Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large, with many similar-sounding words.

    When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.

    The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly.
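
    To make the finite-state idea concrete, here is a minimal sketch with a toy vocabulary (the words and network are my own illustrative assumptions): each entry lists the words permitted to follow, and a sentence is accepted only if every transition exists in the network.

    # Toy word network: each word lists the words permitted to follow it.
    network = {
        "<s>": ["show", "list"],
        "show": ["flights", "fares"],
        "list": ["flights"],
        "flights": ["to", "</s>"],
        "fares": ["</s>"],
        "to": ["boston", "denver"],
        "boston": ["</s>"],
        "denver": ["</s>"],
    }

    def accepts(words):
        # A sentence is permissible only if every word-to-word transition
        # exists in the network, from start (<s>) to end (</s>).
        state = "<s>"
        for w in words:
            if w not in network.get(state, []):
                return False
            state = w
        return "</s>" in network.get(state, [])

    print(accepts("show flights to boston".split()))   # True
    print(accepts("show to flights".split()))          # False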


    Perplexity

    One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity.

    Perplexity is loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (Zue, Cole, and Ward, 1995).
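
    In the standard formulation (a common definition consistent with this description, though the formula itself is not on the slide), the perplexity of a language model P over a test sequence of N words is the inverse probability normalised by length:

    \mathrm{PP}(w_1 \ldots w_N) = P(w_1 w_2 \cdots w_N)^{-1/N}

    Intuitively, a perplexity of, say, 60 means the recogniser faces on average the same uncertainty as choosing among 60 equally likely words at each step; lower perplexity generally indicates an easier recognition task.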


    Finally, some external parameters can affect speech recognition system performance. These include the characteristics of the environmental noise and the type and placement of the microphone.


    Properties of Recognizers: Summary

    Speaker independent vs. speaker dependent

    Large vocabulary (2K-200K words) vs. limited vocabulary (2-200 words)

    Continuous vs. discrete

    Speech recognition vs. speech verification

    Real time vs. multiples of real time


    Continued

    Spontaneous speech vs. read speech

    Noisy environment vs. quiet environment

    High-resolution microphone vs. telephone vs. cellphone

    Push-and-hold vs. push-to-talk vs. always-listening

    Adapt to speaker vs. non-adaptive

    Low vs. high latency

    Online incremental results vs. final results

    Dialog management


    Features That Distinguish Products & Applications

    Words, phrases, and grammar

    Models of the speakers

    Speech flow

    Vocabulary: How many words

    How you add new words

    Grammars

    Branching factor (perplexity)

    Available languages


    Systems are also defined by Users

    Different Kinds of Users

    One time vs. Frequent users

    Homogeneity

    Technically sophisticated

    Different users need different speaker models


    Speaker Models

    Speaker Dependent

    Speaker Independent

    Speaker Adaptive


    Sample Market: Call Centers

    Automate services, lower payroll

    Shorten time on hold

    Shorten agent and client call time

    Reduce fraud

    Improve customer service


    A TIMELINE OF SPEECH RECOGNITION

    1870s: Alexander Graham Bell invented the telephone while trying to develop a speech recognition system for deaf people.

    1936: AT&T's Bell Labs produced the first electronic speech synthesizer, called the Voder (Dudley, Riesz and Watkins). The machine was demonstrated at the 1939 World's Fair by experts who used a keyboard and foot pedals to play it and emit speech.

    1969: John Pierce of Bell Labs said automatic speech recognition would not be a reality for several decades because it requires artificial intelligence.


    Early 70s

    Early 1970s: The Hidden Markov Modeling (HMM) approach to speech recognition was invented by Lenny Baum of Princeton University and shared with several ARPA (Advanced Research Projects Agency) contractors, including IBM.

    HMM is a complex mathematical pattern-matching strategy that was eventually adopted by all the leading speech recognition companies, including Dragon Systems, IBM, Philips, AT&T and others.


    70+

    1971: DARPA (Defense Advanced Research Projects Agency) established the Speech Understanding Research (SUR) program to develop a computer system that could understand continuous speech.

    Lawrence Roberts, who initiated the program, spent $3 million per year of government funds for 5 years. Major SUR project groups were established at CMU, SRI, MIT's Lincoln Laboratory, Systems Development Corporation (SDC), and Bolt, Beranek, and Newman (BBN). It was the largest speech recognition project ever.

    1978: The popular toy "Speak and Spell" by Texas Instruments was introduced. Speak and Spell used a speech chip, which led to huge strides in the development of more human-like digital synthesis sound.


    80+

    1982: Covox founded. The company brought digital sound (via the Voice Master, Sound Master and The Speech Thing) to the Commodore 64, the Atari 400/800, and finally the IBM PC in the mid-80s.

    1982: Dragon Systems was founded by speech industry pioneers Drs. Jim and Janet Baker. Dragon Systems is well known for its long history of speech and language technology innovations and its large patent portfolio.

    1984: SpeechWorks, the leading provider of over-the-telephone automated speech recognition (ASR) solutions, was founded.


    90s

    1993: Covox sold its products to Creative Labs, Inc.

    1995: Dragon released discrete-word, dictation-level speech recognition software. It was the first time dictation speech recognition technology was available to consumers. IBM and Kurzweil followed a few months later.

    1996: Charles Schwab became the first company to devote resources to developing a speech recognition IVR system, with Nuance. The program, Voice Broker, allows up to 360 simultaneous customers to call in and get quotes on stocks and options; it handles up to 50,000 requests each day. The system was found to be 95% accurate and set the stage for other companies, such as Sears, Roebuck and Co., United Parcel Service of America Inc., and E*Trade Securities, to follow in their footsteps.

    1996: BellSouth launched the world's first voice portal, called Val and later Info By Voice.


    95+

    1997: Dragon introduced "Naturally Speaking", the first "continuous speech" dictation software available (meaning you no longer needed to pause between words for the computer to understand what you were saying).

    1998: Lernout & Hauspie bought Kurzweil. Microsoft invested $45 million in Lernout & Hauspie to form a partnership that would eventually allow Microsoft to use their speech recognition technology in its systems.

    1999: Microsoft acquired Entropic, giving Microsoft access to what was known as the "most accurate speech recognition system".


    2000

    2000: Lernout & Hauspie acquired Dragon Systems for approximately $460 million.

    2000: TellMe introduced the first worldwide voice portal.

    2000: NetBytel launched the world's first voice enabler, which included an online ordering application with real-time Internet integration for Office Depot.


    2000s

    2001: ScanSoft closed its acquisition of the Lernout & Hauspie speech and language assets.

    2003: ScanSoft shipped Dragon NaturallySpeaking 7 Medical, lowering healthcare costs through highly accurate speech recognition.

    2003: ScanSoft closed a deal to distribute and support IBM ViaVoice desktop products.


    Signal Variability

    Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal.

    The acoustic realisations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear.

    These phonetic variations are exemplified by the acoustic differences of the phoneme /t/ in two, true, and butter in English.

    At word boundaries, contextual variations can be quite dramatic: devo andare sounds like devandare in Italian.


    More

    Acoustic variability can result from changes in the environment as well as in the position and characteristics of the transducer.

    Within-speaker variability can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality.

    Differences in socio-linguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variability.


    What is a speech recognition system?

    Speech recognition is generally used as a human-computer interface for other software. When it functions in this role, three primary tasks need to be performed:

    Pre-processing, the conversion of spoken input into a form the recogniser can process.

    Recognition, the identification of what has been said.

    Communication, sending the recognised input to the application that requested it.


    How is pre-processing performed?

    To understand how the first of these functions is performed, we must examine:

    Articulation, the production of the sound.

    Acoustics, the stream of speech itself.

    Auditory perception, what characterises the ability to understand spoken input.


    Articulation

    The science of articulation is concerned with how phonemes are produced. The focus of articulation is on the vocal apparatus of the throat, mouth and nose, where the sounds are produced.

    The phonemes themselves need to be classified; the system most often used in speech recognition is the ARPABET (Rabiner and Juang, 1993). The ARPABET was created in the 1970s by and for contractors working on speech processing for the Advanced Research Projects Agency of the U.S. Department of Defense.


    ARPABET

    Like most phoneme classifications, the ARPABET separates consonants from vowels.

    Consonants are characterised by a total or partial blockage of the vocal tract.

    Vowels are characterised by strong harmonic patterns and relatively free passage of air through the vocal tract.

    Semi-vowels, such as the y in you, fall between consonants and vowels.


    Consonant Classification

    Consonant classification uses the:

    Point of articulation.

    Manner of articulation.

    Presence or absence of voicing.


    Acoustics

    Articulation provides valuable information about how speech sounds are produced, but a speech recognition system cannot analyse movements of the mouth.

    Instead, the data source for speech recognition is the stream of speech itself.

    This is an analogue signal: a sound stream, a continuous flow of sound waves and silence.


    Important Features (Acoustics)

    Four important features of the acoustic analysis of speech are (Carter, 1984):

    Frequency, the number of vibrations per second a sound produces.

    Amplitude, the loudness of the sound.

    Harmonic structure: added to the fundamental frequency of a sound are other frequencies that contribute to its quality or timbre.

    Resonance.
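
    As a small illustration of measuring the first two of these features (the synthetic signal and its values below are assumptions, not from the slides), numpy's FFT can pick out the dominant frequency and peak amplitude of a tone:

    import numpy as np

    sample_rate = 16000                        # samples per second
    t = np.arange(0, 0.1, 1.0 / sample_rate)   # 100 ms of signal

    # A 120 Hz fundamental plus a weaker harmonic at 240 Hz.
    signal = 1.0 * np.sin(2 * np.pi * 120 * t) + 0.4 * np.sin(2 * np.pi * 240 * t)

    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sample_rate)

    print("dominant frequency: %.0f Hz" % freqs[np.argmax(spectrum)])   # ~120 Hz
    print("peak amplitude: %.2f" % np.max(np.abs(signal)))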


    Auditory perception: hearing speech

    "Phonemes tend to be abstractions that are implicitly defined by the pronunciation of the words in the language. In particular, the acoustic realisation of a phoneme may heavily depend on the acoustic context in which it occurs. This effect is usually called co-articulation" (Ney, 1994).

    The way a phoneme is pronounced can be affected by its position in a word, neighbouring phonemes, and even the word's position in a sentence. This effect is called the co-articulation effect.

    The variability in the speech signal caused by co-articulation and other sources makes speech analysis very difficult.


    Human Hearing

    The human ear can detect frequencies from 20 Hz to 20,000 Hz, but it is most sensitive in the critical frequency range, 1000 Hz to 6000 Hz (Ghitza, 1994).

    Recent research has uncovered the fact that humans do not process individual frequencies.

    Instead, we hear groups of frequencies, such as formant patterns, as cohesive units, and we are capable of distinguishing them from surrounding sound patterns (Carrell and Opie, 1992).

    This capability, called auditory object formation, or auditory image formation, helps explain how humans can discern the speech of individual people at cocktail parties and separate a voice from noise over a poor telephone channel (Markowitz, 1995).


    Pre-processing Speech

    Like all sounds, speech is an analogue waveform. In order for a recognition system to act on speech, it must be represented in a digital manner.

    All noise patterns, silences and co-articulation effects must be captured.

    This is accomplished by digital signal processing. The way the analogue speech is processed is one of the most complex elements of a speech recognition system.
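
    A minimal sketch of the digitisation step (the tone, rate and bit depth are illustrative assumptions): an "analogue" waveform is sampled at a fixed rate and each sample quantised to a 16-bit integer, as a sound card's analogue-to-digital converter would do.

    import numpy as np

    sample_rate = 8000                          # telephone-quality sampling rate
    t = np.arange(0, 0.02, 1.0 / sample_rate)   # 20 ms of "analogue" time axis
    analogue = np.sin(2 * np.pi * 440 * t)      # a 440 Hz tone stands in for speech

    # Quantise each sample to a 16-bit signed integer, the digital form.
    digital = np.round(analogue * 32767).astype(np.int16)

    print(digital[:8])   # the first few digitised samples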


    Recognition Accuracy

    To achieve high recognition accuracy, the speech representation process should (Markowitz, 1995):

    Include all critical data.

    Remove Redundancies.

    Remove Noise and Distortion.

    Avoid introducing new distortions.


    Signal Representation

    In statistically based automatic speech recognition, the speech waveform is sampled at a rate between 6.6 kHz and 20 kHz and processed to produce a new representation as a sequence of vectors containing values of what are generally called parameters.

    The vectors typically comprise between 10 and 20 parameters, and are usually computed every 10 or 20 milliseconds.
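
    The framing arithmetic implied by these figures can be sketched as follows (the 16 kHz rate, 25 ms window and 13 log-energy "parameters" are illustrative assumptions; a real front end would compute, e.g., cepstral parameters):

    import numpy as np

    sample_rate = 16000                        # Hz, within the 6.6-20 kHz range
    frame_step = int(0.010 * sample_rate)      # a new vector every 10 ms
    frame_len = int(0.025 * sample_rate)       # 25 ms analysis window (assumed)
    n_params = 13                              # within the 10-20 parameter range

    waveform = np.random.randn(sample_rate)    # 1 second of stand-in audio

    vectors = []
    for start in range(0, len(waveform) - frame_len + 1, frame_step):
        frame = waveform[start:start + frame_len]
        # Placeholder "parameters": log energies of 13 equal spectrum bands.
        spectrum = np.abs(np.fft.rfft(frame))
        bands = np.array_split(spectrum, n_params)
        vectors.append(np.log([b.sum() + 1e-10 for b in bands]))

    print(len(vectors), "vectors of", n_params, "parameters each")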


    Parameter Values

    These parameter values are then used in succeeding stages in the estimation of the probability that the portion of waveform just analysed corresponds to a particular phonetic event in the phone-sized or whole-word reference unit being hypothesised.

    In practice, the representation and the probability estimation interact strongly: what one person sees as part of the representation, another may see as part of the probability estimation process.


    Emotional State

    Representations aim to preserve the information needed to determine the phonetic identity of a portion of speech while being as impervious as possible to factors such as speaker differences, effects introduced by communications channels, and paralinguistic factors such as the emotional state of the speaker.

    They also aim to be as compact as possible.


    Representations used in current speech recognisers concentrate primarily on properties of the speech signal attributable to the shape of the vocal tract, rather than to the excitation, whether generated by a vocal-tract constriction or by the larynx.

    Representations are sensitive to whether the vocal folds are vibrating or not (the voiced/unvoiced distinction), but try to ignore effects due to variations in their frequency of vibration.



    Future Improvements in Speech Representation

    The vast majority of major commercial and experimental systems use representations akin to those described here.

    However, in striving to develop better representations, wavelet transforms (Daubechies, 1990) are being explored, and neural network methods are being used to provide non-linear operations on log spectral representations.


    Work continues on representations more closely reflecting auditory properties (Greenberg, 1988) and on representations reconstructing articulatory gestures from the speech signal (Schroeter & Sondhi, 1994).

    The articulatory approach is attractive because it holds out the promise of a small set of smoothly varying parameters that could deal in a simple and principled way with the interactions that occur between neighbouring phonemes, and with the effects of differences in speaking rate and carefulness of enunciation.


    The ultimate challenge is to match the superior performance of human listeners over automatic recognisers.

    This superiority is especially marked when there is little material to allow adaptation to the voice of the current speaker, and when the acoustic conditions are difficult.

    The fact that it persists even when nonsense words are used shows that it exists at least partly at the acoustic/phonetic level and cannot be explained purely by superior language modelling in the brain.

    It confirms that there is still much to be done in developing better representations of the speech signal (Rabiner and Schafer, 1978; Hunt, 1993).


    Signal Recognition Technologies

    Signal recognition methodologies fall into four categories; most systems will apply one or more in the conversion process.


    Template Matching

    Template matching is the oldest and least effective method. It is a form of pattern recognition.

    It was the dominant technology in the 1950s and 1960s.

    Each word or phrase in an application is stored as a template.

    The user input is also arranged into templates at the word level, and the best match with a system template is found.

    Although template matching is currently in decline as the basic approach to recognition, it has been adapted for use in word-spotting applications. It also remains the primary technology applied to speaker verification (Moore, 1982).
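
    A minimal sketch of the template idea follows (toy data throughout; dynamic time warping is the classic way to align a spoken input with stored templates despite speaking-rate differences, though the slides do not name a specific alignment method):

    import numpy as np

    def dtw_distance(a, b):
        # Dynamic time warping: align two feature-vector sequences that may
        # differ in speaking rate, returning the total alignment cost.
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
                cost[i, j] = d + min(cost[i - 1, j],       # stretch input
                                     cost[i, j - 1],       # stretch template
                                     cost[i - 1, j - 1])   # step together
        return cost[n, m]

    def recognise(utterance, templates):
        # Pick the stored word template with the smallest warped distance.
        return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

    # Toy templates: 1-D "feature" sequences standing in for real vectors.
    templates = {"yes": np.array([[1.0], [2.0], [3.0]]),
                 "no":  np.array([[3.0], [1.0], [1.0]])}
    print(recognise(np.array([[1.1], [2.2], [2.9], [3.0]]), templates))   # yes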


    Acoustic-Phonetic Recognition

    Acoustic-phonetic recognition functions at the phoneme level. It is an attractive approach to speech recognition, as it limits the number of representations that must be stored: in English there are about forty discernible phonemes, no matter how large the vocabulary (Markowitz, 1995).

    Acoustic-phonetic recognition involves three steps:

    Feature extraction.

    Segmentation and labelling.

    Word-level recognition.


    Acoustic-phonetic recognition supplanted template matching in the early 1970s.

    The successful ARPA SUR systems highlighted the potential benefits of this approach. Unfortunately, acoustic phonetics was at the time a poorly researched area, and many of the expected advances failed to materialise.


    The high degree of acoustic similarity among phonemes, combined with phoneme variability resulting from the co-articulation effect and other sources, creates uncertainty with regard to potential phoneme labels (Cole, 1986).

    If these problems can be overcome, there is certainly an opportunity for this technology to play a part in future speech recognition systems.


    Stochastic Processing

    The term stochastic refers to the process of making a sequence of non-deterministic selections from among a set of alternatives.

    The selections are non-deterministic because the choices during the recognition process are governed by the characteristics of the input and not specified in advance (Markowitz, 1995).

    Like template matching, stochastic processing requires the creation and storage of models of each of the items that will be recognised.

    It is based on a series of complex statistical or probabilistic analyses. These statistics are stored in a network-like structure called a Hidden Markov Model (HMM) (Paul, 1990).


    HMM

    A Hidden Markov Model is made up of states and transitions. Each state of an HMM holds statistics for a segment of a word, describing the values and variations found in the model of that word segment. The transitions allow for speech variations such as:

    The prolonging of a word segment, which causes several recursive transitions in the recogniser.

    The omission of a word segment, which causes a transition that skips a state.

    Stochastic processing using Hidden Markov Models is accurate, flexible, and capable of being fully automated (Rabiner and Juang, 1986).
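
    Since the slide's diagram is not reproduced here, the following minimal sketch (all probabilities invented for illustration) shows a three-state, left-to-right word HMM: self-loop transitions model a prolonged segment, the state-skipping transition models an omitted one, and Viterbi decoding finds the best state path for an observation sequence.

    import numpy as np

    # Three states, one per word segment. trans[i, j] = P(next = j | now = i):
    # self-loops prolong a segment, the 0 -> 2 entry skips (omits) a segment.
    trans = np.array([[0.5, 0.4, 0.1],
                      [0.0, 0.5, 0.5],
                      [0.0, 0.0, 1.0]])
    # emit[i, o] = P(observing symbol o | state i), over a toy 2-symbol alphabet.
    emit = np.array([[0.9, 0.1],
                     [0.2, 0.8],
                     [0.7, 0.3]])

    def viterbi(obs):
        # Log-domain Viterbi decoding: returns the most likely state path.
        lt, le = np.log(trans + 1e-12), np.log(emit + 1e-12)
        score = le[:, obs[0]] + np.log([1.0, 1e-12, 1e-12])   # start in state 0
        back = []
        for o in obs[1:]:
            cand = score[:, None] + lt          # score of every prev -> next move
            back.append(cand.argmax(axis=0))    # best predecessor per state
            score = cand.max(axis=0) + le[:, o]
        path = [int(score.argmax())]
        for b in reversed(back):
            path.append(int(b[path[-1]]))
        return path[::-1]

    print(viterbi([0, 0, 1, 1, 0]))   # -> [0, 0, 1, 1, 2]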


    Neural networks

    "if speech recognition systems could learn speechknowledge automatically and represent this knowledgein a parallel distributed fashion for rapid evaluation such a system would mimic the function of the humanbrain, which consists of several billion simple, inaccurate

    and slow processors that perform reliable speechprocessing", (Waibel and Hampshire, 1989).

    An artificial neural network is a computer program, whichattempt to emulate the biological functions of the Human

    brain. They are an excellent classification systems, andhave been effective with noisy, patterned, variable datastreams containing multiple, overlapping, interacting andincomplete cues, (Markowitz, 1995).


    Neural networks do not require the complete specification of a problem, learning instead through exposure to large amounts of example data. Neural networks comprise an input layer, one or more hidden layers, and one output layer. The way in which the nodes and layers of a network are organised is called the network's architecture.

    The allure of neural networks for speech recognition lies in their superior classification abilities.

    Considerable effort has been directed towards the development of networks to do word, syllable and phoneme classification.
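
    A minimal sketch of the architecture just described (the layer sizes, random weights and 13-feature input are illustrative assumptions; no training loop is shown): an input layer feeding one hidden layer and an output layer that scores a frame's feature vector against a set of phoneme classes.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_classes = 13, 32, 40   # e.g. 13 features, ~40 phonemes

    # Random, untrained weights for the connections between the layers.
    W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, n_classes))
    b2 = np.zeros(n_classes)

    def classify(feature_vector):
        hidden = np.tanh(feature_vector @ W1 + b1)   # hidden-layer activations
        scores = hidden @ W2 + b2                    # one score per phoneme class
        return int(np.argmax(scores))                # index of best-scoring class

    print(classify(rng.normal(size=n_in)))           # predicted class index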


    Auditory Models

    The aim of auditory models is to allow a speech recognition system to screen all noise from the signal and concentrate on the central speech pattern, in a similar way to the human brain.

    Auditory modelling offers the promise of being able to develop robust speech recognition systems that are capable of working in difficult environments.

    Currently, it is purely an experimental technology.



    Performance of Speech Recognition Systems

    Performance of speech recognition systems is typically described in terms of word error rate: the number of substitution, deletion, and insertion errors divided by the number of words in the reference input. Three kinds of error contribute:

    Deletion: the loss of a word within the original speech. The system outputs "A E I U" while the input was "A E I O U".

    Substitution: the replacement of an element of the input, such as a word, with another. The system outputs "song" while the input was "long".

    Insertion: the system adds an element, such as a word, when no word was input. The system outputs "A E I O U" while the input was "A E I U".
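
    The three error counts are usually obtained from a word-level edit-distance alignment; here is a minimal sketch (a standard formulation, assumed rather than given on the slide):

    def wer(reference, hypothesis):
        # Word error rate via edit distance over words: the minimum number
        # of substitutions, deletions and insertions, divided by the number
        # of words actually spoken.
        ref, hyp = reference.split(), hypothesis.split()
        n, m = len(ref), len(hyp)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i                            # i deletions
        for j in range(m + 1):
            d[0][j] = j                            # j insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub,                 # substitution (or match)
                              d[i - 1][j] + 1,     # deletion
                              d[i][j - 1] + 1)     # insertion
        return d[n][m] / n

    print(wer("A E I O U", "A E I U"))   # one deletion in five words = 0.2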



    Speech Recognition as Assistive Technology

    Main use is as an alternative, hands-free data entry mechanism

    Very effective

    Much faster than switch access

    Mainstream technology

    Used in many applications where hands are needed for other things, e.g. using a mobile phone while driving, or in surgical theatres


    Dictation is a big part of office administration, and commercial speech recognition systems are targeted at this market.


    Some interesting facts

    Switch-access users who were at around 5 words per minute achieved 80 words per minute with SR

    This allowed them to do state exams

    SR can be used for environmental control systems around the home, e.g.

    Open Curtains


    People with speech impairment (dysarthric speech) have shown improved articulation after using SR systems, especially discrete systems



    Reasons why SR may fail some people

    Crowded room - cannot have everyone talking at once

    Too many errors because all noises, coughs, throat clearances etc. are picked up

    Speech not good enough to use it

    Not enough training

    Cognitive overhead too much for some people


    Too demanding physically - hard work to talk for a long time

    Cannot be bothered with initial enrolment

    Drinking - adversely affects the vocal cords

    Smoking, shouting, dry mouth and illness all affect the vocal tract

    Need to drink water

    Room must not be too stuffy


    Some links

    The following are links to major speech recognition sites



    Carnegie Mellon Speech Demos

    CMU Communicator

    Call: 1-877-CMU-PLAN (268-7526), also 268-5144, or x8-1084

    the information is accurate; you can use it for your own travel planning

    CMU Universal Speech Interface (USI)

    CMU Movie Line

    Seems to be about apartments now

    Call: (412) 268-1185

    http://www.speech.cs.cmu.edu/Communicator/
    http://www.speech.cs.cmu.edu/usi/
    http://www.speech.cs.cmu.edu/Movieline/

    Telephone Demos

    Nuance: http://www.nuance.com

    Banking: 1-650-847-7438

    Travel Planning: 1-650-847-7427

    Stock Quotes: 1-650-847-7423

    SpeechWorks: http://www.speechworks.com/demos/demos.htm

    Banking: 1-888-729-3366

    Stock Trading: 1-800-786-2571


    MIT Spoken Language Systems Laboratory: http://www.sls.lcs.mit.edu/sls/whatwedo/applications.html

    Travel Plans (Pegasus): 1-877-648-8255

    Weather (Jupiter): 1-888-573-8255

    IBM: http://www-3.ibm.com/software/speech/

    Mutual Funds, Name Dialing: 1-877-VIA-VOICE
