Introduction to Computer Speech Processing

Alex Acero, Research Area Manager, Microsoft Research

Outline

• Grand challenges in Speech and Language
• Vision videos
• Products today
• Prototypes
• The role of speech
• Technology Introduction

User Expectations for Speech

The Turing Test

• Imitation Game:
– Judge, man, and a woman
– All chat via Email.
– Man pretends to be a woman.
– Man lies, woman tries to help judge.
– Judge must identify man after 5 minutes.

• Turing Test
– Replace man or woman with a computer.
– Fool judge 30% of the time.

Thanks to Jim Gray for material

What Turing Said

“I believe that in about fifty years' time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning. The original question, "Can machines think?" I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.”

Alan M. Turing, 1950. “Computing Machinery and Intelligence.” Mind, Vol. LIX, 433-460.

Prediction 59 Years Later

• Turing’s technology forecast was great!
– Gigabyte memory is common

• Computer beat world chess champion
– with some help from its programming staff!

• Computers help design most things today

Prediction 59 Years Later

• Intelligence forecast was optimistic
– Several internet sites offer Turing Test chatterbots.
– None pass (yet) http://www.loebner.net/Prizef/loebner-prize.html

• But I believe it will not be long:
– less than 50 years, more than 10 years

• Turing test still stands as a long-term challenge

Challenges Implicit in the Turing Test

1. Read and understand as well as a human
2. Think and write as well as a human
3. Hear as well as a native speaker: Speech Recognition (speech to text)
4. Speak as well as a native speaker: Speech Synthesis (text to speech)
5. Remember what is heard and quickly return it on request.

Moore’s law (1965)

• Gordon Moore: “The number of transistors per chip will double every 18 months”: 100x per decade
• Progress in next 18 months = ALL previous progress
– New storage = sum of all old storage (ever)
– New processing = sum of all old processing.
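The two numeric claims on this slide follow from steady doubling, and a few lines of arithmetic make that visible. The sketch below assumes the 18-month doubling period quoted above and an arbitrary starting capacity of 1 unit; the function and variable names are illustrative only.

```python
MONTHS_PER_DOUBLING = 18  # doubling period quoted on the slide


def growth_factor(months: float) -> float:
    """Multiplicative growth after `months`, assuming steady doubling."""
    return 2.0 ** (months / MONTHS_PER_DOUBLING)


# Capacity at the start of each successive 18-month period (arbitrary units).
capacity = [2 ** k for k in range(11)]                      # 1, 2, 4, ..., 1024
# Capacity added during each period.
added = [capacity[k + 1] - capacity[k] for k in range(10)]

per_decade = growth_factor(120)  # roughly 100x, as the slide says
```

Each entry of `added` equals the starting unit plus everything added in earlier periods, which is the sense in which progress in the next 18 months equals all previous progress.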

15 years ago

Making Chips Smaller

• Advances in Lithography: science of "drawing" circuits on chips

• Impact of Moore’s law:
– Short distances => smaller processing time
– Smaller size => lower cost per transistor
– Amount of memory is increased

• But it is not a law of physics: a mere self-fulfilling prophecy.

Moore’s law not applicable to Machine Intelligence

• Speech technology benefited from Moore’s Law in the 1990’s.

• In the 21st century, faster chips mean recognition error appears faster

• New algorithmic advances needed to pass the Turing Test
• Error rate halves approx. every 7 years

Grand Challenges

“Within 10 years speech will be in every device. Things like speech and ink are so natural, when they get the right quality level they will be in everything. As technical hurdles such as background noise and context are overcome, major adoption of speech technology will arrive. Soon, dictating to PCs and giving commands to cell phones will be basic modes of interacting with technology”

Bill Gates, March 2004

Speech in Mobile devices

Speech for Students

Speech in cars

Soccer Mom in car

Insurance Agent driving

Japanese dictation

Telephony: Response point

Directory Assistance

• Automatic generation of robust grammars
– Users say “Calabria” or “Calabria restaurant”

• Nearby cities
– Is “Calabria restaurant” in Redmond or Kirkland?

• Some people say the address too
– “Pizza Hut on 3rd Avenue” in New York, New York

• Automatic normalization
– Acronyms, compound words, homonyms, misspelled words

Multimodal voice search

Click-Driven Automated Feedback

Acoustic Model, Language Model

CommuteUX

Speech in Education

VerbalMath

Virtual Receptionist

Video Search (Frank Seide, MSRA)

http://msrardemos/AudioSearch/default.aspx?index=internetav

Browsing a Video (Milind Mahajan & Patrick Nguyen)

Podcast authoring (Patrick Nguyen)

Role of Speech in Different Devices

[Figure: devices placed on two axes, ease of text input (keyboard/pen) and ease of GUI (screen/pointer), each running from Low to High. Devices: Internet TV, Phone, Screen Phone, PDA, Tablet PC, PC, Car.]

A Roadmap for Speech

[Figure: the same two axes and devices (Phone, Screen Phone, PDA, Tablet PC, PC, Car), with three application regions: Speech-Only Telephony, Dictation, Multimodal Command/Control.]

Speech Technology

[Figure labels: Market Opportunity, Technology Readiness, Customer Need, Poor Alternative. Application areas: Meeting / Voicemail Transcription; Mobile Devices / Cars; Telephony / Call Center; Accessibility; Desktop Dictation; Desktop Command & Control.]

Voice-enabled System Technology Components

[Figure: processing loop. Speech -> ASR (Automatic Speech Recognition) -> Words -> SLU (Spoken Language Understanding) -> Meaning -> DM (Dialog Management, using Data, Rules) -> Action -> SLG (Spoken Language Generation) -> Words -> TTS (Text-to-Speech Synthesis) -> Speech.]

Basic Formulation

• Basic equation of speech recognition is

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)$$

X = X1, X2, …, Xn is the acoustic observation

W is the word sequence

P(X|W) is the acoustic model

P(W) is the language model
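The decision rule can be sketched in a few lines: score every candidate word sequence W by log P(X|W) + log P(W) and keep the best. P(X) is left out because it is the same for every W. The candidate sentences and all probabilities below are made-up numbers for illustration, not output from a real recognizer.

```python
import math


def recognize(acoustic_loglik, language_logprob):
    """Return the word sequence W that maximizes log P(X|W) + log P(W)."""
    return max(acoustic_loglik,
               key=lambda w: acoustic_loglik[w] + language_logprob[w])


# Hypothetical scores for one utterance X.
acoustic_loglik = {
    "recognize speech": math.log(0.020),     # log P(X|W), acoustic model
    "wreck a nice beach": math.log(0.025),
}
language_logprob = {
    "recognize speech": math.log(0.010),     # log P(W), language model
    "wreck a nice beach": math.log(0.001),
}

best = recognize(acoustic_loglik, language_logprob)
```

In this toy case the acoustic model slightly prefers the wrong sentence, and the language model tips the decision toward the right one.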

Speech Recognition

[Figure: recognizer block diagram. Input Speech -> Feature Extraction -> Pattern Classification (Decoding, Search) -> Confidence Scoring -> “Hello World” (0.9) (0.8). The Acoustic Model, Word Lexicon, and Language Model feed Pattern Classification.]

Feature Extraction

Goal: Extract robust features (information) from the speech that are relevant for ASR.

Method: Spectral analysis through either a bank of filters or Linear Predictive Coding, followed by a non-linearity and normalization.

Result: Signal compression: for each window of speech samples, 30 or so features are extracted (64,000 b/s -> 5,200 b/s).

Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise and echo.
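As a rough illustration of the bank-of-filters approach to feature extraction, the sketch below cuts a signal into overlapping windows, takes a power spectrum, and reduces each window to a handful of log band energies. It is a simplification under stated assumptions: equal-width rectangular bands instead of perceptually spaced filters, a naive DFT, 8 bands rather than 30 or so features, no normalization step, and a synthetic 1 kHz tone as input. All names and parameter values are illustrative.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping windows of frame_len, advancing by hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Power of DFT bins 0..N/2 after a Hamming window (naive O(N^2) DFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def log_band_energies(frame, num_bands):
    """Log energy in equal-width frequency bands (the log is the non-linearity)."""
    power = power_spectrum(frame)
    width = len(power) // num_bands
    return [math.log(sum(power[b * width:(b + 1) * width]) + 1e-10)
            for b in range(num_bands)]


# Hypothetical input: 0.1 s of a 1 kHz tone sampled at 8 kHz.
RATE = 8000
signal = [math.sin(2 * math.pi * 1000 * t / RATE) for t in range(800)]
frames = frame_signal(signal, frame_len=200, hop=80)   # 25 ms windows, 10 ms hop
features = [log_band_energies(f, num_bands=8) for f in frames]
```

Each 200-sample window is reduced to 8 numbers here, which is the kind of compression the slide describes.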

Acoustic Modeling

Goal: Model the probability of acoustic features for each phone model, i.e. p(X | /ae/)

Method: Hidden Markov Models (HMMs), trained through maximum likelihood (EM) or discriminative methods

Challenges/variability:
• Background noise: Cocktail Party Effect
• Dialect/accent
• Speaker
• Phonetic context: “Italy” vs “Italian” (the “t” sounds different)
• No spaces in speech: “Recognize speech” vs “Wreck a nice beach”
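To make the acoustic-model quantity p(X | phone) concrete, here is a minimal sketch of the forward algorithm for a small HMM. It assumes discrete observation symbols and a hypothetical 3-state left-to-right phone model with made-up probabilities; real acoustic models score continuous feature vectors, which this sketch does not attempt.

```python
def forward_likelihood(obs, initial, transition, emission):
    """p(X | model) for a discrete-observation HMM via the forward algorithm."""
    num_states = len(initial)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [initial[s] * emission[s][obs[0]] for s in range(num_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[prev] * transition[prev][s] for prev in range(num_states))
            * emission[s][symbol]
            for s in range(num_states)
        ]
    return sum(alpha)


# Hypothetical 3-state left-to-right phone model over 2 quantized feature symbols.
initial = [1.0, 0.0, 0.0]
transition = [[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]]
emission = [[0.9, 0.1],
            [0.2, 0.8],
            [0.9, 0.1]]

likelihood = forward_likelihood([0, 1, 0], initial, transition, emission)
```

Training (EM or discriminative) would adjust the transition and emission tables; this sketch only evaluates a fixed model.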

Word Lexicon

Goal: Map legal phone sequences into words according to phonotactic rules:

David   /d/ /ey/ /v/ /ih/ /d/

Multiple pronunciations: several words may have multiple pronunciations:

Data    /d/ /ae/ /t/ /ax/
Data    /d/ /ey/ /t/ /ax/

Challenges:
• How do you generate a word lexicon automatically?
• LTS (letter-to-sound) rules can be trained automatically with decision trees (CART) with less than 8% errors, but proper nouns are hard!
• How do you add new variant dialects and word pronunciations?
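A word lexicon with multiple pronunciations can be pictured as a small dictionary. The sketch below uses only the two example words from the slide and an exact-match lookup; the data layout and function names are assumptions made for illustration, not how any particular recognizer stores its lexicon.

```python
from collections import defaultdict

# Pronunciations taken from the slide examples; a real lexicon has far more entries.
LEXICON = {
    "David": [("d", "ey", "v", "ih", "d")],
    "Data": [("d", "ae", "t", "ax"), ("d", "ey", "t", "ax")],
}


def invert(lexicon):
    """Index the lexicon by phone sequence so phones can be mapped back to words."""
    by_phones = defaultdict(list)
    for word, pronunciations in lexicon.items():
        for phones in pronunciations:
            by_phones[phones].append(word)
    return by_phones


PHONES_TO_WORDS = invert(LEXICON)


def words_for(phones):
    """Words whose pronunciation matches the phone sequence exactly (may be empty)."""
    return PHONES_TO_WORDS.get(tuple(phones), [])
```

Both pronunciations of "Data" map back to the same word, and an unknown phone sequence returns an empty list, which is where letter-to-sound rules or new lexicon entries would come in.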


Pattern Classification (Decoding, Search)

Goal: Find the “optimal” word sequence by combining information (probabilities) from:
• Acoustic model
• Word lexicon
• Language model

Method: The decoder searches through all possible recognition choices using a Viterbi decoding algorithm.

Challenge: Searching a large network space is computationally expensive for large-vocabulary ASR. Techniques for efficient search: beam search, WFST (weighted finite-state transducers).
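The decoder's Viterbi search, with beam pruning as the efficiency measure, can be sketched on a toy model. The example below assumes a two-state model ("sil", "speech") with made-up probabilities and a hypothetical `beam` parameter in log units; a real decoder searches a far larger network built from the acoustic model, lexicon, and language model, and WFST-based search is not shown.

```python
import math


def viterbi(obs, states, log_initial, log_transition, log_emission, beam=math.inf):
    """Most likely state path; states scoring more than `beam` below the best are pruned."""
    scores = {s: log_initial[s] + log_emission[s][obs[0]] for s in states}
    backpointers = []
    for symbol in obs[1:]:
        best_now = max(scores.values())
        active = [s for s in states if scores[s] >= best_now - beam]
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(active, key=lambda p: scores[p] + log_transition[p][s])
            new_scores[s] = scores[prev] + log_transition[prev][s] + log_emission[s][symbol]
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1], scores[last]


def logs(table):
    """Convert a table of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Hypothetical model: silence vs. speech, observed as quiet or loud frames.
states = ["sil", "speech"]
log_initial = logs({"sil": 0.8, "speech": 0.2})
log_transition = {"sil": logs({"sil": 0.7, "speech": 0.3}),
                  "speech": logs({"sil": 0.2, "speech": 0.8})}
log_emission = {"sil": logs({"quiet": 0.9, "loud": 0.1}),
                "speech": logs({"quiet": 0.3, "loud": 0.7})}

obs = ["quiet", "loud", "loud", "quiet"]
path, score = viterbi(obs, states, log_initial, log_transition, log_emission)
pruned_path, _ = viterbi(obs, states, log_initial, log_transition, log_emission, beam=1.0)
```

With a wide enough beam the pruned search returns the same path as the full search while extending fewer hypotheses per frame; a beam that is too narrow can discard the true best path.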

Confidence Scoring

Goal: Identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU and DM.

Method: A confidence score based on a hypothesis likelihood ratio test is associated with each recognized word:

Label:       credit   please
Recognized:  credit   fees
Confidence:  (0.9)    (0.3)

Command-and-control: false rejection and false acceptance => ROC curves

Challenges: Rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
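The trade-off between false acceptance and false rejection in confidence scoring can be seen by sweeping an acceptance threshold over confidence scores. The sketch below takes the scores as given (it does not compute a likelihood ratio) and uses made-up (confidence, was_correct) pairs, the first two echoing the "credit fees" example; each threshold yields one point on an ROC-style curve.

```python
def error_rates(scored_words, threshold):
    """False-acceptance and false-rejection rates when accepting confidence >= threshold."""
    wrong = [c for c, correct in scored_words if not correct]
    right = [c for c, correct in scored_words if correct]
    false_accept = sum(c >= threshold for c in wrong) / len(wrong)
    false_reject = sum(c < threshold for c in right) / len(right)
    return false_accept, false_reject


# Hypothetical (confidence, was_correct) pairs.
scored = [(0.9, True), (0.3, False), (0.8, True),
          (0.6, False), (0.5, True), (0.2, False)]

# One (threshold, false_accept, false_reject) point per threshold.
roc_points = [(t, *error_rates(scored, t)) for t in (0.1, 0.4, 0.7)]
```

Raising the threshold lowers false acceptance and raises false rejection, so the operating point has to be chosen per application.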



Text-to-Speech Systems

TTS Engine

Raw text or tagged text
-> Text Analysis: Document Structure Detection, Text Normalization, Linguistic Analysis
-> tagged text
-> Phonetic Analysis: Homograph Disambiguation, Grapheme-to-Phoneme Conversion
-> tagged phones
-> Prosodic Analysis: Pitch & Duration Attachment
-> controls
-> Speech Synthesis: Voice Rendering
-> Audio out
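Text normalization, one step of the text analysis stage, turns written forms into the words to be spoken. The sketch below is a toy version under strong assumptions: a two-entry hypothetical abbreviation table, whitespace tokenization, and digits read out one at a time. It ignores context, so it could not tell "St." as "Street" from "Saint".

```python
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}   # hypothetical table
DIGITS = "zero one two three four five six seven eight nine".split()


def normalize(text):
    """Expand known abbreviations and spell out digit strings one digit at a time."""
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdecimal():
            words.extend(DIGITS[int(d)] for d in token)
        else:
            words.append(token)
    return " ".join(words)


spoken = normalize("Dr. Smith lives at 42 Main St.")
```

The later stages in the pipeline (phonetic analysis, prosody, synthesis) would then operate on the normalized word string.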

Multimedia Customer Care (Courtesy of AT&T)



Language Understanding

• Application Schema (XML for semantic entities) defines the application status

• A Semantic Context Free Grammar (CFG) parses an English sentence and fills in slots of the application schema.

Application Schema

<itinerary>
  <origin>
    <city></city>
    <state></state>
  </origin>
  <destination>
    <city></city>
    <state></state>
  </destination>
  <date></date>
</itinerary>

Semantic CFG

<rule name="itinerary">
  Show me flights from <ruleref name="origin"/>
  to <ruleref name="destination"/>
</rule>
<rule name="origin">
  <ruleref name="city"/>
</rule>
<rule name="destination">
  <ruleref name="city"/>
</rule>
<rule name="city">
  Seattle | San Francisco | New York
</rule>

An example sentence

“Show me flights from Seattle to New York”

would populate the application schema as

<itinerary>
  <origin>
    <city>Seattle</city>
    <state></state>
  </origin>
  <destination>
    <city>New York</city>
    <state></state>
  </destination>
  <date></date>
</itinerary>
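The slot filling shown above can be reproduced with a short sketch. Because this particular grammar has no recursion, a regular expression is enough to stand in for the semantic CFG; a general CFG would need a real parser. The function name `parse` and the use of `xml.etree.ElementTree` to build the schema are choices made for this illustration.

```python
import re
import xml.etree.ElementTree as ET

CITIES = ["Seattle", "San Francisco", "New York"]       # the "city" rule
CITY = "|".join(re.escape(c) for c in CITIES)
# The "itinerary" rule; "origin" and "destination" both expand to "city".
ITINERARY = re.compile(
    rf"^Show me flights from (?P<origin>{CITY}) to (?P<destination>{CITY})$"
)


def parse(sentence):
    """Fill the itinerary schema from a sentence, or return None if no rule matches."""
    match = ITINERARY.match(sentence)
    if match is None:
        return None
    itinerary = ET.Element("itinerary")
    for slot in ("origin", "destination"):
        node = ET.SubElement(itinerary, slot)
        ET.SubElement(node, "city").text = match.group(slot)
        ET.SubElement(node, "state")        # left empty: the sentence gives no state
    ET.SubElement(itinerary, "date")        # left empty: the sentence gives no date
    return itinerary


filled = parse("Show me flights from Seattle to New York")
```

Sentences outside the grammar, such as a request to book a hotel, return `None`, which is the brittleness that motivates statistical understanding.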



Who manages the Dialog?

Directed Dialog
– “Who would you like to contact?”
– Finite State Machine
– Simple CFG
– MSConnect

User Initiative Dialog
– “What can I do for you?”
– Ngrams
– Windows Airlines

[Figure labels: Initiative; Reservations, Flight Status, Baggage Claim, Special Announcements]
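A directed dialog is essentially a finite state machine in which the system asks every question. The sketch below assumes two hypothetical states ("ask_contact" and "confirm") and scripted user replies in place of a recognizer; only the first prompt comes from the slide.

```python
def run_dialog(replies):
    """Drive the state machine with scripted user replies; return prompts and the result."""
    state, contact, prompts = "ask_contact", None, []
    for reply in replies:
        if state == "ask_contact":
            prompts.append("Who would you like to contact?")
            if reply:                        # an empty reply models silence: re-prompt
                contact, state = reply, "confirm"
        elif state == "confirm":
            prompts.append(f"Did you say {contact}?")
            state = "done" if reply == "yes" else "ask_contact"
        if state == "done":
            break
    return prompts, (contact if state == "done" else None)


# Scripted session: silence, a misrecognized name, a rejection, then success.
prompts, contact = run_dialog(["", "Alex", "no", "Alice", "yes"])
```

The system always knows which answers to expect in each state, which is why a simple CFG per state is sufficient for recognition.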

Problems with directed dialogs

User-initiative dialogs

• Pros:
– Can result in a shorter call
– Can feel more natural
– Useful when too many choices

• Cons:
– Requires expensive expertise
– Could lead to user frustration: system appears human but caller can’t use full natural language

NLU Dialog Module

• Drag-and-drop Dialog Flow Designer
• Developer specifies:
– Destination branches
– Example sentences per branch
– Prompts (initial, mumble, no speech, etc.)

• Module generates SLM (statistical language model) and classifier
• It handles confirmation, reprompt, etc.
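How a classifier might be built from example sentences per branch can be sketched with a unigram model and add-one smoothing. This is an illustration only, not the module's actual method: the branch names are borrowed from the airline example, the training sentences are invented, and the statistical language model used for recognition is not shown.

```python
import math
from collections import Counter

# Hypothetical example sentences a developer might supply for each branch.
EXAMPLES = {
    "Reservations": ["I want to book a flight", "make a reservation"],
    "Flight Status": ["is my flight on time", "when does the flight arrive"],
    "Baggage Claim": ["I lost my bag", "where is my luggage"],
}


def train(examples):
    """Count words per branch and collect the overall vocabulary."""
    counts = {branch: Counter(w for s in sentences for w in s.lower().split())
              for branch, sentences in examples.items()}
    vocab = {w for c in counts.values() for w in c}
    return counts, vocab


def classify(sentence, counts, vocab):
    """Pick the branch with the highest add-one-smoothed unigram log-likelihood."""
    def score(branch):
        total = sum(counts[branch].values())
        return sum(math.log((counts[branch][w] + 1) / (total + len(vocab)))
                   for w in sentence.lower().split() if w in vocab)
    return max(counts, key=score)


counts, vocab = train(EXAMPLES)
branch = classify("I need to book a ticket to Boston", counts, vocab)
```

Words never seen in training are ignored, so the caller can phrase the request freely as long as a few informative words overlap with the examples.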

Multimodal System Technology Components

[Figure: the voice-enabled system diagram (ASR, SLU, DM, TTS; Speech, Words, Meaning, Action; Data, Rules) with added modalities: Natural Language, Visual, Pen, Gesture.]

MIPad

• Multimodal Interactive Pad
• MiPad
– Tap and Talk combines speech and pen
– Use context to simplify recognition
– Dictation allows complex command entry

• Usability studies show double throughput for English

• Speech is mostly useful in cases with lots of alternatives

Speech-centric Multimodal

Multimodality Benefits

• Compared to speech-only:
– User sees system response more quickly
– User sees what system understood
– User can know what system expects

• Compared to GUI-only:
– Faster entry
– Better use of small screen

But general language understanding is hard
