speech technology. hot! what are the big players in the area up to? google – technology.html

Speech Technology

Upload: marilynn-flynn

Post on 01-Jan-2016


TRANSCRIPT

Page 1: Speech Technology. HOT! What are the big players in the area up to? Google – technology.html

Speech Technology


HOT!


What are the big players in the area up to?

• Google
– http://googleblog.blogspot.com/2010/12/can-we-talk-better-speech-technology.html

• Microsoft
– http://gigaom.com/2010/12/06/microsoft-claims-its-place-in-a-voice-enabled-world/

• Apple
– http://www.dailyfinance.com/story/company-news/apples-siri-purchase-heats-up-the-race-toward-a-voice-activated/19458344/

• IBM
– http://www.ibm.com/news/in/en/2010/08/20/a896686u56875f96.html

• Nuance
– http://gigaom.com/2011/01/19/nuance-releases-mobile-sdk-to-speechify-apps/

• Voxeo


Apple, and the case of Siri

• Siri: http://www.youtube.com/watch?v=MpjpVAB06O4

• Review of Siri: http://www.youtube.com/watch?v=AohzWSkAU7c&feature=watch_response


Types of dialog systems

• by modality
– text-based
– spoken
– graphical user interface
– multi-modal

• by device
– telephone-based systems
– PDA systems
– in-car systems
– robot systems
– desktop/laptop systems
  • native
  • in-browser systems
  • in-virtual machine
– in-virtual environment
– robots

• by style
– command-based
– menu-driven
– natural language

• by initiative
– system initiative
– user initiative
– mixed initiative

• by application
– information service
– command-and-control
– entertainment
– education/tutorial
– edutainment
– reminder systems
– companion systems
– healthcare
– eldercare
– assistive/access systems


More about application types

• Information providing systems:
– weather reports
– stock quotes
– timetables
– ...

• Transaction-based systems:
– calendar functions
– shopping
– financial transactions
– travel reservations
– ...


Why Voice?


Why voice?

• Wireless devices have small screens and limited input capabilities.

• Telephone keypad can give users only a limited number of choices.

• Speech technology is improving.

• The exchange of information between a person and a computer is becoming more like a real conversation.

• Users want hands-free or eyes-free use.

• From a business viewpoint, voice applications open up a host of new revenue opportunities.

• There exist many more telephones than computers with the potential to access the Internet.


Traditional Interactive Voice Response (IVR)


Speech versus Touch Tone


Architecture 1


Architecture 2


Today

• Presentation of project ideas

• TTS evaluation

• Short intro to XML

• Speech technology standards overview

• Speech Synthesis Markup Language (SSML)

• Presentation of home assignment 3: ASR evaluation


Project ideas?


Intro to XML


W3C Speech Standards

Torbjörn Lager


VoiceXML – a part of the web

[Diagram. Labels: Web servers; VoiceXML browser (ASR, TTS, interpreter), labelled VoiceXML; HTML browser, labelled HTML]


The place of speech technology

• … speech technology itself has a very long way to go. … the most important thing may turn out to be not the speech technology itself, but the way in which speech technology connects to all the other technologies.

Tim Berners-Lee


The What and Why of Standards

• Software standards include terminology, languages and protocols specified by committees of experts for widespread use in the software industry. Software standards have both advantages and disadvantages.

• Advantages:
– developers can create applications using the standard languages that are portable across a variety of platforms;
– products from different vendors are able to interact with each other;
– a community of experts evolves around the standard and is available to develop products and services based on the standard.

• Disadvantages:
– some developers feel that standards may inhibit creativity and stall the introduction of superior technology.

• However, in the area of speech, vendors are enthusiastic about standards and frequently complain that standards are not developed fast enough.

• Emerging speech-technology standards could give a boost to an industry hampered by proprietary software and hardware.


World Wide Web Consortium

http://www.w3.org/


W3C Speech Standards

• Speech Recognition Grammar Specification (SRGS)
– What the user can say

• Semantic Interpretation for Speech Recognition (SISR)
– What the user means

• Speech Synthesis Markup Language (SSML)
– What the user hears

• VoiceXML
– Dialog management: What the system is to do


Speech Recognition Grammar Specification (SRGS)

• Covers both speech and DTMF (Dual-Tone Multi-Frequency) input. (DTMF is valuable in noisy conditions or when the social context makes it awkward to speak.)

• Grammars can be specified in either an XML or an equivalent augmented BNF (ABNF) syntax.

– Speech recognition is an inherently uncertain process. Recognizers may report confidence values.

– If the utterance has several possible parses, the recognizer may be able to report the most likely alternatives (N-best results).

• What about statistical language models? Not covered by SRGS!


Semantic Interpretation for Speech Recognition (SISR)

<grammar root="answer">

  <rule id="answer" scope="public">
    <one-of>
      <item><ruleref uri="#yes"/></item>
      <item><ruleref uri="#no"/></item>
    </one-of>
  </rule>

  <rule id="yes">
    <one-of>
      <item>yes</item>
      <item>yeah<tag>yes</tag></item>
      <item><token>you bet</token><tag>yes</tag></item>
      <item xml:lang="fr-CA">oui<tag>yes</tag></item>
    </one-of>
  </rule>

  <rule id="no">
    <one-of>
      <item>no</item>
      <item>nope</item>
      <item>no way</item>
    </one-of>
    <tag>no</tag>
  </rule>

</grammar>
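
The effect of the <tag> elements in the grammar above can be mimicked in a few lines of Python. This is only an illustrative sketch, not an SRGS/SISR processor: the table `SEMANTICS` and the function name `interpret_answer` are assumptions made for this example, and the untagged item "yes" is assumed to yield its own text as its value.

```python
# Hypothetical lookup table mirroring the "yes" and "no" rules above.
# Each recognised phrase maps to the semantic value its <tag> assigns;
# the untagged item "yes" is assumed to keep its own text as the value.
SEMANTICS = {
    "yes": "yes",
    "yeah": "yes",
    "you bet": "yes",
    "oui": "yes",
    "no": "no",
    "nope": "no",
    "no way": "no",
}


def interpret_answer(utterance):
    """Return "yes" or "no" for a phrase covered by the grammar, else None."""
    # Normalise case and whitespace before the lookup.
    key = " ".join(utterance.lower().split())
    return SEMANTICS.get(key)


print(interpret_answer("You bet"))  # yes
print(interpret_answer("nope"))     # no
print(interpret_answer("maybe"))    # None
```

Many surface forms collapse into two semantic values, which is what lets the dialog logic test for "yes" or "no" without caring how the user phrased it.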


Semantic Interpretation for Speech Recognition (SISR)

• I would like a coca cola and three large pizzas with pepperoni and mushrooms

{ drink: { liquid:"coke", drinksize:"medium"}, pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] }}


<grammar root="order">

  <rule id="order">
    I would like a
    <ruleref uri="#drink"/>
    <tag>out.drink = new Object(); out.drink.liquid=rules.drink.type; out.drink.drinksize=rules.drink.drinksize;</tag>
    and
    <ruleref uri="#pizza"/>
    <tag>out.pizza=rules.pizza;</tag>
  </rule>

  <rule id="kindofdrink">
    <one-of>
      <item>coke</item>
      <item>pepsi</item>
      <item>coca cola<tag>out="coke";</tag></item>
    </one-of>
  </rule>

  <rule id="foodsize">
    <tag>out="medium";</tag>
    <item repeat="0-1">
      <one-of>
        <item>small<tag>out="small";</tag></item>
        <item>medium</item>
        <item>large<tag>out="large";</tag></item>
        <item>regular<tag>out="medium";</tag></item>
      </one-of>
    </item>
  </rule>

  <rule id="tops">
    <tag>out=new Array;</tag>
    <ruleref uri="#top"/>
    <tag>out.push(rules.top);</tag>
    <item repeat="1-">
      and
      <ruleref uri="#top"/>
      <tag>out.push(rules.top);</tag>
    </item>
  </rule>

  <rule id="top">
    <one-of>
      <item>anchovies</item>
      <item>pepperoni</item>
      <item>mushroom<tag>out="mushrooms";</tag></item>
      <item>mushrooms</item>
    </one-of>
  </rule>

  <rule id="drink">
    <ruleref uri="#foodsize"/>
    <ruleref uri="#kindofdrink"/>
    <tag>out.drinksize=rules.foodsize; out.type=rules.kindofdrink;</tag>
  </rule>

  <rule id="pizza">
    <ruleref uri="#number"/>
    <ruleref uri="#foodsize"/>
    <tag>out.pizzasize=rules.foodsize; out.number=rules.number;</tag>
    pizzas with
    <ruleref uri="#tops"/>
    <tag>out.topping=rules.tops;</tag>
  </rule>

  <rule id="number">
    <one-of>
      <item>
        <tag>out=1;</tag>
        <one-of>
          <item>a</item>
          <item>one</item>
        </one-of>
      </item>
      <item>two<tag>out=2;</tag></item>
      <item>three<tag>out=3;</tag></item>
    </one-of>
  </rule>

</grammar>

I would like a coca cola and three large pizzas with pepperoni and mushrooms

{ drink: { liquid:"coke", drinksize:"medium" }, pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] }}
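
To make the mapping from sentence to result concrete, here is a rough Python sketch that reproduces the semantic result for sentences covered by the order grammar. It is not how an SRGS/SISR processor works internally: a regular expression stands in for the recogniser, and every name here (`interpret_order`, `ORDER_PATTERN`, the lookup tables) is invented for this illustration.

```python
import re

# Phrase-to-value tables mirroring the <one-of> lists in the grammar above.
DRINKS = {"coke": "coke", "pepsi": "pepsi", "coca cola": "coke"}
SIZES = {"small": "small", "medium": "medium", "large": "large", "regular": "medium"}
NUMBERS = {"a": 1, "one": 1, "two": 2, "three": 3}
TOPPINGS = {"anchovies": "anchovies", "pepperoni": "pepperoni",
            "mushroom": "mushrooms", "mushrooms": "mushrooms"}


def _alternatives(table):
    """Build a regex alternation from the keys of a phrase table."""
    return "|".join(re.escape(phrase) for phrase in table)


# A regular expression standing in for the "order", "drink" and "pizza" rules.
ORDER_PATTERN = re.compile(
    r"i would like a "
    r"(?:(?P<drinksize>{size}) )?(?P<drink>{drink})"
    r" and "
    r"(?P<number>{number}) (?:(?P<pizzasize>{size}) )?pizzas with (?P<tops>.+)"
    .replace("(?P<pizzasize>{size})", "(?P<pizzasize>{size2})")
    .format(size=_alternatives(SIZES), size2=_alternatives(SIZES),
            drink=_alternatives(DRINKS), number=_alternatives(NUMBERS))
)


def interpret_order(utterance):
    """Return the semantic result as a dict, or None if the sentence is not covered."""
    match = ORDER_PATTERN.fullmatch(" ".join(utterance.lower().split()))
    if match is None:
        return None
    tops = match.group("tops").split(" and ")
    # The "tops" rule uses repeat="1-", so at least two toppings are required.
    if len(tops) < 2 or any(top not in TOPPINGS for top in tops):
        return None
    return {
        "drink": {
            "liquid": DRINKS[match.group("drink")],
            # The "foodsize" rule defaults to "medium" when no size is spoken.
            "drinksize": SIZES.get(match.group("drinksize"), "medium"),
        },
        "pizza": {
            "number": NUMBERS[match.group("number")],
            "pizzasize": SIZES.get(match.group("pizzasize"), "medium"),
            "topping": [TOPPINGS[top] for top in tops],
        },
    }


print(interpret_order(
    "I would like a coca cola and three large pizzas with pepperoni and mushrooms"
))
```

The drink size comes out as "medium" even though the speaker never said it, because the foodsize rule assigns that default before the optional size word is matched.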


Foundational

• Grammar (CFG, PSG)
• Automata theory (FSMs, FSTs, etc.)
• Logic

• Phonetics
• Linguistics
• Computer science


Speech Synthesis Markup Language (SSML)

• The key concepts of SSML are
– interoperability, or interacting with other markup languages (VoiceXML, etc.);
– consistency, or providing predictable control of voice output across platforms and across speech synthesis implementations; and
– internationalization, or enabling speech output in a large number of languages within or across documents.


Speech Synthesis Markup Language (SSML) – An Example

<speak>
  <p>
    <s xml:lang="en-US">
      <voice name="David" gender="male" age="25">
        For English, press <emphasis>one</emphasis>.
      </voice>
    </s>
    <s xml:lang="es-MX">
      <voice name="Miguel" gender="male" age="25">
        Para español, oprima el <emphasis>dos</emphasis>.
      </voice>
    </s>
  </p>
</speak>


Text Structure: p and s Elements

• A p element represents a paragraph. An s element represents a sentence.

<speak>
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here's another sentence.</s>
  </p>
</speak>


The phoneme Element

• The phoneme element provides a phonemic/phonetic pronunciation for the contained text.

<speak>
  <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;">tomato</phoneme>
</speak>


The sub Element

• The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form.

<?xml version="1.0"?>
<speak>
  <sub alias="World Wide Web Consortium">W3C</sub>
</speak>
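
The difference between the written and the spoken form can be shown with a small Python sketch based on the standard library's xml.etree.ElementTree. It is an illustration only: the helper `spoken_text` and the sample sentence are made up for this example, and the SSML is assumed to carry no XML namespace.

```python
import xml.etree.ElementTree as ET


def spoken_text(element):
    """Return the text to be spoken, using the alias of every <sub> element."""
    if element.tag == "sub":
        # The alias replaces the contained text for pronunciation.
        return element.get("alias", "")
    parts = [element.text or ""]
    for child in element:
        parts.append(spoken_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical sample document (no namespace declaration, for simplicity).
ssml = '<speak>Visit the <sub alias="World Wide Web Consortium">W3C</sub> site.</speak>'
root = ET.fromstring(ssml)

written = "".join(root.itertext())
spoken = spoken_text(root)

print(written)  # Visit the W3C site.
print(spoken)   # Visit the World Wide Web Consortium site.
```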


The voice Element

• The voice element is a production element that requests a change in speaking voice. A selection of attributes is:

– gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral".

– age: optional attribute indicating the preferred age in years (since birth) of the voice to speak the contained text.

– name: optional attribute indicating a processor-specific voice name to speak the contained text.

<?xml version="1.0"?>
<speak>
  <voice gender="female">Mary had a little lamb,</voice>

  <!-- now request a different female child's voice -->
  <voice gender="female" age="7">Its fleece was white as snow.</voice>

  <!-- processor-specific voice selection -->
  <voice name="Mike">I want to be like Mike.</voice>
</speak>


The emphasis Element

• The emphasis element requests that the contained text be spoken with emphasis.

<speak>
  That is a <emphasis> big </emphasis> car!
  That is a <emphasis level="strong"> huge </emphasis> bank account!
</speak>


The break Element

• The break element is an empty element that controls the pausing or other prosodic boundaries between words.

<speak>
  Take a deep breath <break/> then continue.
  Press 1 or wait for the tone. <break time="3s"/>
  I didn't hear you! <break strength="weak"/> Please repeat.
</speak>


The prosody Element

• The prosody element permits control of the pitch, speaking rate and volume of the speech output.

• The attributes, all optional, are:

– pitch: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch levels.

– contour: sets the actual pitch contour for the contained text. The format is specified in Pitch contour below.

– range: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch ranges.

– rate: a change in the speaking rate for the contained text. Legal values are: a relative change or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When a number is used to specify a relative change it acts as a multiplier of the default rate. For example, a value of 1 means no change in speaking rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice should be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.

– duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the time value format from the Cascading Style Sheet Level 2 Recommendation [CSS2], e.g. "250ms", "3s".

– volume: the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying "silent"). Legal values are: number, a relative change or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The volume scale is linear amplitude. The default is 100.0. Labels "silent" through "x-loud" represent a sequence of monotonically non-decreasing volume levels.


The prosody Element (cont’d)

• Pitch contour. The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output.

• The algorithm for interpolating between the targets is processor-specific.

• In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value).

<?xml version="1.0"?>
<speak>
  <prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)">
    good morning
  </prosody>
</speak>
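
The structure of a contour value is easy to see once it is split into its (time position, target) pairs. The Python sketch below does only that; the regular expression and the name `parse_contour` are assumptions for illustration, and the sketch does not validate targets or interpolate between them (interpolation is processor-specific anyway).

```python
import re

# One "(time position,target)" pair: a percentage, a comma, then a pitch target.
PAIR = re.compile(r"\(\s*(\d+(?:\.\d+)?)%\s*,\s*([^)\s]+)\s*\)")


def parse_contour(contour):
    """Split a contour attribute value into (percent, target) pairs."""
    return [(float(position), target) for position, target in PAIR.findall(contour)]


print(parse_contour("(0%,+20Hz) (10%,+30%) (40%,+10Hz)"))
# [(0.0, '+20Hz'), (10.0, '+30%'), (40.0, '+10Hz')]
```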


Today

• Project reminder

• Presentation of the results of the TTS evaluation

• Speech Synthesis Poetry Slam

• Wrapping up TTS (stages of TTS)

• Presentation of home assignment 3: ASR evaluation

• Automatic speech recognition (ASR)

• Natural language understanding (NLU)

• Speech Recognition Grammar Specification (SRGS)

• Semantic Interpretation for Speech Recognition (SISR)

• Thursday's Lab session


Architecture 1


Wrapping up TTS

• Stages of TTS:
– Structure analysis (sentence splitting)
– Text normalisation
– Text to phoneme conversion
– Prosody analysis
– Waveform production

• Speech Synthesis Markup Language
– enables developers to override default behavior


TTS stages and SSML elements

Stage                                      SSML elements

Structure analysis (sentence splitting)    <p>, <s>, ., ?, !

Text normalisation                         <sub>, <say-as>

Text to phoneme conversion                 <phoneme>

Prosody analysis                           <prosody>, <emphasis>, <break>, ., ?, !

Waveform production                        <voice>, <audio>


Prosody analysis

• Prosody covers pitch (intonation or melody), timing (rhythm), pauses, speech rate, emphasis on words, and the relative timing of segments and pauses.

• Most TTS engines have a prosody analysis algorithm responsible for producing the prosody of synthesized speech, which is often based on the parts of speech. For example, nouns, verbs, and adjectives may be accented, whereas auxiliary verbs and prepositions may be de-stressed.

• Synthesized speech pauses at commas and inflects the sentence according to whether it is declarative, interrogative, or exclamatory.

• Prosody rules and algorithms are not perfect and are a topic of ongoing research. Prosody rules for different spoken languages and national varieties may be quite different. For example, the prosody of American, British, Indian, and Jamaican pronunciations of English is different.


Speech Recognition (ASR)


Architecture 1


ASR Input and Output

• A speech recognizer is a component with the following inputs and outputs:

• Input

– A grammar or multiple grammars as defined by the SRGS specification. These grammars inform the recognizer of the words and patterns of words to listen for.

– An audio stream that may contain speech content that matches the grammar(s).

– Parameters: timeouts, recognition thresholds, or N-best result counts.

• Output

– Descriptions of results that indicate details about the speech content detected by the speech recognizer. Recognizers will include at least a transcription of any detected words.

– Errors and other performance information such as confidence
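
How the parameters interact with the output can be sketched in Python. This does not describe any particular recognizer API: the class `Hypothesis`, the function `select_results`, the sample data, and the assumption that confidence lies between 0.0 and 1.0 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One recognition result: a transcription plus a confidence score."""
    transcription: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def select_results(hypotheses, threshold, n_best):
    """Keep hypotheses at or above the threshold, best first, at most n_best."""
    accepted = [h for h in hypotheses if h.confidence >= threshold]
    accepted.sort(key=lambda h: h.confidence, reverse=True)
    return accepted[:n_best]


# Made-up recogniser output for a single utterance.
hypotheses = [
    Hypothesis("austin", 0.61),
    Hypothesis("boston", 0.82),
    Hypothesis("houston", 0.30),
]

best = select_results(hypotheses, threshold=0.5, n_best=2)
print([h.transcription for h in best])  # ['boston', 'austin']
```

An empty list here would correspond to a "no match" situation, which a dialog would typically handle by re-prompting the user.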


SRGS


SRGS

<grammar root="s">
  <rule id="s">
    hello
  </rule>
</grammar>

s -> "hello"


SRGS

<grammar root="s">
  <rule id="s">
    <one-of>
      <item>hello</item>
      <item>goodbye</item>
    </one-of>
  </rule>
</grammar>

s -> "hello"
s -> "goodbye"

s -> "hello"
   | "goodbye"


SRGS

<grammar root="s">
  <rule id="s">
    hello
    <item repeat="0-1">
      how are you
    </item>
  </rule>
</grammar>

s -> "hello" ("how are you")


SRGS

<grammar root="s">
  <rule id="s">
    <item repeat="1-">
      hello
    </item>
  </rule>
</grammar>

s -> "hello"
s -> "hello" s

s -> "hello"+

NOTE: Listing all sentences is no longer possible (the language is infinite)


SRGS

<grammar root="s">
  <rule id="s">
    <item repeat="1-">
      <one-of>
        <item>hello</item>
        <item>goodbye</item>
      </one-of>
    </item>
  </rule>
</grammar>


SRGS

<grammar root="s">
  <rule id="s">
    <item repeat="1-">
      <ruleref uri="#greeting"/>
    </item>
  </rule>
  <rule id="greeting">
    <one-of>
      <item>hello</item>
      <item>goodbye</item>
    </one-of>
  </rule>
</grammar>

s -> greeting+

greeting -> "hello"
          | "goodbye"
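
Although the sentences of this grammar can no longer be listed, checking whether a given utterance belongs to the language stays simple. The Python sketch below tests membership for s -> greeting+; the names `GREETING` and `matches_s` are invented for the example, and a real recognizer works on audio rather than on text tokens.

```python
# The two alternatives of the "greeting" rule.
GREETING = {"hello", "goodbye"}


def matches_s(utterance):
    """True if the utterance is one or more greetings (s -> greeting+)."""
    tokens = utterance.lower().split()
    return len(tokens) >= 1 and all(token in GREETING for token in tokens)


print(matches_s("hello"))                # True
print(matches_s("hello goodbye hello"))  # True
print(matches_s("hello there"))          # False
print(matches_s(""))                     # False
```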


SRGS

<grammar root="city_state">
  <rule id="city">
    <one-of>
      <item>Boston</item>
      <item>Philadelphia</item>
      <item>Fargo</item>
    </one-of>
  </rule>
  <rule id="state">
    <one-of>
      <item>Florida</item>
      <item>North Dakota</item>
      <item>New York</item>
    </one-of>
  </rule>
  <rule id="city_state">
    <ruleref uri="#city"/>
    <ruleref uri="#state"/>
  </rule>
</grammar>
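
Because this grammar contains no repeats, its sentences can still be listed. The Python sketch below enumerates them with itertools.product; the list names and the function `city_state_sentences` are made up for illustration.

```python
from itertools import product

# The alternatives of the "city" and "state" rules, in grammar order.
CITY = ["Boston", "Philadelphia", "Fargo"]
STATE = ["Florida", "North Dakota", "New York"]


def city_state_sentences():
    """List every sentence the city_state rule accepts: a city, then a state."""
    return [f"{city} {state}" for city, state in product(CITY, STATE)]


sentences = city_state_sentences()
print(len(sentences))  # 9
print(sentences[0])    # Boston Florida
```

The grammar accepts every city with every state, so "Boston Florida" is as valid to the recognizer as "Fargo North Dakota"; the grammar constrains word order, not geography.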


SRGS + SISR

<grammar root="s">
  <rule id="s">
    hello
  </rule>
</grammar>

<grammar root="s">
  <rule id="s">
    <item>
      hello
      <tag>hi</tag>
    </item>
  </rule>
</grammar>


SRGS + SISR

<grammar root="answer">

  <rule id="answer">
    <one-of>
      <item><ruleref uri="#yes"/></item>
      <item><ruleref uri="#no"/></item>
    </one-of>
  </rule>

  <rule id="yes">
    <one-of>
      <item>yes</item>
      <item>yeah<tag>yes</tag></item>
      <item><token>you bet</token><tag>yes</tag></item>
      <item xml:lang="fr-CA">oui<tag>yes</tag></item>
    </one-of>
  </rule>

  <rule id="no">
    <one-of>
      <item>no</item>
      <item>nope</item>
      <item>no way</item>
    </one-of>
    <tag>no</tag>
  </rule>

</grammar>


SISR

• I would like a coca cola and three large pizzas with pepperoni and mushrooms

{ drink: { liquid:"coke", drinksize:"medium"}, pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] }}


<grammar root="order">

  <rule id="order">
    I would like a
    <ruleref uri="#drink"/>
    <tag>out.drink={}; out.drink.liquid=rules.drink.type; out.drink.drinksize=rules.drink.drinksize;</tag>
    and
    <ruleref uri="#pizza"/>
    <tag>out.pizza=rules.pizza;</tag>
  </rule>

  <rule id="kindofdrink">
    <one-of>
      <item>coke</item>
      <item>pepsi</item>
      <item>coca cola<tag>out="coke";</tag></item>
    </one-of>
  </rule>

  <rule id="foodsize">
    <tag>out="medium";</tag>
    <item repeat="0-1">
      <one-of>
        <item>small<tag>out="small";</tag></item>
        <item>medium</item>
        <item>large<tag>out="large";</tag></item>
        <item>regular<tag>out="medium";</tag></item>
      </one-of>
    </item>
  </rule>

  <rule id="tops">
    <tag>out=[];</tag>
    <ruleref uri="#top"/>
    <tag>out.push(rules.top);</tag>
    <item repeat="1-">
      and
      <ruleref uri="#top"/>
      <tag>out.push(rules.top);</tag>
    </item>
  </rule>

  <rule id="top">
    <one-of>
      <item>anchovies</item>
      <item>pepperoni</item>
      <item>mushroom<tag>out="mushrooms";</tag></item>
      <item>mushrooms</item>
    </one-of>
  </rule>

  <rule id="drink">
    <ruleref uri="#foodsize"/>
    <ruleref uri="#kindofdrink"/>
    <tag>out.drinksize=rules.foodsize; out.type=rules.kindofdrink;</tag>
  </rule>

  <rule id="pizza">
    <ruleref uri="#number"/>
    <ruleref uri="#foodsize"/>
    <tag>out.pizzasize=rules.foodsize; out.number=rules.number;</tag>
    pizzas with
    <ruleref uri="#tops"/>
    <tag>out.topping=rules.tops;</tag>
  </rule>

  <rule id="number">
    <one-of>
      <item>
        <tag>out=1;</tag>
        <one-of>
          <item>a</item>
          <item>one</item>
        </one-of>
      </item>
      <item>two<tag>out=2;</tag></item>
      <item>three<tag>out=3;</tag></item>
    </one-of>
  </rule>

</grammar>

I would like a coca cola and three large pizzas with pepperoni and mushrooms

{ drink: { liquid:"coke", drinksize:"medium" }, pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] }}