Listening to the Gamer: Getting Speech Recognition in Games Right

Speaker Information
• Jason Hewitt, Advanced Technology Group, Microsoft
• Dr. Mike Froggatt, Developer Lead, Microsoft Game Studios

Audience
• Are you thinking about adding speech to your game?
• Are you targeting a console or the PC?
  • Portable platforms are also good!
• Takeaway: Speech is on our consoles, and it’s easy to add.

Speech in games is not new!
• It was in Unreal Tournament 2004!
• It was on the PS2!
• It’s there, ready to use.

Two ways to listen
• General Dictation
• Command and Control

Multiple Solutions

Fonix http://www.fonixspeech.com
• Platforms: Win32, Xbox 360, PS3, Wii
• Languages: US English, UK English, French, German, Italian, Spanish, Japanese, Korean

Voxler http://www.voxler.eu
• Platforms: Win32, Xbox 360, PS3, Wii, Nintendo DS, iPhone
• Languages: “All major English dialects and European languages”

NuiSpeech
• Kinect only
• Languages: English (US & UK), French, Japanese, Spanish (Mexico)
• Preview models of French (Canadian and France), German, Italian, Spanish (Spain) and English (Australian)
• Designed specifically for a 10ft experience

Others are out there

Microphones Overview
Columns: Platform, Headset Mic, Handheld Mic, Mic Array/Room Mic, Integrated
• PC: ? ?
• Xbox 360: ? ?, Kinect only
• PS3: ? ?, Eye only
• Wii: ?, Wii Speak only
• Sony NGP: Yes
• 3DS/DS/DSi: Yes
• Smart phones: Yes

Each platform has its own microphones and platform capabilities, so you can either take the lowest common denominator or customize to each platform’s strengths.

Speech Has Two Inputs

Grammar + Speech → Recognition Engine → Results

Deciding Speech’s Role in the Game

Apply Good Design Principles
• Set your goals at the beginning of the project
  • Don’t add speech recognition with a month to spare
• Evaluate the tech
• Prototype
• Rethink the goals
• Be consistent
  • Users expect that what works once works always
• Decide early if you want to count on a mic
  • Remember: requiring Kinect or Move = free mic

To require a mic?
• It’s natural!
• New gameplay mechanics
• Expands user control
• Controller fallbacks not necessary
  • But are still a plus!

Or to not require a mic?
• Not everyone will have a mic
• Accessibility
  • Some gamers won’t or can’t talk

Menus and Pausing
• Don’t add at the last second!
• Think of your menu names and your grammar design at the same time
• If you do implement it, let users skip menu pages
• Beware of a “Pause” command
  • False positives can break flow
  • It may be best to gather intent before acting
  • Consider allowing users to disable it

Key Scenario Integration
• Focused on scenarios in games that can provide the biggest impact
  • Dialog tree navigation
  • Merchant/shop interaction
• Most ideas here can again be optional; allow the controller to be a backup

Full Title Integration
• Doesn’t mean voice only (but could)
• When approaching the game’s control scheme, consider if voice makes sense, for example:
  • Squad commands
  • Activation of controls
  • Help mechanisms
  • Volume of player’s speech levels in a horror or stealth game
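The last idea above, using how loudly the player speaks in a horror or stealth game, can be sketched without any recognition at all. This is a minimal illustration, assuming microphone samples are already normalized to the range -1.0..1.0; the function names and the whisper/shout levels are hypothetical and would need tuning per microphone.

```python
import math

def rms_dbfs(samples):
    """Return the RMS level of normalized samples (-1.0..1.0) in dBFS."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)

def noise_alert(samples, whisper_db=-40.0, shout_db=-12.0):
    """Map microphone loudness to a 0.0..1.0 'enemy alert' value.

    whisper_db and shout_db are assumed calibration points, not
    values from any platform documentation.
    """
    level = rms_dbfs(samples)
    if level <= whisper_db:
        return 0.0
    if level >= shout_db:
        return 1.0
    return (level - whisper_db) / (shout_db - whisper_db)
```

A game could feed `noise_alert` one short audio window per frame and let nearby enemies react once the value crosses a design-chosen level.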

What can I say?
• Teaching styles
  • “See it! Say it!” then “Know it! Say it!”
  • Repeat after me
  • Explore on your own
• Screen awareness
• Off-screen awareness

Expandable Menu (figure): Soldier! > Joe, Frank > Attack, Defend, Retreat; How’s the weather?

The Grammar

The Basics
• XML based
  • All use W3C format or a subset of the format: http://www.w3.org/TR/speech-grammar/
• Multiple rules can be activated or deactivated at once
• Custom pronunciations are available
  • This helps with in-game items
  • This can also help with words that are difficult to pronounce or understand

Grammar Size
• Check with your middleware provider on how many phrases are recommended
  • Key point: going beyond the recommended number of phrases means more chances for phrases to be similar and confusable
• Manage active phrases with rules
  • Remember you don’t need the shopkeeper recognition when fighting the dragon
  • Pause menu interaction should reduce the set of active rules
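Managing active phrases with rules might look like the sketch below. The rule names, game states and the `set_rule_active` callback are all hypothetical; a real engine exposes its own call for enabling and disabling a rule, so treat this as an outline of the bookkeeping only.

```python
# Hypothetical rule names and game states, for illustration only.
RULES_BY_STATE = {
    "shop": {"shop_commands", "dialog_choices", "pause"},
    "combat": {"squad_commands", "pause"},
    "pause_menu": {"menu_commands"},
}

class GrammarRuleManager:
    """Keeps only the rules needed by the current game state active."""

    def __init__(self, set_rule_active):
        # set_rule_active(rule_id, active) stands in for the engine call.
        self._set_rule_active = set_rule_active
        self._active = set()

    def enter_state(self, state):
        wanted = RULES_BY_STATE[state]
        # Turn off rules the new state does not need.
        for rule in sorted(self._active - wanted):
            self._set_rule_active(rule, False)
        # Turn on rules that were not already active.
        for rule in sorted(wanted - self._active):
            self._set_rule_active(rule, True)
        self._active = set(wanted)
        return self._active
```

Only the difference between states is sent to the engine, which keeps the number of rule-state changes small when the player moves from the shop to the dragon fight.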

Evolving the Grammar
• Start with a small initial word set
  • Do not proactively add too many recognition phrases
  • See through play testing where gamers go
  • Handle the common cases
• Synonyms are a slippery slope
  • Especially in a “See it! Say it!” scenario
• Multiple iterations provide better tuning

A Work in Progress

Design → Implement → Test

Test Each Iteration!
• Record your users saying phrases both in and out of grammar
• Consider automated nightly tests of each grammar iteration
• Measure false negatives, false positives, success rates
• Test in game scenarios
  • If two grammars are active at the same time, you must test them together
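The measurements listed above can be tallied from recorded test utterances with a small script, which is also a natural core for a nightly test. This is a sketch under assumptions: the `Trial` record and the category names are invented here, and the recordings are presumed to have been run through the engine already.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trial:
    expected: Optional[str]    # command the tester meant, or None for out-of-grammar speech
    recognized: Optional[str]  # command the engine returned, or None if it rejected the audio

def score(trials):
    """Count outcomes per category and compute the in-grammar success rate."""
    counts = {"correct": 0, "wrong_command": 0, "false_negative": 0,
              "false_positive": 0, "correct_reject": 0}
    for t in trials:
        if t.expected is None:
            # Out-of-grammar speech should be rejected.
            key = "correct_reject" if t.recognized is None else "false_positive"
        elif t.recognized is None:
            key = "false_negative"
        elif t.recognized == t.expected:
            key = "correct"
        else:
            key = "wrong_command"
        counts[key] += 1
    in_grammar = counts["correct"] + counts["wrong_command"] + counts["false_negative"]
    counts["success_rate"] = counts["correct"] / in_grammar if in_grammar else 0.0
    return counts
```

Comparing these counts between grammar iterations shows whether a newly added phrase reduced false negatives at the cost of more false positives.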

Working with Limitations
• Speech is not perfect
• Generally speech works best when
  • Background noise is at a minimum
  • The speaker enunciates
  • The grammar size is within the recommendation

Working with Side Talk
• There may be other noises in the room that the mic picks up
• Remember you can still respond to side talk!
  • “Hey, you talking to me?”
  • “Sorry, my (language) is very limited.”
• Test with a garbage rule

Working with Failure
• Even a speech recognition failure should be a success
• Handle misfires and repeats as part of the game
  • NPCs can have headaches, migraines, or explain their misunderstanding
  • “Sorry, what was that? I was thinking about sheep.”

Localization
• Begin localization after most design decisions are locked down
  • Iterate and design in your native language
  • Begin before it’s too late to work with translators, manual, etc.
• Be wary of text/UI translators
  • Spoken language can differ from the written language
  • Recommend audio translators
• Leverage your existing in-game dialogue translation team
  • They know the right voice to use for communication
  • “See it! Say it!” implementations will need to be translated by this same team
• Have native speakers testing
  • More than one native speaker is always better

Localization
• Provide plenty of background on the situation to the translator. The more info the better.
  • You should be doing this for in-game dialogue already; your team’s localization expert will be able to provide guidance here.
• Different languages map 1 word to 3 words and 3 words to 1, so provide context for each situation
• Remember to coordinate changes across languages

Listening to the Gamer: Getting Speech Recognition in Games Right
Kinectimals Speech Post-Mortem

“If I could talk with the animals…”
• Kinectimals was the standard-bearer for speech recognition at Kinect launch
• Lofty goals:
  • Natural interaction with animal through speech
  • Praise, issue commands, call animal by name
• Ultimately delivered robust recognition for a reasonable command set
• Animal naming most challenging component to implement

Goal: Perfect Feline Behaviour

Design

<grammar xml:lang="en-us" version="1.0" root="dash_commands">
  <rule id="dash_commands" scope="public">
    <one-of>
      <item>
        Hey, is this thing on? Xbox, can you hear me? Hey Jimmy! Come look at this! The Xbox understands me!
        <tag> exec "dash.xex /upgrade_to_gold_account /quiet" </tag>
      </item>
      <item>
        Oh Xbox, you’re my only friend - my girlfriend’s left me and no one understands me like you do.
        <tag> exec "halo_reach.xex" </tag>
      </item>
    </one-of>
  </rule>
</grammar>

Design Giveth…
• Game design is our friend
  • No expectation of animals understanding speech perfectly
  • Player more forgiving of incorrect or failed recognition
  • Children interpreted failed recognition as animal “being naughty”

…Design Taketh Away
• Design is our enemy
  • Familiar situation produces habitual response
  • Expectation that what a real animal responds to, the game will respond to
  • Commands framed with non-essential vocal noise
    • “Hey Skittles, sit down, please”
• Speech commands often mode-less
  • Where to allow / disallow them?

Don’t Both Talk at Once
• Narrator character introduced late in design
  • Gave instruction on gestures and speech commands to use
  • Narrator saying “Sit down” often made animal sit down
• Specific hardware can help with this
  • Kinect has array microphone with Multichannel Echo Cancellation (MEC)
  • Effectiveness dependent on microphone calibration
• Better to avoid issue altogether if possible
  • Disable speech recognition while narrator speaking
  • Watch out for NPC speech triggering commands during gameplay
    • Example: team-mates shouting “Take cover!”
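One way to keep the game from obeying its own narrator is sketched below. Note the swap: instead of disabling the engine (which the slide recommends), this version leaves the engine running and discards results that arrive while game speech is playing, plus a short tail. The class, its method names and the 0.5 second tail are assumptions for illustration.

```python
class RecognitionGate:
    """Drops recognition results that overlap narrator or NPC speech."""

    def __init__(self, tail_seconds=0.5):
        # Extra time after game speech ends, to cover room echo (assumed value).
        self.tail_seconds = tail_seconds
        self._muted_until = 0.0

    def on_game_speech(self, start_time, duration):
        """Call whenever the game starts playing a spoken line."""
        end = start_time + duration + self.tail_seconds
        self._muted_until = max(self._muted_until, end)

    def accept(self, result_time):
        """Return True if a recognition result at result_time should be used."""
        return result_time >= self._muted_until
```

Filtering results avoids changing engine state mid-game, at the cost of still spending recognition time on audio that will be thrown away.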

Implementation

Final Grammar
• Most complex command grammar:
  • Concurrent detection of 16 different phrases
  • Mapped to 9 distinct commands (“Sit” equivalent to “Sit down”)
  • Name recognition also running
• Some state-based selection of different grammars
  • However this was worst-case scenario (most rules active)
• Manually specifying phonemes for a given rule can help increase recognition accuracy
  • May be needed for proprietary or game-specific terms like character names
  • Built-in text-to-phoneme rules may not work well in these cases

<rule id="reserved" scope="public">
  <one-of>
    <item>
      <token sapi:display="Kinect" sapi:pron="K IH N EH K T"> kinect </token>
    </item>
  </one-of>
</rule>

Playing <tag>
• <tag> element allows a single semantic to be associated with multiple utterances
  • Also provides language invariance
• Great way to encode per rule data
  • Accept confidence threshold, for example
  • Parsed at run-time, so don’t go overboard

<item>
  <one-of>
    <item> sit </item>
    <item> sit down </item>
  </one-of>
  <tag>Sit</tag>
</item>

<item>
  <one-of>
    <item> go play <tag>conf=0.45</tag> </item>
  </one-of>
  <tag>Dismiss</tag>
</item>
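To make the two uses of <tag> concrete, the sketch below reads the items shown above and builds a phrase table of (semantic, accept threshold). It uses Python's standard `xml.etree.ElementTree` rather than any speech engine; the wrapping rule id `commands`, the default threshold of 0.6 and the `resolve` helper are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# The two items from the slide, wrapped in a hypothetical rule.
RULE_XML = """
<rule id="commands" scope="public">
  <one-of>
    <item>
      <one-of>
        <item> sit </item>
        <item> sit down </item>
      </one-of>
      <tag>Sit</tag>
    </item>
    <item>
      <one-of>
        <item> go play <tag>conf=0.45</tag> </item>
      </one-of>
      <tag>Dismiss</tag>
    </item>
  </one-of>
</rule>
"""

DEFAULT_CONF = 0.6  # assumed fallback threshold, not from the talk

def phrase_table(rule_xml, default_conf=DEFAULT_CONF):
    """Map each utterance to (semantic, accept-confidence threshold)."""
    table = {}
    root = ET.fromstring(rule_xml)
    for command_item in root.findall("./one-of/item"):
        semantic = command_item.findtext("tag").strip()
        for phrase_item in command_item.findall("./one-of/item"):
            phrase = (phrase_item.text or "").strip()
            conf = default_conf
            tag = phrase_item.find("tag")
            if tag is not None and tag.text.strip().startswith("conf="):
                conf = float(tag.text.strip()[len("conf="):])
            table[phrase] = (semantic, conf)
    return table

def resolve(table, phrase, confidence):
    """Return the semantic for a recognized phrase, or None if below threshold."""
    semantic, threshold = table[phrase]
    return semantic if confidence >= threshold else None
```

Game code then switches on "Sit" or "Dismiss" only, so translated grammars can change the utterances without touching gameplay logic.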

Please Stop Talking
• Speech is unpredictable
  • Valid utterances may vary widely in length
  • Background noise may end up being processed for recognition
• Changing state of Speech Recognition engine may incur unexpected synchronization delays
  • Can occur when stopping recognition, changing rule states or loading new grammars
• Bugs can become highly context-sensitive
  • May see occasional frame-outs when tested in noisy open-plan area, but not when tested in closed office
• Easiest option: run all game-side speech processing on separate thread
  • Move off the h/w thread that the main game is using
  • Speech will typically not saturate a core
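The separate-thread option can be outlined as a worker that owns every blocking engine call, while the game loop only polls a queue. This shows the hand-off pattern only, not hardware-thread placement; `fake_recognizer` is a stand-in invented for the example, and a real title would call its speech middleware there.

```python
import queue
import threading

def fake_recognizer(audio_chunk):
    """Stand-in for a blocking engine call (hypothetical)."""
    return audio_chunk.strip().lower() or None

class SpeechWorker:
    """Runs recognition off the game thread; the game polls for results."""

    def __init__(self, recognize):
        self._recognize = recognize
        self._audio = queue.Queue()
        self._results = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, audio_chunk):
        self._audio.put(audio_chunk)

    def poll(self):
        """Called once per frame by the game loop; never blocks."""
        results = []
        while True:
            try:
                results.append(self._results.get_nowait())
            except queue.Empty:
                return results

    def stop(self):
        self._audio.put(None)  # sentinel tells the worker to exit
        self._thread.join()

    def _run(self):
        while True:
            chunk = self._audio.get()
            if chunk is None:
                break
            result = self._recognize(chunk)
            if result is not None:
                self._results.put(result)
```

Because rule changes and grammar loads would also be routed through the worker, any synchronization delay stalls the worker rather than the frame.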

Name Your Animal (NYA)
• Allow player to speak name they want to use
• No attempt to turn spoken name into real text (for display)
  • Instead use a pictorial (camera capture) representation for identification
• Implemented as free form speech to phoneme conversion
  • Then use phonemes to build a grammar rule with custom pronunciation
• Name used to attract animal’s attention, just as it would be in real life
• Pushes the limits of NuiSpeech

NYA Challenges
• Used a special grammar for speech to phoneme conversion
  • Much larger than normal command grammars
  • 11.5MB for largest NYA grammar vs. 5KB for largest command grammar
• Also requires a dynamic grammar to add the “name” rule to
  • So even more memory for the acoustic model
• Much more sensitive to environmental noise than the normal speech commands
  • Naming process would sometimes drive itself to completion from noise in the room
• Watched for some reserved terms (“Kinect”, “Xbox”), no attempt to catch swearing etc.
  • Space of potentially prohibited terms simply too large
• Reject names that are too long as difficult for the player to repeat successfully
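Turning a captured phoneme string into a grammar rule, with the length and reserved-term checks described above, might be sketched like this. It is not the Kinectimals implementation: the rule is assembled as plain text, the 12-phoneme limit and the function name are assumptions, and the only reserved pronunciation listed is the "Kinect" one that appears in the talk's own rule example.

```python
from xml.sax.saxutils import escape, quoteattr

MAX_PHONEMES = 12  # assumed limit; tune through play testing
RESERVED_PRONUNCIATIONS = {"K IH N EH K T"}  # "Kinect"

def build_name_rule(rule_id, phonemes, label="name"):
    """Return SRGS-style rule text for a spoken name, or raise ValueError."""
    if not phonemes:
        raise ValueError("no phonemes captured")
    if len(phonemes) > MAX_PHONEMES:
        raise ValueError("name too long to repeat reliably")
    pron = " ".join(phonemes)
    if pron in RESERVED_PRONUNCIATIONS:
        raise ValueError("reserved term")
    return "\n".join([
        f'<rule id={quoteattr(rule_id)} scope="public">',
        "  <item>",
        f"    <token sapi:pron={quoteattr(pron)}> {escape(label)} </token>",
        "  </item>",
        "</rule>",
    ])
```

The returned text would then be added to the dynamic grammar; a rejected name sends the player back to the naming prompt.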

NYA Flow
• One utterance unlikely to be sufficient to get the “right” name
• Allow a number of attempts to successfully repeat name
  • Hopefully deals with player trying to mislead the system
• If no repeat in sensible number of attempts, prompt player to try different name
  • Try to avoid player getting stuck trying to repeat “problem” name
• Balance ease of use when player is using the system “correctly” against rejecting noise as a naming attempt

Example: “CH I Z AX N” generated vs. “CH I T AX” ideal string
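The repeat-and-confirm flow can be illustrated by comparing phoneme strings with an edit distance. This is a swapped-in technique for illustration, since the talk does not say how repeats were matched; the attempt limit and the allowed distance are assumed tuning values.

```python
def phoneme_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            cur.append(min(prev[j] + 1,          # drop a phoneme from a
                           cur[j - 1] + 1,       # insert a phoneme from b
                           prev[j - 1] + cost))  # match or substitute
        prev = cur
    return prev[-1]

def confirm_name(first, repeats, max_attempts=3, max_distance=1):
    """Return the attempt number whose repeat matched the first utterance.

    Returns None when no repeat is close enough within max_attempts,
    which is the cue to prompt the player for a different name.
    """
    for attempt, heard in enumerate(repeats[:max_attempts], start=1):
        if phoneme_distance(first, heard) <= max_distance:
            return attempt
    return None
```

With these settings the generated string “CH I Z AX N” is two edits away from the ideal “CH I T AX”, so it would not count as a successful repeat; loosening `max_distance` trades that strictness against accepting room noise as a name.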

NYA Internationalization
• Separate speech to phoneme grammar for each NuiSpeech language
• NYA accuracy varies across languages
• US English NYA works well for languages other than English
  • Tested in 11 additional countries
  • Allowed us to support NYA in countries that weren’t supported by NuiSpeech at launch

Testing

The Challenge of Testing Speech
• Human beings very good at spotting patterns
  • Even non-existent ones
• Easy to find reasons why speech works better or worse
  • “Speech works better when I wear a blue shirt!”
• In reality, recognition strongly influenced by exact acoustic environment
  • So test with lots of people, and lots of different conditions
  • Individual office vs. open plan
• Look at whether player successfully completes tasks with speech
  • Not just whether individual commands are recognised (too conservative)
  • Watch out for commands that never seem to work however!
• Make low-level speech success / failure events visible
  • On-screen log is very useful

Heed the Advice of W. C. Fields
• Never work with children or animals
  • Kinectimals had both…
• Recognition confidences for children inherently lower than for adults
  • Can be self-conscious about “talking to the TV” leading to them not speaking clearly
  • If they become frustrated, they may shout or do other things that make recognition worse, not better
  • Tutor them through which speech commands to use, and how best to say them
• Set confidence thresholds lower and accept some degree of False Accepts for adult speakers
  • This can be difficult since your test / development team will get a worse experience
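The threshold trade-off can be examined by replaying logged confidences at several candidate thresholds. A minimal sketch, assuming each logged sample is a pair of (confidence, whether it was a valid command); the sample values in the usage note are made up for illustration and are not Kinectimals data.

```python
def rates_at(threshold, samples):
    """Return (false_reject_rate, false_accept_rate) at a threshold.

    samples is a list of (confidence, is_valid_command) pairs.
    """
    valid = [c for c, ok in samples if ok]
    invalid = [c for c, ok in samples if not ok]
    false_rejects = sum(c < threshold for c in valid)
    false_accepts = sum(c >= threshold for c in invalid)
    fr_rate = false_rejects / len(valid) if valid else 0.0
    fa_rate = false_accepts / len(invalid) if invalid else 0.0
    return fr_rate, fa_rate
```

For example, with hypothetical child-speaker samples `[(0.42, True), (0.55, True), (0.38, True), (0.30, False), (0.40, False)]`, a threshold of 0.6 rejects every valid command, while 0.35 accepts all of them but also lets through half of the invalid audio.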

What We Learnt
• Integration of speech recognition system straightforward (even with NYA)
  • But testing hard and time-consuming!
• Look at task completion, not purely at recognition accuracy
  • Players will probably not notice occasionally having to repeat commands
• Contrast issuing commands to the game, versus talking to an in-game character
  • Issuing commands: small command set, but very high accuracy required
  • Talking to character: more tolerant of failed recognition, but larger command set, or even natural language expected
• Naming things via speech is hard
  • You probably won’t have access to generic speech-to-text capabilities
  • If you can, use text input to acquire the name and then add it dynamically as a grammar rule
  • You may want a custom lexicon of common / difficult names to ensure correct phonemes used
• Accept you may not be able to please everyone all the time
  • Weight success towards your primary audience

Thank you to…
• Xbox Platform Speech Team
• Kinectimals Team at Frontier
• No animals were harmed in the making of this game
  • A few testers lost their voices however
