Listening to the Gamer: Getting Speech Recognition in Games Right

Speaker Information

Jason Hewitt, Advanced Technology Group, Microsoft

Dr. Mike Froggatt, Developer Lead, Microsoft Game Studios

Audience

Are you thinking about adding speech to your game?

Are you targeting a console or the PC? Portable platforms are also good!

Takeaway: Speech is on our consoles, and it’s easy to add.

Speech in games is not new!

It was in Unreal Tournament 2004! It was on the PS2! It’s there, ready to use.

Two ways to listen

General Dictation

Command and Control

Multiple Solutions

Fonix http://www.fonixspeech.com

• Platforms: Win32, Xbox 360, PS3, Wii

• Languages: US English, UK English, French, German, Italian, Spanish, Japanese, Korean

Voxler http://www.voxler.eu

• Platforms: Win32, Xbox 360, PS3, Wii, Nintendo DS, iPhone

• Languages: “All major English dialects and European languages”

NuiSpeech

• Kinect only

• Languages: English (US & UK), French, Japanese, Spanish (Mexico)

• Preview models of French (Canadian and France), German, Italian, Spanish (Spain) and English (Australian)

• Designed specifically for a 10ft experience

Others are out there

Microphones Overview

Platform | Headset Mic | Handheld Mic | Mic Array/Room Mic | Integrated

PC ? ?

Xbox 360 ? ? Kinect only

PS3 ? ? Eye only

Wii ? Wii Speak only

Sony NGP Yes

3DS/DS/DSi Yes

Smart phones Yes

Each platform has its own microphones and capabilities, so you can either target the lowest common denominator or customize for each platform’s strengths.

Speech has two Inputs

Grammar + Speech → Recognition Engine → Results

Deciding Speech’s Role in the Game

Apply Good Design Principles

Set your goals at the beginning of the project:

• Don’t add speech recognition with a month to spare

Evaluate the tech

Prototype

Rethink the goals

Be consistent

• Users expect that what works once works always

Decide early if you want to count on a mic

• Remember: requiring Kinect or Move = free mic

To require a mic?

It’s natural!

New gameplay mechanics

Expands user control

Controller fallbacks not necessary

• But are still a plus!

Or to not require a mic?

Not everyone will have a mic

Accessibility

• Some gamers won’t or can’t talk

Menus and Pausing

Don’t add at the last second!

Think of your menu names and your grammar design at the same time

If you do implement it, let users skip menu pages

Beware of a “Pause”

• False positives can break flow

• It may be best to gather intent before acting

• Consider allowing users to disable
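One way to keep a false “Pause” from breaking flow is to treat a borderline match as a question rather than an action. The sketch below is only an illustration of that idea: it assumes the recognizer reports a confidence between 0 and 1, and the function name, thresholds, and return values are all hypothetical rather than part of any speech SDK.

```python
def handle_pause_result(confidence, voice_pause_enabled=True,
                        act_threshold=0.80, ask_threshold=0.50):
    """Decide what to do when the recognizer reports the "pause" phrase.

    Returns "pause", "confirm" (show a prompt the player can dismiss),
    or "ignore". The thresholds are illustrative, not engine defaults.
    """
    if not voice_pause_enabled:
        return "ignore"      # the player switched voice pausing off
    if confidence >= act_threshold:
        return "pause"       # confident enough to interrupt play
    if confidence >= ask_threshold:
        return "confirm"     # gather intent before interrupting play
    return "ignore"          # most likely noise or side talk
```

The two thresholds would need tuning per title through play testing.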

Key Scenario Integration

Focused on scenarios in games that can provide the biggest impact

• Dialog tree navigation

• Merchant/shop interaction

Most ideas here can again be optional; allow the controller to be a backup

Full Title Integration

Doesn’t mean voice only (but could)

When approaching the game’s control scheme, consider if voice makes sense, for example:

• Squad commands

• Activation of controls

• Help mechanisms

• Volume of player’s speech levels in a horror or stealth game

What can I say?

Teaching styles

• “See it! Say it!” then “Know it! Say it!”

• Repeat after me

• Explore on your own

Screen awareness

Off-screen awareness

Expandable Menu

[Diagram: expandable speech menu. Example phrases shown: “Soldier!”, “Joe”, “Frank”, “Attack”, “Defend”, “Retreat”, “How’s the weather?”]

The Grammar

The Basics

XML based

• All use W3C format or a subset of the format: http://www.w3.org/TR/speech-grammar/

Multiple rules can be activated or deactivated at once

Custom pronunciations are available

• This helps with in-game items

• This can also help with words that are difficult to pronounce or understand
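To make the XML format concrete, here is a small sketch that parses a W3C-style grammar with Python’s standard `xml.etree.ElementTree` and lists the phrases under each rule. The grammar content (`squad_commands` and its three phrases) is made up for illustration, and `list_phrases` is a hypothetical helper, not part of any speech middleware.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C Speech Recognition Grammar Specification.
SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Illustrative grammar; the rule id and phrases are invented for this sketch.
GRAMMAR_XML = """
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0" root="squad_commands">
  <rule id="squad_commands" scope="public">
    <one-of>
      <item>attack</item>
      <item>defend</item>
      <item>retreat</item>
    </one-of>
  </rule>
</grammar>
"""


def list_phrases(grammar_xml):
    """Return {rule id: [phrases]} for every <rule> in the grammar."""
    root = ET.fromstring(grammar_xml)
    phrases = {}
    for rule in root.iter(f"{{{SRGS_NS}}}rule"):
        items = rule.iter(f"{{{SRGS_NS}}}item")
        phrases[rule.get("id")] = [
            " ".join(item.text.split())       # normalize whitespace
            for item in items
            if item.text and item.text.strip()
        ]
    return phrases
```

A listing like this is handy for review: it shows exactly which phrases a rule puts in front of the recognizer.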

Grammar Size

Check with your middleware provider on how many phrases are recommended

• Key point: going beyond the recommended number of phrases means more chances for phrases to be similar and confusable

Manage active phrases with rules

• Remember you don’t need the shopkeeper recognition when fighting the dragon

• Pause menu interaction should reduce the set of active rules
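Managing active phrases by game state might look like the sketch below. It only tracks which rule ids should be active and reports what to switch on or off; the state names, rule ids, and the `RuleSet` class are hypothetical, and the actual activate/deactivate calls depend on your speech middleware.

```python
class RuleSet:
    """Tracks which grammar rules should be active for the current game state."""

    # Hypothetical mapping from game state to the rule ids worth listening for.
    RULES_BY_STATE = {
        "shop": {"shop_commands", "global_commands"},
        "combat": {"squad_commands", "global_commands"},
        "paused": {"pause_menu"},
    }

    def __init__(self):
        self.active = set()

    def enter_state(self, state):
        """Switch to `state`; return (rules to activate, rules to deactivate)."""
        wanted = self.RULES_BY_STATE[state]
        to_activate = wanted - self.active
        to_deactivate = self.active - wanted
        self.active = set(wanted)
        return to_activate, to_deactivate
```

Returning only the difference keeps the number of engine state changes small when moving between states that share rules.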

Evolving the Grammar

Start with a small initial word set

• Do not proactively add too many recognition phrases

• See through play testing where gamers go

• Handle the common cases

Synonyms are a slippery slope

• Especially in a “See it! Say it!” scenario

Multiple iterations provide better tuning

A Work in Progress

Design → Implement → Test

Test Each Iteration!

Record your users saying phrases both in and out of grammar

Consider automated nightly tests of each grammar iteration

• Measure false negatives, false positives, and success rates

• Test in game scenarios

If two grammars are active at the same time, you must test them together
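A nightly test can replay recorded utterances through the recognizer and score the outcomes. The sketch below assumes each recording is labeled with the command it should produce (or `None` if it is out of grammar) and that the recognizer returns a command or `None`; the function name and the exact metric definitions are assumptions for illustration.

```python
from collections import Counter


def score_results(results):
    """Score (expected, recognized) pairs from a replayed test set.

    expected is None for out-of-grammar recordings; recognized is None
    when the recognizer rejected the utterance. Returns rates in [0, 1].
    """
    counts = Counter()
    for expected, recognized in results:
        if recognized is None:
            # Nothing recognized: right for out-of-grammar audio, wrong otherwise.
            counts["correct_reject" if expected is None else "false_negative"] += 1
        elif recognized == expected:
            counts["success"] += 1
        else:
            # Out-of-grammar audio accepted, or the wrong command returned.
            counts["false_positive"] += 1
    total = len(results)
    keys = ("success", "false_negative", "false_positive", "correct_reject")
    return {k: (counts[k] / total if total else 0.0) for k in keys}
```

Comparing these rates between grammar iterations shows whether a newly added phrase made existing phrases more confusable.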

Working with Limitations

Speech is not perfect

Generally speech works best when:

• Background noise is minimal

• Speaker enunciates

• The grammar size is within recommendation

Working with Side talk

There may be other noises in the room that the mic picks up

Remember you can still respond to side talk! “Hey, you talking to me?”

“Sorry, my (language) is very limited.”

Test with a garbage rule

Working with Failure

Even a speech recognition failure should be a success

Handle misfires and repeats as part of the game

• NPCs can have headaches, migraines, or explain their misunderstanding

• “Sorry, what was that? I was thinking about sheep.”
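Side talk and failed recognition can share one small dispatch point. The sketch below assumes the grammar has a catch-all rule with the id `garbage` and that the engine reports the matched rule id and a confidence; the function name and threshold are hypothetical, and the lines of dialogue are the examples from the slides.

```python
def npc_reaction(rule_id, confidence, accept_threshold=0.5):
    """Return an in-character line for side talk or a misfire, else None.

    None means the result is good enough to run as a normal command.
    """
    if rule_id == "garbage":
        # Speech that matched only the catch-all rule: treat as side talk.
        return "Hey, you talking to me?"
    if confidence < accept_threshold:
        # A real phrase, but too uncertain to act on: ask for a repeat.
        return "Sorry, what was that? I was thinking about sheep."
    return None
```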

Localization

Begin localization after most design decisions are locked down

• Iterate and design in your native language

• Begin before it’s too late to work with translators, manual, etc.

Be wary of text/UI translators

• Spoken language can differ from the written language

Recommend audio translators

• Leverage your existing in-game dialogue translation team

• They know the right voice to use for communication

• “See it! Say it!” implementations will need to be translated by this same team

Have native speakers testing

• More than one native speaker is always better

Localization

Provide plenty of background on the situation to the translator. The more info, the better.

You should be doing this for in-game dialogue already; your team’s localization expert will be able to provide guidance here.

Different languages map one word to three words and three to one, so provide context for each situation

Remember to coordinate changes across languages

Listening to the Gamer: Getting Speech Recognition in Games Right

Kinectimals Speech Post-Mortem

“If I could talk with the animals…”

Kinectimals was the standard-bearer for speech recognition at Kinect launch

Lofty goals:

• Natural interaction with animal through speech

• Praise, issue commands, call animal by name

Ultimately delivered robust recognition for a reasonable command set

Animal naming most challenging component to implement

Goal: Perfect Feline Behaviour

Design

<grammar xml:lang="en-us" version="1.0" root="dash_commands">
  <rule id="dash_commands" scope="public">
    <one-of>
      <item>
        Hey, is this thing on? Xbox, can you hear me? Hey Jimmy! Come look at this! The Xbox understands me!
        <tag>
          exec "dash.xex /upgrade_to_gold_account /quiet"
        </tag>
      </item>
      <item>
        Oh Xbox, you’re my only friend - my girlfriend’s left me and no one understands me like you do.
        <tag>
          exec "halo_reach.xex"
        </tag>
      </item>
    </one-of>
  </rule>
</grammar>

Design Giveth…

Game design is our friend

• No expectation of animals understanding speech perfectly

• Player more forgiving of incorrect or failed recognition

• Children interpreted failed recognition as animal “being naughty”

…Design Taketh Away

Design is our enemy

• Familiar situation produces habitual response

• Expectation that what a real animal responds to, the game will respond to

Commands framed with non-essential vocal noise

• “Hey Skittles, sit down, please”

Speech commands often mode-less

• Where to allow / disallow them?

Don’t Both Talk at Once

Narrator character introduced late in design

• Gave instruction on gestures and speech commands to use

• Narrator saying “Sit down” often made animal sit down

Specific hardware can help with this

• Kinect has array microphone with Multichannel Echo Cancellation (MEC)

• Effectiveness dependent on microphone calibration

Better to avoid issue altogether if possible

• Disable speech recognition while narrator speaking

• Watch out for NPC speech triggering commands during gameplay

• Example: team-mates shouting “Take cover!”
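Disabling recognition while the game itself is talking can be sketched as a simple time gate. This is an assumption about how one might do it, not the Kinectimals implementation: `SpeechGate`, its method names, and the half-second tail are hypothetical, and times are game-clock seconds.

```python
class SpeechGate:
    """Ignores recognition results that arrive while game dialogue is playing."""

    def __init__(self, tail_seconds=0.5):
        # Extra time after a line ends, to cover room echo and engine latency.
        self.tail_seconds = tail_seconds
        self.blocked_until = 0.0

    def on_game_audio(self, start_time, duration):
        """Call when a narrator or NPC line containing command words starts."""
        end = start_time + duration + self.tail_seconds
        self.blocked_until = max(self.blocked_until, end)

    def accepts(self, utterance_time):
        """True if a result timestamped `utterance_time` should be acted on."""
        return utterance_time >= self.blocked_until
```

A gate like this is cheap, but it also drops genuine player commands spoken over the narrator, so it suits tutorial moments better than busy gameplay.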

Implementation

Final Grammar

Most complex command grammar:

• Concurrent detection of 16 different phrases

• Mapped to 9 distinct commands (“Sit” equivalent to “Sit down”)

• Name recognition also running

Some state-based selection of different grammars

• However this was worst-case scenario (most rules active)

Manually specifying phonemes for a given rule can help increase recognition accuracy

• May be needed for proprietary or game-specific terms like character names

• Built-in text-to-phoneme rules may not work well in these cases

<rule id="reserved" scope="public">
  <one-of>
    <item>
      <token sapi:display="Kinect" sapi:pron="K IH N EH K T"> kinect </token>
    </item>
  </one-of>
</rule>

Playing <tag>

<tag> element allows a single semantic to be associated with multiple utterances

• Also provides language invariance

Great way to encode per-rule data

• Accept confidence threshold, for example

Parsed at run-time, so don’t go overboard

<item>
  <one-of>
    <item> sit </item>
    <item> sit down </item>
  </one-of>
  <tag>Sit</tag>
</item>

<item>
  <one-of>
    <item> go play <tag>conf=0.45</tag> </item>
  </one-of>
  <tag>Dismiss</tag>
</item>
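To show how the game side might read those tags, the sketch below parses the two example items with Python’s standard `xml.etree.ElementTree` and builds a table from phrase to semantic and accept threshold. The enclosing `<rule>`, the `conf=` parsing convention, the default threshold, and `build_phrase_table` are assumptions made for this illustration; a real engine would hand back the tag text with its recognition result.

```python
import xml.etree.ElementTree as ET

# The two items from the slide, wrapped in a <rule> so they parse as one document.
RULE_XML = """
<rule id="commands" scope="public">
  <one-of>
    <item>
      <one-of>
        <item> sit </item>
        <item> sit down </item>
      </one-of>
      <tag>Sit</tag>
    </item>
    <item>
      <one-of>
        <item> go play <tag>conf=0.45</tag> </item>
      </one-of>
      <tag>Dismiss</tag>
    </item>
  </one-of>
</rule>
"""

DEFAULT_CONFIDENCE = 0.6  # hypothetical default, not a value from the talk


def build_phrase_table(rule_xml):
    """Map each phrase to (semantic, accept_threshold)."""
    table = {}
    rule = ET.fromstring(rule_xml)
    for command in rule.findall("one-of/item"):
        # The outer <tag> carries the semantic shared by all phrasings.
        semantic = command.find("tag").text.strip()
        for phrase_item in command.findall("one-of/item"):
            phrase = " ".join((phrase_item.text or "").split())
            threshold = DEFAULT_CONFIDENCE
            # An inner <tag> may override the threshold for one phrasing.
            override = phrase_item.find("tag")
            if override is not None and override.text.strip().startswith("conf="):
                threshold = float(override.text.strip().split("=", 1)[1])
            table[phrase] = (semantic, threshold)
    return table
```

Because game code only sees `Sit` or `Dismiss`, localized grammars can change the phrases without touching the game logic.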

Please Stop Talking

Speech is unpredictable

• Valid utterances may vary widely in length

• Background noise may end up being processed for recognition

Changing state of Speech Recognition engine may incur unexpected synchronization delays

• Can occur when stopping recognition, changing rule states or loading new grammars

Bugs can become highly context-sensitive

• May see occasional frame-outs when tested in noisy open-plan area, but not when tested in closed office

Easiest option: run all game-side speech processing on separate thread

• Move off the h/w thread that the main game is using

• Speech will typically not saturate a core
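The hand-off between a speech thread and the game loop can be sketched with two queues. Python is used here only to show the pattern; on a console this would be a native thread placed on a different hardware thread. `speech_worker`, `drain_commands`, and the `interpret` callback are hypothetical names.

```python
import queue
import threading


def speech_worker(raw_events, commands, interpret):
    """Run on the speech thread: turn raw engine events into game commands."""
    while True:
        event = raw_events.get()          # blocks here, never in the game loop
        if event is None:                 # sentinel: shut the worker down
            break
        command = interpret(event)        # may be slow; that is fine off-thread
        if command is not None:
            commands.put(command)


def drain_commands(commands):
    """Call once per frame from the game thread; never blocks."""
    drained = []
    while True:
        try:
            drained.append(commands.get_nowait())
        except queue.Empty:
            return drained
```

A usage sketch: start `threading.Thread(target=speech_worker, args=(raw, cmds, interpret))`, push engine events into `raw`, and call `drain_commands(cmds)` each frame.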

Name Your Animal (NYA)

Allow player to speak name they want to use

• No attempt to turn spoken name into real text (for display)

• Instead use a pictorial (camera capture) representation for identification

Implemented as free-form speech-to-phoneme conversion

• Then use phonemes to build a grammar rule with custom pronunciation

Name used to attract animal’s attention, just as it would be in real life

Pushes the limits of NuiSpeech

NYA Challenges

Used a special grammar for speech-to-phoneme conversion

• Much larger than normal command grammars

• 11.5MB for largest NYA grammar vs. 5KB for largest command grammar

Also requires a dynamic grammar to add the “name” rule to

• So even more memory for the acoustic model

Much more sensitive to environmental noise than the normal speech commands

• Naming process would sometimes drive itself to completion from noise in the room

Watched for some reserved terms (“Kinect”, “Xbox”), no attempt to catch swearing etc.

• Space of potentially prohibited terms simply too large

Reject names that are too long as difficult for the player to repeat successfully
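The reserved-term and length checks could be as simple as the sketch below, which works on the space-separated phoneme string produced by the naming step. The only pronunciation used is the “Kinect” one shown in the grammar slide; the phoneme limit, the function name, and the result strings are hypothetical.

```python
# Pronunciation taken from the "reserved" rule shown earlier in the talk.
RESERVED_PRONUNCIATIONS = {"K IH N EH K T"}

MAX_PHONEMES = 12  # hypothetical limit; tune through play testing


def validate_name(phonemes):
    """Classify a candidate name given as a space-separated phoneme string."""
    tokens = phonemes.split()
    if not tokens:
        return "empty"
    if len(tokens) > MAX_PHONEMES:
        return "too_long"     # hard for the player to repeat reliably
    if " ".join(tokens) in RESERVED_PRONUNCIATIONS:
        return "reserved"     # collides with a system term
    return "ok"
```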

NYA Flow

One utterance unlikely to be sufficient to get the “right” name

• Allow a number of attempts to successfully repeat name

• Hopefully deals with player trying to mislead the system

If no repeat in sensible number of attempts, prompt player to try different name

• Try to avoid player getting stuck trying to repeat “problem” name

Balance ease of use when player is using the system “correctly” against rejecting noise as a naming attempt

“CH I Z AX N” generated

“CH I T AX” ideal string
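The talk does not say how a repeated name was compared with the first attempt. One plausible approach, shown here purely as an assumption, is an edit distance over phonemes: count the substitutions, insertions, and deletions needed to turn one phoneme string into the other, and accept a repeat when the count is small. `phoneme_distance`, `is_repeat`, and the `max_distance` default are hypothetical.

```python
def phoneme_distance(a, b):
    """Levenshtein distance between two space-separated phoneme strings."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))        # distance from empty prefix of a
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete pa
                            curr[j - 1] + 1,      # insert pb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def is_repeat(candidate, attempt, max_distance=1):
    """True if `attempt` is close enough to count as repeating `candidate`."""
    return phoneme_distance(candidate, attempt) <= max_distance
```

For the strings on the slide, `phoneme_distance("CH I Z AX N", "CH I T AX")` is 2 (one substitution and one deletion), so with `max_distance=1` the generated string would not count as a repeat of the ideal one.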

NYA Internationalization

Separate speech to phoneme grammar for each NuiSpeech language

NYA accuracy varies across languages

US English NYA works well for languages other than English

• Tested in 11 additional countries

Allowed us to support NYA in countries that weren’t supported by NuiSpeech at launch

Testing

The Challenge of Testing Speech

Human beings very good at spotting patterns

• Even non-existent ones

Easy to find reasons why speech works better or worse

• “Speech works better when I wear a blue shirt!”

In reality, recognition strongly influenced by exact acoustic environment

So test with lots of people, and lots of different conditions

• Individual office vs. open plan

Look at whether player successfully completes tasks with speech

• Not just whether individual commands are recognised (too conservative)

• Watch out for commands that never seem to work however!

Make low-level speech success / failure events visible

• On-screen log is very useful
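The difference between per-command accuracy and task completion is easy to see in numbers. In the sketch below each task is a list of booleans, one per spoken attempt, recording whether that attempt was recognised; this data layout and the `summarize` helper are assumptions for illustration.

```python
def summarize(tasks):
    """Compare per-utterance accuracy with task completion.

    tasks: list of lists of booleans, one inner list per task, one
    boolean per spoken attempt (True = recognised correctly).
    """
    attempts = sum(len(t) for t in tasks)
    recognized = sum(sum(t) for t in tasks)
    completed = sum(1 for t in tasks if any(t))
    return {
        "utterance_accuracy": recognized / attempts if attempts else 0.0,
        "task_completion": completed / len(tasks) if tasks else 0.0,
        # Tasks where no attempt ever worked deserve a closer look.
        "never_worked": [i for i, t in enumerate(tasks) if t and not any(t)],
    }
```

With `[[False, True], [True], [False, False, True], [False, False]]`, utterance accuracy is 0.375 while task completion is 0.75, and the last task is flagged as never working.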

Heed the Advice of W. C. Fields

Never work with children or animals

• Kinectimals had both…

Recognition confidences for children inherently lower than for adults

• Can be self-conscious about “talking to the TV” leading to them not speaking clearly

• If they become frustrated, they may shout or do other things that make recognition worse, not better

• Tutor them through which speech commands to use, and how best to say them

Set confidence thresholds lower and accept some degree of False Accepts for adult speakers

• This can be difficult since your test / development team will get a worse experience

What We Learnt

Integration of speech recognition system straightforward (even with NYA)

• But testing hard and time-consuming!

Look at task completion, not purely at recognition accuracy

• Players will probably not notice occasionally having to repeat commands

Contrast issuing commands to the game, versus talking to an in-game character

• Issuing commands: small command set, but very high accuracy required

• Talking to character: more tolerant of failed recognition, but larger command set, or even natural language expected

Naming things via speech is hard

• You probably won’t have access to generic speech-to-text capabilities

• If you can, use text input to acquire the name and then add it dynamically as a grammar rule

• You may want a custom lexicon of common / difficult names to ensure correct phonemes used

Accept you may not be able to please everyone all the time

• Weight success towards your primary audience

Thank you to…

Xbox Platform Speech Team

Kinectimals Team at Frontier

No animals were harmed in the making of this game

• A few testers lost their voices however
