Listening to the Gamer: Getting Speech Recognition in Games Right
TRANSCRIPT
Speaker Information
Jason Hewitt, Advanced Technology Group, Microsoft
Dr. Mike Froggatt, Developer Lead, Microsoft Game Studios
Audience
Are you thinking about adding speech to your game?
Are you targeting a console or the PC? Portable platforms are also good!
Takeaway: Speech is on our consoles, and it’s easy to add.
Speech in games is not new!
It was in Unreal Tournament 2004! It was on the PS2! It’s there, ready to use.
Two ways to listen
General Dictation
Command and Control
Multiple Solutions
Fonix http://www.fonixspeech.com
• Platforms: Win32, Xbox 360, PS3, Wii
• Languages: US English, UK English, French, German, Italian, Spanish, Japanese, Korean
Voxler http://www.voxler.eu
• Platforms: Win32, Xbox 360, PS3, Wii, Nintendo DS, iPhone
• Languages: “All major English dialects and European languages”
NuiSpeech
• Kinect only
• Languages: English (US & UK), French, Japanese, Spanish (Mexico)
• Preview models of French (Canadian and France), German, Italian, Spanish (Spain) and English (Australian)
• Designed specifically for a 10ft experience
Others are out there
Microphones Overview
Platform | Headset Mic | Handheld Mic | Mic Array/Room Mic | Integrated
PC ? ?
Xbox 360 ? ? Kinect only
PS3 ? ? Eye only
Wii ? Wii Speak only
Sony NGP Yes
3DS/DS/DSi Yes
Smart phones Yes
Each platform has its own microphones and capabilities, so you can either target the lowest common denominator or customize to each platform’s strengths.
Speech has two Inputs
Grammar + Speech → Recognition Engine → Results
Deciding Speech’s Role in the Game
Apply Good Design Principles
Set your goals at the beginning of the project:
• Don’t add speech recognition with a month to spare
• Evaluate the tech
• Prototype
• Rethink the goals
Be consistent
• Users expect that what works once works always
Decide early if you want to count on a mic
• Remember: requiring Kinect or Move = free mic
To require a mic?
• It’s natural!
• New gameplay mechanics
• Expands user control
• Controller fallbacks not necessary (but are still a plus!)
Or to not require a mic?
• Not everyone will have a mic
• Accessibility
• Some gamers won’t or can’t talk
Menus and Pausing
Don’t add at the last second!
• Think of your menu names and your grammar design at the same time
• If you do implement it, let users skip menu pages
Beware of a “Pause”
• False positives can break flow
• It may be best to gather (confirm) intent before acting
• Consider allowing users to disable it
Key Scenario Integration
Focused on scenarios in games that can provide the biggest impact
• Dialog tree navigation
• Merchant/shop interaction
Most ideas here can again be optional; allow the controller to be a backup
Full Title Integration
Doesn’t mean voice only (but could)
When approaching the game’s control scheme, consider if voice makes sense, for example:
• Squad commands
• Activation of controls
• Help mechanisms
• Volume of player’s speech levels in a horror or stealth game
What can I say?
Teaching styles
• “See it! Say it!” then “Know it! Say it!”
• Repeat after me
• Explore on your own
Screen awareness
Off-screen awareness
Expandable Menu
Soldier! Joe
Frank
Attack
Defend
Retreat
How’s the weather?
The Grammar
The Basics
• XML based
• All use W3C format or a subset of the format: http://www.w3.org/TR/speech-grammar/
• Multiple rules can be activated or deactivated at once
Custom pronunciations are available
• This helps with in-game items
• This can also help with words that are difficult to pronounce or understand
Grammar Size
Check with your middleware provider on how many phrases are recommended
• Key point: going beyond the recommended number of phrases means more chances for phrases to be similar and confusable
Manage active phrases with rules
• Remember you don’t need the shopkeeper recognition when fighting the dragon
• Pause menu interaction should reduce the set of active rules
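One way to manage active phrases with rules is to tie rule activation to game state. The sketch below is a minimal, engine-agnostic illustration: the rule names, the state names, and the `active_phrases` helper are all hypothetical and not part of any speech middleware API.

```python
# Hypothetical rule table: rule name -> phrases that rule recognises.
RULES = {
    "combat": ["attack", "defend", "retreat"],
    "shop": ["buy", "sell", "how much is that"],
    "pause_menu": ["resume", "options", "quit"],
}

# Hypothetical mapping from game state to the rules that should be active.
ACTIVE_RULES_BY_STATE = {
    "fighting_dragon": ["combat"],
    "talking_to_shopkeeper": ["shop"],
    "paused": ["pause_menu"],
}


def active_phrases(state):
    """Return the phrases the recogniser should listen for in this state."""
    phrases = []
    for rule_name in ACTIVE_RULES_BY_STATE.get(state, []):
        phrases.extend(RULES[rule_name])
    return phrases
```

Keeping the table in data makes it easy to see, per state, how close the active set is to the phrase count your middleware provider recommends.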
Evolving the Grammar
Start with a small initial word set
• Do not proactively add too many recognition phrases
• See through play testing where gamers go
• Handle the common cases
Synonyms are a slippery slope
• Especially in a “See it! Say it!” scenario
Multiple iterations provide better tuning
A Work in Progress
Design → Implement → Test
Test Each Iteration!
Record your users saying phrases both in and out of grammar
Consider automated nightly tests of each grammar iteration
• Measure false negatives, false positives, success rates
• Test in game scenarios
If two grammars are active at the same time, you must test them together
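A nightly test over recorded utterances could be scored along these lines. This is a sketch under assumptions: the `Trial` record, the `score` function, and the exact definitions of each rate are illustrative choices, not something the talk specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    """One recorded utterance replayed against a grammar iteration."""
    expected: Optional[str]    # intended command; None if the phrase is out of grammar
    recognized: Optional[str]  # command the engine returned; None if it rejected


def score(trials):
    """Return success, false-negative and false-positive rates for a batch."""
    in_grammar = [t for t in trials if t.expected is not None]
    out_of_grammar = [t for t in trials if t.expected is None]

    correct = sum(1 for t in in_grammar if t.recognized == t.expected)
    false_negatives = sum(1 for t in in_grammar if t.recognized is None)
    # Assumed definition: a false positive is any accepted result that should
    # not have been accepted (out-of-grammar speech that fired a command, or
    # an in-grammar phrase mapped to the wrong command).
    false_positives = sum(1 for t in out_of_grammar if t.recognized is not None)
    false_positives += sum(
        1 for t in in_grammar
        if t.recognized is not None and t.recognized != t.expected
    )

    total = len(trials)
    return {
        "success_rate": correct / len(in_grammar) if in_grammar else 0.0,
        "false_negative_rate": false_negatives / len(in_grammar) if in_grammar else 0.0,
        "false_positive_rate": false_positives / total if total else 0.0,
    }
```

Tracking these numbers per grammar iteration makes regressions visible when a newly added phrase turns out to be confusable with an existing one.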
Working with Limitations
Speech is not perfect
Generally speech works best when
• Background noise is minimal
• Speaker enunciates
• The grammar size is within recommendation
Working with Side talk
There may be other noises in the room that the mic picks up
Remember you can still respond to side talk!
• “Hey, you talking to me?”
• “Sorry, my (language) is very limited.”
Test with a garbage rule
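The W3C grammar format defines a special `GARBAGE` rule that can match arbitrary speech, which is one plausible reading of "garbage rule" here. The sketch below embeds a small grammar using it in a Python string and parses it to confirm it is well-formed. The rule id, the phrases, and the `SideTalk` tag are made up for illustration, and whether and how a given engine supports `GARBAGE` is something to check with your middleware provider.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical grammar: a known command, or "anything else" via the special
# GARBAGE rule, so side talk can land somewhere other than a real command.
GRAMMAR_XML = """\
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="top">
  <rule id="top" scope="public">
    <one-of>
      <item> sit <tag>Sit</tag> </item>
      <item> fetch <tag>Fetch</tag> </item>
      <item> <ruleref special="GARBAGE"/> <tag>SideTalk</tag> </item>
    </one-of>
  </rule>
</grammar>
"""


def special_rules(grammar_xml):
    """Return the names of special rules (e.g. GARBAGE) referenced in a grammar."""
    root = ET.fromstring(grammar_xml)
    return [
        ref.get("special")
        for ref in root.iter("{%s}ruleref" % SRGS_NS)
        if ref.get("special") is not None
    ]
```

Running your recorded out-of-grammar utterances with and without such a rule shows whether it reduces false positives on the real commands.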
Working with Failure
Even a speech recognition failure should be a success
Handle misfires and repeats as part of the game
• NPCs can have headaches, migraines, or explain their misunderstanding
• “Sorry, what was that? I was thinking about sheep.”
Localization
Begin localization after most design decisions are locked down
• Iterate and design in your native language
• Begin before it’s too late to work with translators, manual, etc.
Be wary of text/UI translators
• Spoken language can differ from the written language
Recommend audio translators
• Leverage your existing in-game dialogue translation team
• They know the right voice to use for communication
• “See it! Say it!” implementations will need to be translated by this same team
Have native speakers testing
• More than one native speaker is always better
Localization
Provide plenty of background on the situation to the translator. The more info, the better.
You should be doing this for in-game dialogue already; your team’s localization expert will be able to provide guidance here.
Different languages may map one word to three words, or three words to one, so provide context for each situation
Remember to coordinate changes across languages
Kinectimals Speech Post-Mortem
“If I could talk with the animals…”
Kinectimals was standard-bearer for speech recognition at Kinect launch
Lofty goals: natural interaction with animal through speech
• Praise, issue commands, call animal by name
Ultimately delivered robust recognition for a reasonable command set
Animal naming most challenging component to implement
Goal: Perfect Feline Behaviour
Design
<grammar xml:lang="en-us" version="1.0" root="dash_commands">
  <rule id="dash_commands" scope="public">
    <one-of>
      <item>
        Hey, is this thing on? Xbox, can you hear me? Hey Jimmy! Come look at this! The Xbox understands me!
        <tag>
          exec "dash.xex /upgrade_to_gold_account /quiet"
        </tag>
      </item>
      <item>
        Oh Xbox, you’re my only friend - my girlfriend’s left me and no one understands me like you do.
        <tag>
          exec "halo_reach.xex"
        </tag>
      </item>
    </one-of>
  </rule>
</grammar>
Design Giveth…
Game design is our friend
• No expectation of animals understanding speech perfectly
• Player more forgiving of incorrect or failed recognition
• Children interpreted failed recognition as animal “being naughty”
…Design Taketh Away
Design is our enemy
• Familiar situation produces habitual response
• Expectation that what a real animal responds to, the game will respond to
• Commands framed with non-essential vocal noise: “Hey Skittles, sit down, please”
• Speech commands often mode-less: where to allow / disallow them?
Don’t Both Talk at Once
Narrator character introduced late in design
• Gave instruction on gestures and speech commands to use
• Narrator saying “Sit down” often made animal sit down
Specific hardware can help with this
• Kinect has array microphone with Multichannel Echo Cancellation (MEC)
• Effectiveness dependent on microphone calibration
Better to avoid issue altogether if possible
• Disable speech recognition while narrator speaking
• Watch out for NPC speech triggering commands during gameplay
• Example: team-mates shouting “Take cover!”
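A simple way to avoid the game triggering its own commands is to ignore recognition results while game dialogue is playing. The sketch below gates results on the game side rather than stopping the engine; that is an assumption made for illustration, and the `SpeechGate` class and its method names are hypothetical.

```python
class SpeechGate:
    """Drops recognition results that arrive while the game itself is talking."""

    def __init__(self):
        # Number of narrator/NPC voice lines currently playing.
        self._game_voices_playing = 0

    def voice_started(self):
        """Call when a narrator or NPC dialogue line starts playing."""
        self._game_voices_playing += 1

    def voice_finished(self):
        """Call when that dialogue line finishes."""
        self._game_voices_playing = max(0, self._game_voices_playing - 1)

    def accept(self, command):
        """Return the command if it should be acted on, otherwise None."""
        if self._game_voices_playing > 0:
            return None
        return command
```

A counter rather than a flag lets overlapping lines (narrator plus a shouting team-mate) keep the gate closed until the last one ends.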
Implementation
Final Grammar
Most complex command grammar:
• Concurrent detection of 16 different phrases
• Mapped to 9 distinct commands (“Sit” equivalent to “Sit down”)
• Name recognition also running
Some state-based selection of different grammars
• However this was worst-case scenario (most rules active)
Manually specifying phonemes for a given rule can help increase recognition accuracy
• May be needed for proprietary or game-specific terms like character names
• Built-in text-to-phoneme rules may not work well in these cases
<rule id="reserved" scope="public">
  <one-of>
    <item>
      <token sapi:display="Kinect" sapi:pron="K IH N EH K T"> kinect </token>
    </item>
  </one-of>
</rule>
Playing <tag>
<tag> element allows a single semantic to be associated with multiple utterances
• Also provides language invariance
Great way to encode per-rule data
• Accept confidence threshold, for example
Parsed at run-time, so don’t go overboard
<item>
  <one-of>
    <item> sit </item>
    <item> sit down </item>
  </one-of>
  <tag>Sit</tag>
</item>
<item>
  <one-of>
    <item> go play <tag>conf=0.45</tag> </item>
  </one-of>
  <tag>Dismiss</tag>
</item>
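To make the role of the tags concrete, the sketch below reads the two items above and builds a table from phrase to semantic and accept threshold. It is an offline illustration only: a real engine would hand the tag contents back with each recognition result. The enclosing `<one-of>` wrapper, the default threshold of 0.6, and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

DEFAULT_CONFIDENCE = 0.6  # hypothetical default accept threshold

# The two items from the slide, wrapped in one <one-of> so they parse as a unit.
RULE_XML = """\
<one-of>
  <item>
    <one-of>
      <item> sit </item>
      <item> sit down </item>
    </one-of>
    <tag>Sit</tag>
  </item>
  <item>
    <one-of>
      <item> go play <tag>conf=0.45</tag> </item>
    </one-of>
    <tag>Dismiss</tag>
  </item>
</one-of>
"""


def build_command_table(rule_xml):
    """Map each phrase to (semantic, accept threshold) using the <tag> data."""
    table = {}
    root = ET.fromstring(rule_xml)
    for command in root.findall("item"):
        # The outer <tag> carries the semantic shared by all phrases.
        semantic = command.find("tag").text.strip()
        for phrase_item in command.find("one-of").findall("item"):
            phrase = (phrase_item.text or "").strip()
            threshold = DEFAULT_CONFIDENCE
            # An inner <tag> of the form conf=X overrides the threshold.
            tag = phrase_item.find("tag")
            if tag is not None and tag.text.strip().startswith("conf="):
                threshold = float(tag.text.strip()[len("conf="):])
            table[phrase] = (semantic, threshold)
    return table


def interpret(table, phrase, confidence):
    """Return the semantic for a recognised phrase, or None if below threshold."""
    semantic, threshold = table[phrase]
    return semantic if confidence >= threshold else None
```

Because game code only ever sees `Sit` or `Dismiss`, the spoken phrases can be translated or extended without touching the game logic.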
Please Stop Talking
Speech is unpredictable
• Valid utterances may vary widely in length
• Background noise may end up being processed for recognition
Changing state of Speech Recognition engine may incur unexpected synchronization delays
• Can occur when stopping recognition, changing rule states or loading new grammars
Bugs can become highly context-sensitive
• May see occasional frame-outs when tested in noisy open-plan area, but not when tested in closed office
Easiest option: run all game-side speech processing on separate thread
• Move off the h/w thread that the main game is using
• Speech will typically not saturate a core
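The separate-thread approach can be sketched as a worker that owns all slow speech handling and hands finished commands to the game loop through a queue. Python is used here only to show the structure; a console title would use a native thread on another hardware thread. The `SpeechWorker` class and its method names are hypothetical.

```python
import queue
import threading


class SpeechWorker:
    """Runs game-side handling of recognition results off the main game thread."""

    def __init__(self, handle_result):
        self._handle_result = handle_result  # potentially slow callback
        self._pending = queue.Queue()        # raw results from the engine
        self._commands = queue.Queue()       # finished commands for the game loop
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, raw_result):
        """Called from the engine callback; returns immediately."""
        self._pending.put(raw_result)

    def stop(self):
        """Ask the worker to finish queued work, then wait for it to exit."""
        self._pending.put(None)  # None is the shutdown sentinel
        self._thread.join()

    def _run(self):
        while True:
            raw_result = self._pending.get()
            if raw_result is None:
                break
            command = self._handle_result(raw_result)
            if command is not None:
                self._commands.put(command)

    def poll(self):
        """Called once per frame by the game loop; never waits on speech."""
        drained = []
        while True:
            try:
                drained.append(self._commands.get_nowait())
            except queue.Empty:
                return drained
```

The game loop only ever calls `poll`, so a stall in the speech path delays a command by a frame or two instead of dropping the frame.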
Name Your Animal (NYA)
Allow player to speak name they want to use
• No attempt to turn spoken name into real text (for display)
• Instead use a pictorial (camera capture) representation for identification
Implemented as free form speech to phoneme conversion
• Then use phonemes to build a grammar rule with custom pronunciation
Name used to attract animal’s attention, just as it would be in real life
Pushes the limits of NuiSpeech
NYA Challenges
Used a special grammar for speech to phoneme conversion
• Much larger than normal command grammars
• 11.5MB for largest NYA grammar vs. 5KB for largest command grammar
• Also requires a dynamic grammar to add the “name” rule to
• So even more memory for the acoustic model
Much more sensitive to environmental noise than the normal speech commands
• Naming process would sometimes drive itself to completion from noise in the room
Watched for some reserved terms (“Kinect”, “Xbox”), no attempt to catch swearing etc.
• Space of potentially prohibited terms simply too large
Reject names that are too long as difficult for the player to repeat successfully
NYA Flow
One utterance unlikely to be sufficient to get the “right” name
• Allow a number of attempts to successfully repeat name
• Hopefully deals with player trying to mislead the system
If no repeat in sensible number of attempts, prompt player to try different name
• Try to avoid player getting stuck trying to repeat “problem” name
Balance ease of use when player is using the system “correctly” against rejecting noise as a naming attempt
“CH I Z AX N” generated
“CH I T AX” ideal string
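The talk does not say how a repeated name was judged to match, so the sketch below is purely an illustration of the repeat-to-confirm flow. It assumes phoneme strings are compared with an edit distance and that a name is accepted once two consecutive attempts are close enough. The function names and the default limits (`max_attempts`, `max_distance`, `max_phonemes`) are hypothetical.

```python
def phoneme_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    previous = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        current = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            current.append(min(
                previous[j] + 1,         # delete pa
                current[j - 1] + 1,      # insert pb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def confirm_name(attempts, max_attempts=3, max_distance=1, max_phonemes=10):
    """Return the accepted phoneme list, or None to prompt for a different name."""
    candidate = None
    for phonemes in attempts[:max_attempts]:
        if len(phonemes) > max_phonemes:
            continue  # too long to repeat reliably; ignore this attempt
        if candidate is not None and phoneme_distance(candidate, phonemes) <= max_distance:
            return candidate  # the player repeated the name closely enough
        candidate = phonemes  # otherwise the latest attempt becomes the candidate
    return None
```

With the strings from the slide, the generated "CH I Z AX N" and the ideal "CH I T AX" differ by one substitution and one deletion, so a tolerance of 1 would ask the player to repeat the name.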
NYA Internationalization
Separate speech to phoneme grammar for each NuiSpeech language
NYA accuracy varies across languages
US English NYA works well for languages other than English
• Tested in 11 additional countries
• Allowed us to support NYA in countries that weren’t supported by NuiSpeech at launch
Testing
The Challenge of Testing Speech
Human beings very good at spotting patterns
• Even non-existent ones
Easy to find reasons why speech works better or worse
• “Speech works better when I wear a blue shirt!”
In reality, recognition strongly influenced by exact acoustic environment
So test with lots of people, and lots of different conditions
• Individual office vs. open plan
Look at whether player successfully completes tasks with speech
• Not just whether individual commands are recognised (too conservative)
• Watch out for commands that never seem to work however!
Make low-level speech success / failure events visible
• On-screen log is very useful
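An on-screen log can be as simple as a short ring buffer of recent recognition events that the debug overlay draws each frame. This is a minimal sketch; the `SpeechEventLog` class, its line format, and the default of eight lines are hypothetical.

```python
from collections import deque


class SpeechEventLog:
    """Keeps the most recent recognition events for a debug overlay."""

    def __init__(self, max_lines=8):
        # deque with maxlen discards the oldest line automatically.
        self._lines = deque(maxlen=max_lines)

    def record(self, phrase, confidence, accepted):
        """Store one low-level success / failure event."""
        status = "OK " if accepted else "REJ"
        self._lines.append(f"{status} {confidence:.2f} {phrase}")

    def lines(self):
        """Return the lines to draw, oldest first."""
        return list(self._lines)
```

Showing rejected events alongside accepted ones helps testers tell "the game ignored me" apart from "the engine never heard me".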
Heed the Advice of W. C. Fields
Never work with children or animals
• Kinectimals had both…
Recognition confidences for children inherently lower than for adults
• Can be self-conscious about “talking to the TV” leading to them not speaking clearly
• If they become frustrated, they may shout or do other things that make recognition worse, not better
• Tutor them through which speech commands to use, and how best to say them
Set confidence thresholds lower and accept some degree of False Accepts for adult speakers
• This can be difficult since your test / development team will get a worse experience
What We Learnt
Integration of speech recognition system straightforward (even with NYA)
• But testing hard and time-consuming!
Look at task completion, not purely at recognition accuracy
• Players will probably not notice occasionally having to repeat commands
Contrast issuing commands to the game, versus talking to an in-game character
• Issuing commands: small command set, but very high accuracy required
• Talking to character: more tolerant of failed recognition, but larger command set, or even natural language expected
Naming things via speech is hard
• You probably won’t have access to generic speech-to-text capabilities
• If you can, use text input to acquire the name and then add it dynamically as a grammar rule
• You may want a custom lexicon of common / difficult names to ensure correct phonemes used
Accept you may not be able to please everyone all the time
• Weight success towards your primary audience
Thank you to…
Xbox Platform Speech Team
Kinectimals Team at Frontier
No animals were harmed in the making of this game
• A few testers lost their voices however