
The Universal Speech Interface (USI) PDG Progress Report

Thomas Harris, Stefanie Tomko, Arthur Toth, James Sanders,

Alex Rudnicky, Roni Rosenfeld

School of Computer Science

Carnegie Mellon University

4 June 2003

Outline

• USI Project Summary
• USI Device Control
• USI User Studies
• Tech Transfer Initiative
  – USI Application Generator

Program Goals and Plan

• Overall program goal:
  – Design a universal (i.e., device-independent) interface for speech-based interaction with wearable and home devices
• Program plan & milestones:
  – Q1: analysis, interaction principles
  – Q2: build device-simulation environment
  – Q3: build first device prototype
  – Q4: initial user studies; development tools

Program Deliverables

• A novel universal design for speech-based interaction with wearable- and home-devices

• At least one demonstration system exemplifying the new interface

• A set of tools for rapid prototyping of compliant applications

The Universal Speech Interface (USI) in a Nutshell

• Unifying approach to human-machine speech communication
• Unified “look and feel” across all applications
  – analogous to the Xerox/Macintosh/Windows GUI look-and-feel
• Stylized, semi-natural interaction
  – analogous to the “Graffiti” alphabet for the Palm PDA

Existing Speech Paradigm 1: Command-and-control Systems

• Specialized language, optimized for a given application
  – each application has its own interface
• Intensive training of each user
• Daily use helps retain knowledge

Existing Speech Paradigm 2: Unconstrained Dialog Systems

• “Off-the-street” users, no training required
• System models existing human behavior
• But this comes at a cost:
  – each application requires a great deal of data, labor, human expertise
  – Speech Recognition technology is pushed to the limit
  – user does not easily grasp the application’s functional limits
    • Out-Of-Vocabulary words (OOV)
    • Out-Of-Domain concepts, requests

Is a Third Paradigm Needed?

• In practice, people are likely to use:
  – a handful of apps daily:
    • scheduler, contact manager, email, ...
  – many apps occasionally:
    • weather, restaurants, ...
• To exploit this, we need:
  – flexible, powerful interface for familiar applications
  – immediate engagement with occasional or new applications

Our Approach

• Identify application-independent universals:
  – user-side
  – machine-side
• Find suitable, general solutions
  – Human and machine meeting halfway
• Design a stylized, universal “look and feel”
• Teach it in 5 minutes

Universal Semantic primitives

• Help primitives
  – what can the machine do? how do I do X? what can I say?
• Speech channel primitives
  – detect & correct ASR errors; finished talking?
• Interaction primitives
  – turn taking; question answering; session management; undo
• Application primitives
  – environment variables: query, set
  – objects (e.g. lists): describe, navigate, create, modify, delete

USI Systems Developed

• Information Access
  – MovieLine
  – FlightLine
  – ApartmentLine
• Device Control
  – Stereo system
  – X-10 control (e.g., lights)
  – Alarm Clock applet
  – Digital Video Camera
  – Windows Media Player

USI Demonstration

• MovieLine
  – Experimental subject

USI Device Control

Device Interaction Analysis

• Analysis was done on multiple devices
  – alarm clock / radio
  – VCR
  – cell phone
  – MP3 player
  – memo pad / email / vmail
  – copier / fax

USI/Device Design Issues

• Confirmation strategy
• Error handling strategy
• Exploration
• Navigation
• Disambiguation / context mgmt
• Orientation
• Querying state variables

USI/Device Design Issues

• Confirmation strategy: restate-&-execute

• Error handling strategy: ignore

• Exploration: “OPTIONS”

• Navigation: use concept of ‘focus’

• Disambiguation / context mgmt: implicit

• Orientation: “STATUS”

• Querying state variables: “WHAT IS THE...?”
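The design choices above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's implementation. It assumes that restate-&-execute means echoing the understood command while carrying it out, with no separate confirmation turn, and that the "ignore" error strategy means unrecognized input produces no response. The `Device` class, the `respond` function, and the stereo's variables and values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy device: named state variables plus the values each one accepts."""
    name: str
    state: dict = field(default_factory=dict)
    choices: dict = field(default_factory=dict)


def respond(device: Device, utterance: str) -> str:
    """Return the system's reply to one already-recognized utterance."""
    words = utterance.lower().rstrip("?").split()
    if words == ["options"]:
        # Exploration: list what can be set on the device.
        return "options: " + ", ".join(sorted(device.choices))
    if words == ["status"]:
        # Orientation: report every state variable.
        return ", ".join(f"{k} {v}" for k, v in sorted(device.state.items()))
    if words[:3] == ["what", "is", "the"]:
        # Querying a state variable.
        variable = " ".join(words[3:])
        if variable in device.state:
            return f"{variable} {device.state[variable]}"
        return ""
    for variable, allowed in device.choices.items():
        prefix = variable.split()
        value = " ".join(words[len(prefix):])
        if words[:len(prefix)] == prefix and value in allowed:
            device.state[variable] = value  # execute ...
            return f"{variable} {value}"    # ... and restate
    return ""  # assumed "ignore" strategy: stay silent on anything else


# Hypothetical stereo with two state variables.
stereo = Device(
    "stereo",
    state={"mode": "tuner", "x-bass": "off"},
    choices={"mode": {"tuner", "auxiliary", "cd"}, "x-bass": {"on", "off"}},
)
print(respond(stereo, "OPTIONS"))
print(respond(stereo, "mode cd"))
print(respond(stereo, "WHAT IS THE MODE?"))
```

Keeping every reply in the same "variable value" shape as the command is one way to make the restatement double as a model of what the user can say next.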

Hooking up with the PUC project

• Fits within the PUC project’s vision of automatically generated interfaces with different modalities and form factors

• But, can also be used as a standalone speech interface

• Compatibility with visual design is desirable, but not always natural:
  – nameless states (speech interface must have name for everything!)
  – speech interface can have shortcuts (“MODE: CD” vs. “CD”)

Meshing with the PUC project

• Device capabilities specified by XML doc
• States vs. Action dichotomy of the visual interface does not always conform to speech interface intuition.

• For now, creating our own interface specification document

• Ultimately, will augment XML DTD, so both interfaces can co-exist

USI Device Control (a.k.a. James the Butler)

[Figure: command tree for James. James controls the Stereo and the digital camera... Stereo commands: on, off, x-bass, volume up, volume down, volume off, and (mode), which turns the stereo on. Modes: tuner, auxiliary, cd. Tuner: (radioband) am or fm, each with frequency... and station...; seek forward, seek backward. CD: (status) play, pause, stop; disc #; next track, last track; random..., repeat...]

Hardware hacking courtesy of the PUC project

USI Demonstration

• Device Control
  – Alarm Clock Example

User Studies

User study

• Compared Speech Graffiti (SG) & natural language MovieLines

• How does Speech Graffiti compare to a natural language interface?
  – Subjective user satisfaction
  – Task completion rates
  – Word error rates
• How well do users "get" Speech Graffiti?
  – How often do they speak within the grammar?
  – In what ways do they deviate from the grammar?

Subjective user satisfaction

• 17 of 23 preferred Speech Graffiti (SG)

[Figure: mean user satisfaction rating (scale 1 to 7) for NL-ML and SG-ML, by category: system response accuracy, likeability, cognitive demand, annoyance, habitability, speed, overall]

• SG user satisfaction ratings higher than NL in all categories

• SG ratings positive except in annoyance & habitability

Computer experience & training

• Computer Science / Engineering backgrounds and / or programming experience
  – Higher user satisfaction ratings
  – Better task completion rates
• Training in-domain vs. out-of-domain
  – No differences in user satisfaction or task completion rates

Task completion

• Overall
  – 67.9% SG tasks
  – 67.4% NL tasks
• Individual means
  – 5.43 of 8 SG tasks
  – 5.30 of 8 NL tasks

[Figure: mean task completion rate (axis 0 to 8 tasks), SG-ML vs. NL-ML]

Time-to-completion

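• Completed tasks
  – 67.9 seconds SG
  – 73.4 seconds NL
• Incomplete tasks:

[Figure: time, in seconds (axis 0 to 600), for SG-ML and NL-ML under “best case” and “real world” conditions, items 1 to 4; both panels are labeled “59 incompletes”. Extracted data labels, whose assignment to bars is not recoverable: 27.3, 43.5, 76.0, 23.0, 38.0, 103.8, (inc), 81.5, 34.0, (inc), 103.0, 28.0]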

Turns-to-completion

• Completed tasks
  – 8.2 turns SG
  – 3.9 turns NL
• Incomplete tasks:

[Figure: # of turns for SG-ML and NL-ML under “best case” and “real world” conditions, items 1 to 4; both panels are labeled “59 incompletes”. Individual values are not recoverable from the extracted text.]

Word error rates

• Very high for both systems
  – On "cleaned" set (on-task, non-noisy utterances)
• Concept error is lower for USI
  – SG: –29.2% from WER
  – NL: +0.8% from WER
• Low error rate is key to acceptance
  – 6 who preferred NL-ML had highest SG WER

            WER      # of utts   subj mean   subj median
SG Movie    35.1%    3626        35.0%       30.0%
NL Movie    51.2%    1854        50.3%       48.9%
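The slides do not say how WER was scored. As a reference point, the sketch below assumes the standard definition: substitutions, deletions, and insertions from a word-level edit-distance alignment, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative and not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions align i reference words with an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Two substitutions against a five-word reference.
print(word_error_rate("what movies are playing tonight",
                      "what movie is playing tonight"))
```

Because insertions count against a fixed reference length, WER under this definition can exceed 100% when the recognizer outputs many extra words.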

WER & user satisfaction

• Good correlation for SG

[Figure: user satisfaction rating (1 to 6) plotted against % word-error rate (0 to 80), in two panels: SG-ML and NL-ML]

How often do users speak within the Speech Graffiti grammar?

• Actually, pretty often!
  – mean 80.5%, median 87.4%

… and

• grammaticality leads to user satisfaction

[Figure: user satisfaction rating (1 to 7) plotted against % grammatical (0% to 100%)]

How do users deviate from the grammar?

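[Figure: pie chart of deviation types, listed here from largest to smallest share]

• general syntax: 20.6%
• slot only: 14.6%
• out-of-vocabulary word: 14.0%
• missing is/are: 11%
• keyword problem: 8.1%
• value only: 6.7%
• subject-verb agreement: 5.7%
• out-of-vocabulary concept: 5.1%
• disfluency: 4.3%
• more syntax: 4%
• plural+options: 2%
• endpoint: 1.6%
• time syntax: 1.3%
• value+options: 1%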

Future Interface Design Work

• Redesign Help facility
  – SG works best for those who "get it"
  – Current system provides no assistance to "clueless user"
• Error analysis
  – Compare failure cases in SG and NL interfaces
  – Compare user recovery attempts in SG and NL
• Address issues of generalizability
  – Promoting transparency of slot set and response sets
  – Accessing information sets rather than single items
• Adjust grammar components

Future Architecture Work

• Integrate current USI environments
  – Information Access
  – Device Control

• Improve interface between PUC and USI components

• Identify USI-specific techniques to achieve lower WER

• Improved documentation and distribution packaging

Tech Transfer Initiative

Tech Transfer Initiative

• Tools for creating new USI apps
  – 3 days to create a new application
  – prior exposure to speech technology highly beneficial
  – decided to further reduce the barrier: create an application generator

From 3 Days to a Few Hours

• A USI Application Generator
• New USI applications w/out programming!
• XML document fully specifies the application
  – slot names
  – accepted inputs
  – data types
  – slot properties
  – ...
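To make the idea of an XML-specified application concrete, here is a sketch of what such a document and its loader might look like. The slides do not show the actual USI schema, so every element name, attribute name, slot, and value in `EXAMPLE_SPEC` is hypothetical, as are the `Slot` class and the `load_slots` function. Only the four kinds of information (slot names, accepted inputs, data types, slot properties) come from the list above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical specification; the real USI document format is not shown
# in the slides, so this layout is an assumption for illustration only.
EXAMPLE_SPEC = """
<application name="ExampleLine">
  <slot name="title" type="string" queryable="true">
    <input>title</input>
    <input>movie</input>
  </slot>
  <slot name="show time" type="time" queryable="true"/>
</application>
"""


@dataclass
class Slot:
    name: str         # slot name
    datatype: str     # data type
    inputs: list      # accepted inputs (spoken forms that select the slot)
    properties: dict  # any remaining attributes, kept as slot properties


def load_slots(xml_text: str) -> list:
    """Parse the hypothetical spec into a list of Slot records."""
    root = ET.fromstring(xml_text.strip())
    slots = []
    for node in root.findall("slot"):
        attrs = dict(node.attrib)
        name = attrs.pop("name")
        datatype = attrs.pop("type", "string")
        inputs = [(child.text or "").strip() for child in node.findall("input")]
        slots.append(Slot(name, datatype, inputs, attrs))
    return slots


for slot in load_slots(EXAMPLE_SPEC):
    print(slot)
```

A declarative document like this is what lets a generator build a new application without programming: the runtime stays fixed and only the specification changes.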

From a Few Hours to 15 minutes?

• Created a Web interface for generating the XML document
• Form filling, pulldown menus
• Strong effort to further simplify the process, minimize complexity of form
  – many defaults
  – for less common choices, edit the XML doc
• More importantly, no computer savvy needed

• More importantly, no computer savvy needed

Web Application Generator

• Repository and tool for creating USI database applications

• Abundant online help to guide users through process

• Accessible to anyone with an Internet connection

Web Application Generator

• Two-step process:
  – General specification
  – Slot-by-slot specification
    • choose datatype from built-in list, or create own
• Fully featured system with save, copy, delete functionality
• Hides intricacies of XML document writing
• Advanced users have ability to further alter the final XML document

General Specification screen with help box displayed.

Web Application Generator

• Built-in generic voice; can record own voice
• DB backend
  – Postgres
  – Oracle
  – ODBC (including ASCII files)
  – Ultimately: web tables
• Platform:
  – originally: mixed Unix/Windows, telephone based
  – converted to: pure Windows, telephone or laptop

Transferring USI to PDG members

• We do house calls!
  – Carnegie Mellon will install USI developer environment for each interested member and will train member staff in the use of the developer environment
  – Provide a short tutorial on USI principles and interface design

Thank you!

Pittsburgh Digital Greenhouse