AudioSense: A Simulation Progress Report EECS 578 Allan Spale


Page 1: AudioSense: A Simulation Progress Report EECS 578 Allan Spale

AudioSense: A Simulation Progress Report

EECS 578

Allan Spale

Page 2

Background of Concept

• Taking the train home and listening to the sounds around me
• How would deaf people be able to perceive the environment?
• What assistance would be useful in helping people adapt to the environment?

Page 3

Project Goals

• Develop a CAVE application that will simulate aspects of audio perception
• Display the text of “speaking” objects in space
• Display the description text of “non-speaking” objects in space
• Display visual cues of multiple sound sources
• Allow the user to selectively listen to different sound sources

Page 4

Topics in the Project

• Augmented reality
  • Illustrated by objects in a virtual environment
• 3D sound
  • Simulated by an object’s interaction property
• Speech recognition
  • Simulated by text near the object
  • Will remain static during the simulation
• Virtual reality / CAVE
  • Method for presenting the project
  • Not discussed in this presentation

Page 5

Augmented Reality

• Definition
  • “…provides means of intuitive information presentation for enhancing situational awareness and perception by exploiting the natural and familiar human interaction modalities with the environment.”
    -- Behringer et al. 1999

Page 6

Augmented Reality: Device Diagnostics

• Architecture components aid in performing diagnostic tests
  • Computer vision used to track the object in space
  • Speech recognition (command-style) used for the user interface
  • 3D graphics (wireframe and shaded objects) to illustrate an object’s internal structure
  • 3D audio emits from an item, allowing the user to find its location within the object

Page 7

Augmented Reality

• Device diagnostics

Page 8

Augmented Reality

• Device diagnostics

Page 9

Augmented Reality: Device Diagnostics

• Summary
  • Providing 3D graphics and sound helps the user better diagnose items
  • Might also want text information on the display
  • Tracking methodology still needs improvement
  • Speech recognition of commands could be expanded to include annotation
  • Utilize an IP connection to distribute computing power from the wearable computer

Page 10

Augmented Reality: Multimedia Presentations in the Real World

• Mobile Augmented Reality System (MARS)
  • Tracking performed by the Global Positioning System (GPS) and another device
  • Display is see-through and head-mounted
  • Interaction based on location and gaze
  • Additional interaction provided by a hand-held device

Page 11

Augmented Reality: Multimedia Presentations in the Real World

• System overview
  • Selection occurs through proximity or gaze direction, followed by a menu system
  • Information presentation
    – Video (on the hand-held device) or images accompanied by narration (on the head-mounted display)
    – Virtual reality (for places that cannot be visited)
    – Augmented reality (to illustrate where items were)

Page 12

Augmented Reality

• Multimedia presentations in the real world

Page 13

Augmented Reality

• Multimedia presentations in the real world

Page 14

Augmented Reality: Multimedia Presentations in the Real World

• Conclusions
  • Current system is too heavy and visually undesirable
  • Might want to make the hand-held display a palmtop computer
  • Permit authoring of content
  • Create a collaboration between indoor and outdoor system users

Page 15

3D Sound: Audio-only Web Browsing

• Must overcome difficulties with utilizing 3D sound
  • X-axis sounds are identifiable; Y- and Z-axis sounds are not
• A need exists to create structure in audio-rendered web pages
  • Document reading appears spatially from left to right in an adequate amount of time
  • Utilize earcons and selective listening
  • Provide meta-content for a quick document overview

Page 16

3D Sound

• Audio-only Web browsing

Page 17

3D Sound: Audio-only Web Browsing

• Future work
  • Improve link information that extends beyond web page title and time duration
• Benefits of auditory browsing aids
  • Improved comprehension
  • Better browsing experience for visually impaired and sighted users

Page 18

3D Sound: Interactive 3D Sound Hyperstories

• Hyperstories
  • A story occurring in a hypermedia context
  • Forms a “nested context model”
  • World objects can be passive, active, static, or dynamic

Page 19

3D Sound: Interactive 3D Sound Hyperstories

• AudioDoom
  • Like the computer game Doom, but different
  • All world objects represented with sound
  • Sound represented in a “volume” almost parallel to the user’s eyes
  • User interacts with the world objects using an ultrasonic joystick with haptic functionality
  • Organized by partitioned spaces

Page 20

3D Sound

• Interactive 3D sound hyperstories

Page 21

3D Sound

• Interactive 3D sound hyperstories

Page 22

3D Sound: Interactive 3D Sound Hyperstories

• Despite elapsed time between sessions, users remembered the world structure well
• The authors illustrate the possibility of “render[ing] a spatial navigable structure by using only spatialized sound.”
• Opens possibilities for educational software for the blind within the hyperstory context

Page 23

Speech Recognition: Media retrieval and indexing

• Problems with media retrieval and indexing
  • Lots of media being generated; too costly and time-consuming to index manually
• Ideal system design
  • Speaker independence
  • Noisy-recording-environment capability
  • Open vocabulary

Page 24

Speech Recognition: Media retrieval and indexing

• Using Hidden Markov Models, the system achieved the results in Table 1
• To improve results, “using string matching techniques” will help overcome recognition-stream errors

Page 25

Speech Recognition: Media retrieval and indexing

• String matching strategy
  • Develop the search term
  • Divide the recognition stream into a set of sub-strings
  • Implement an initial filter process
  • “Identify edit operations for remaining sub-strings in [the] recognition stream”
  • Calculate the similarity measure for the search term and matched strings
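The slides do not spell out the similarity measure. A minimal Python sketch of the per-substring step, assuming a Levenshtein edit distance and a normalized score in [0, 1] (both my assumptions, not necessarily the paper's exact measure):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(term: str, substring: str) -> float:
    """Normalized similarity; 1.0 means an exact match."""
    if not term and not substring:
        return 1.0
    return 1.0 - edit_distance(term, substring) / max(len(term), len(substring))

def match_stream(term: str, stream: list[str], threshold: float = 0.6):
    """Score each sub-string of the recognition stream against the search
    term, keeping those above a (hypothetical) similarity threshold."""
    scored = [(s, similarity(term, s)) for s in stream]
    return [(s, round(score, 2)) for s, score in scored if score >= threshold]
```

The initial filter process from the slide would run before this scoring to discard obviously unrelated sub-strings; it is omitted here.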

Page 26

Speech Recognition

• Media retrieval and indexing

Page 27

Speech Recognition: Media retrieval and indexing

• Results of implementing the string matching strategy
  • Permitting more edit operations improved recall but degraded precision
• Despite low performance rates, a system performing these tasks will be commercially viable

Page 28

Speech Recognition: Continuous Speech Recognition

• Problems with continuous speech recognition
  • Has unpredictable errors, unlike other “predictable” user input errors
  • The absence of context aids makes recognition difficult for the computer
  • Speech user interfaces are still at a developmental stage and will improve over time

Page 29

Speech Recognition: Continuous Speech Recognition

• Two modes
  • Keyboard-mouse and speech
• Two tasks
  • Composition and transcription
• Results
  • Keyboard-mouse tasks were faster and more efficient than speech tasks

Page 30

Speech Recognition: Continuous Speech Recognition

• Correction methods
  • Two general correction methods
    – Inline correction, separate proofreading
  • Speech inline correction methods
    – Select text and reenter, delete text and reenter, use a correction box, correct problems during correction

Page 31

Speech Recognition

• Continuous speech recognition

Page 32

Speech Recognition

• Continuous speech recognition

Page 33

Speech Recognition: Continuous Speech Recognition

• Discussion of errors
  • Inline correction is preferred by users regardless of modality
  • Proofreading saw increased usage with speech because of unpredictable system errors
  • Keyboard-mouse correction involved deleting and reentering the word
  • Despite the ability to correct inline with speech, errors typically occurred during correction
  • Dialog boxes were used as a last resort

Page 34

Speech Recognition: Continuous Speech Recognition

• Discussion of results
  • Users still do not feel that they can be productive using a speech interface for continuous recognition
  • More studies must be conducted to improve the speech interface for users

Page 35

Project Implementation

• Write a CAVE application using YG
  • 3D objects simulate sound-producing objects
  • No speech recognition will occur, since predefined text will be attached to each object
  • Objects will move in space
  • Objects will not always produce sound
  • Objects may not be in the line of sight
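No YG code appears in the slides, so as a language-neutral sketch (Python here, with hypothetical names and fields), the object behaviors listed above might be modeled as:

```python
import random
from dataclasses import dataclass

@dataclass
class SimObject:
    """One simulated sound-producing object in the CAVE scene.
    Field names are illustrative; the slides only specify the behaviors:
    attached static text, movement, intermittent sound, and positions
    possibly outside the user's line of sight."""
    text: str        # predefined text standing in for speech recognition
    position: list   # [x, y, z] in CAVE coordinates
    velocity: list   # per-step movement
    sounding: bool = True  # objects will not always produce sound

    def step(self, dt: float, silence_prob: float = 0.1) -> None:
        """Advance the object's position and randomly toggle its sound."""
        self.position = [p + v * dt for p, v in zip(self.position, self.velocity)]
        if random.random() < silence_prob:
            self.sounding = not self.sounding
```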

Page 36

Project Implementation

• Write a CAVE application using YG
  • Sound location
    – Show directional vectors for each object that emits a sound
    – The longer the vector, the farther the object is from the user
    – X and Y will use arrowheads; Z will use a dot / "X" symbol
    – A dot marks an object behind the user; an "X" symbol marks an object in front of the user
    – Only visible if the sound can be “heard” by the user
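The cue described above could be computed roughly as follows. This is a Python sketch only: the scaling factor and the assumption that the user faces the +Z direction are mine, not the slides'.

```python
import math
from dataclasses import dataclass

@dataclass
class SoundCue:
    direction_xy: tuple  # unit vector in the display plane (arrowhead direction)
    length: float        # longer vector = farther-away source
    depth_symbol: str    # "X" = source in front of the user, "." = behind
    visible: bool        # only drawn if the sound is audible

def sound_cue(user_pos, source_pos, audible: bool, scale: float = 0.1) -> SoundCue:
    """Build the visual cue for one sound source, per the slide's design:
    vector length encodes distance, arrowheads cover X/Y, and a dot or
    "X" symbol encodes the Z direction relative to the user."""
    dx = source_pos[0] - user_pos[0]
    dy = source_pos[1] - user_pos[1]
    dz = source_pos[2] - user_pos[2]
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    norm = math.hypot(dx, dy) or 1.0  # avoid dividing by zero straight ahead
    return SoundCue(
        direction_xy=(dx / norm, dy / norm),
        length=dist * scale,                     # farther -> longer vector
        depth_symbol="X" if dz >= 0 else ".",    # assumes user faces +Z
        visible=audible,
    )
```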

Page 37

Project Implementation

• Write a CAVE application using YG
  • Sound properties
    – Represented using a square
    – Size represents volume/amplitude (probably will not consider distance effects on volume)
    – Color represents pitch/frequency
    – Only visible if the sound can be “heard” by the user
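A sketch of the square mapping. The 0-1 amplitude range and the blue-to-red colormap are assumptions; the slides fix neither.

```python
import math

def square_glyph(amplitude: float, frequency_hz: float,
                 max_size: float = 1.0) -> dict:
    """Map a sound's properties onto the square described on this slide:
    size encodes volume/amplitude, color encodes pitch/frequency."""
    amp = min(max(amplitude, 0.0), 1.0)  # clamp to an assumed [0, 1] range
    # Map 20 Hz - 20 kHz logarithmically onto 0-1: low pitch = blue, high = red.
    f = min(max(frequency_hz, 20.0), 20000.0)
    t = math.log(f / 20.0) / math.log(1000.0)
    return {
        "size": amp * max_size,
        "color": (t, 0.0, 1.0 - t),  # (r, g, b)
    }
```

A distance falloff on `amplitude` could be added later, but per the slide it is probably out of scope.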

Page 38

Project Implementation

• Write a CAVE application using YG
  • Simulate the “cocktail party effect”
    – Allow the user to enlarge text from an object that is far away
    – Provide a configuration section to ignore certain sound properties
      – Volume/amplitude
      – Pitch/frequency
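The configuration section above could be modeled as below; thresholds, field names, and defaults are hypothetical, since the slides only say that volume/amplitude and pitch/frequency can be selectively ignored.

```python
from dataclasses import dataclass

@dataclass
class FilterConfig:
    """User-configurable listening filter for the simulated
    cocktail party effect."""
    ignore_amplitude: bool = False
    ignore_frequency: bool = False
    min_amplitude: float = 0.2        # hypothetical audibility cutoff
    band_hz: tuple = (300.0, 3400.0)  # hypothetical frequency band of interest

def is_heard(amplitude: float, frequency_hz: float, cfg: FilterConfig) -> bool:
    """Decide whether a sound source's cue should be shown to the user.
    Ignoring a property means it can no longer exclude a source."""
    if not cfg.ignore_amplitude and amplitude < cfg.min_amplitude:
        return False
    if not cfg.ignore_frequency:
        low, high = cfg.band_hz
        if not (low <= frequency_hz <= high):
            return False
    return True
```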

Page 39

Project Tasks Completed

• Basic project design
• Have read some documentation about YG
• Tested functionality of YG in my account
• Established contacts with people who have programmed CAVE applications using YG
  • They will provide 3D models and code that demonstrate some functionalities of YG features upon request
  • They will help with answering questions and demonstrating and explaining features of YG

Page 40

Project Timeline

• Week of March 25
  • Practice modifying existing YG programs
  • Collect needed 3D models for the program
• Week of April 1
  • Code objects and their accompanying text
  • Implement movement patterns for objects

Page 41

Project Timeline

• Week of April 8
  • Attempt to “turn on and off” the sound of objects
  • Work with interaction properties of objects that will determine visualizing sound properties
• Week of April 15
  • Continue working on visualizing sound properties
  • Work on “enlarging/reducing” the text of an object

Page 42

Project Timeline

• Week of April 22
  • Create simple sound filtering menus
  • Test the program in the CAVE
• EXAM WEEK: Week of April 29
  • Practice presentation
  • Present project

Page 43

Bibliography

Behringer, R., Chen, S., Sundareswaran, V., Wang, K., and Vassiliou, M. (1998). A Novel Interface for Device Diagnostics Using Speech Recognition, Augmented Reality Visualization, and 3D Audio Auralization, in Proceedings of IEEE International Conference on Multimedia Computing and Systems Vol I, Institute of Electrical and Electronics Engineers, Inc., 427-432.

Goose, S. and Moller, C. (1999). A 3D Audio Only Interactive Web Browser: Using Spatialization to Convey Hypermedia Document Structure, in Proceedings of the seventh ACM international conference on Multimedia (Orlando FL, October 1999), ACM Press, 363-371.

Page 44

Bibliography

Hollerer, T., Feiner, S., and Pavlik, J. (1998). Situated Documentaries: Embedding Multimedia Presentations in the Real World, in Proceedings of the 3rd International Symposium on Wearable Computers (October 1999, San Francisco CA), Institute of Electrical and Electronics Engineers, Inc., 1-8.

Karat, C.-M., Halverson, C., Horn, D., and Karat, J. (1999). Patterns of Entry and Correction in Large Vocabulary Continuous Speech Recognition Systems, in CHI '99, Proceeding of the CHI 99 conference on Human factors in computing systems: the CHI is the limit (Pittsburgh PA, May 1999), ACM Press, 568-575.

Page 45

Bibliography

Lumbreras, M., Sanchez, J. (1999). Interactive 3D Sound Hyperstories for Blind Children, in CHI '99, Proceeding of the CHI 99 conference on Human factors in computing systems: the CHI is the limit (Pittsburgh PA, May 1999), ACM Press, 318-325.

Robertson, J., Wong, W. Y., Chung, C., Kim, D. K. (1998). Automatic Speech Recognition for Generalised Time Based Media Retrieval and Indexing, in Proceedings of the sixth ACM international conference on Multimedia (Bristol UK, September 1998), ACM Press, 241-246.