design decisions for interactive environments: evaluating ...€¦ · ers designing such...

Design Decisions for Interactive Environments: Evaluating the KidsRoom

Aaron Bobick, Stephen Intiile, Jim DavisFreedom Baird, Claudio Pinhanez

Lee Campbell, Yuri Ivanov, Arian Schiitte, Andrew Wilson

MIT Media Laboratory20 Ames Street

Cambridge, MA 02139kidsroom@media, mit. edu

Abstract

We believe the KidsRoom is the first multi-person,fully-automated, interactive, narrative environmentever constructed using non-encumbering sensors. Theperceptual system that drives the KidsRoom is outlinedelsewhere(Bobick et al. 1996). This paper describesour design goals, successes, and failures including sev-eral general observations that may be of interest toother designers of perceptually-based interactive en-vironments. I

Introduction and GoalsWe are investigating the technologies required to buildperceptually-based interactive and immersive spaces --spaces that respond to people’s actions in a real, physicalspace by augmenting the environment with graphics, video,sound effects, light, music, and narration.

In this paper we describe observations made as we con-structed and tested one such space: the KidsRoom. Ourhope is that the lessons we learned will be valuable to oth-ers designing such perceptually-controlled, interactive andimmersive spaces.

The initial goal of our group was the construction ofan environment that would demonstrate various computervision technologies for the automatic recognition of actionby machine. As we began to enumerate the design criteriafor this project it became clear that this effort was going tobe an exploration and experiment in the design of interactivespaces. The goals that shaped our choice of domain werethe following:

¯ To keep action in the physical space.

¯ To use vision-based remote sensing to encourage unen-cumbered, natural interaction.

¯ To construct a system that worked effectively with mul-tiple people.

¯ To reduce the brittleness of perceptual sensing by allow-ing the system to both use and manipulate context.

]A demonstration of the project, including videos, im-ages, and sounds from each part of the story is availableat http://vismod.www.media.mit.edu/vismod/demos/kidsroomand complements the material presented here.

¯ To create a space that was immersive, engaging, and fully-autonomous and that could be interacted with naturally,without prior instruction.

The idea of constructing a children’s "imaginationplayspace" immediately addressed most of these concerns:multiple children being active in an engaging activity. Thegoal of exploiting and controlling context was accommo-dated by embedding the experience in a narrative envi-ronment, where there was a natural storyline to drive thesituation.2 Finally, by building a large scale room, we couldinsure an immersive experience and provide the necessaryviewing conditions to allow the use of visual sensing.

The playspace, story, and interactivityThe KidsRoom re-creates a child’s bedroom. The space is24 x 18 feet with a wire-grid ceiling 27 feet high. Twoof the bedroom walls resemble the real walls in a child’sroom, complete with real furniture, posters, and windowframes. The other two walls are large video projectionscreens, where images are back-projected from outside ofthe room. Behind the screens is a computer cluster with sixmachines that automatically control the room. Computer-controlled theatrical colored lights on the ceiling illuminatethe space. Four speakers, one on each wall, project direc-tional sound effects and music into the space. Some effectsare particularly loud and can vibrate the floor. Finally, thereare three video cameras and one microphone installed. Theroom contains several pieces of real furniture including amovable bed, which is used throughout the story. Figure 1shows a view of the complete KidsRoom installation. TheKidsRoom experience lasts 10-12 minutes, depending uponhow the participants act in the room, and it was designedprimarily for children ages 6-10 years old. Throughout thestory, children interact with objects in the room, with oneanother, and with virtual creatures projected onto the walls.The actions and interactions of the children drive the nar-rative action forward. And, most importantly, the childrenare aware that the room is responsive.

The KidsRoom was inspired by children’s stories inwhich children are transported from their bedrooms to mag-

2An example of a non-narrative space would be a kitchen thatwatches what people do and responds by offering advice.

From: AAAI Technical Report SS-98-02. Compilation copyright © 1998, AAAI (www.aaai.org). All rights reserved.

Figure 1: The KidsRoom is a 24 by 18 foot space constructedin our lab. This image shows the entrance to the space, the twoprojection screens, the computer cluster outside the space, and theposition of two of the cameras used for sensing.

ical places. Here we briefly describe the experience (see(Bobick et al. 1996) for a detailed description of the worldsand the interaction of participants).

In the first world, children enter the bedroom being toldonly to "ask the furniture for the magic word." A trackingalgorithm(Intille, Davis, & Bobick 1997) is used to deter-mine when children are near pieces of furniture, each whichhas a different "personality" For example, a chest says,"Aye, matey, I’m the pirate chest. I don’t know the magicword, but the frog on the rug might know." The childrenthen run to the rug with the frog painted on it, and the "frogrug" may send send them to yet another piece of furniture.If the children get confused and don’t go to the correct place,the system will respond by having some furniture charactercall the children over, "Hey, over here, it’s me, the yellowshelf!" Eventually, the children discover the magic word.A mother’s voice breaks in, silencing the furniture voices,and tells the kids to stop making noise and to go to bed.When they do, the lights drop down, and a spot on one wallis highlighted, which contains the image of a stuffed mon-ster doll. The monster starts blinking and speaks, askingthe children to loudly yell the magic word to go on a bigadventure.

After the kids yell the magic word loudly, 3 the roomdarkens and the transformation occurs. As colored-lightsflash and mysterious music plays, images on the walls fadeto images of a cartoon fantasy forest land. Simultaneously,a deep-voiced narrator -- the voice of the room -- says,"Welcome to the KidsRoom. It’s not what it seems. Whatyou might see here are things dreamt in your dreams." Thenarrator speaks in rhyme throughout the story.

As the lights come up, the narrator tells the childrenthat they are in the "forest deep," that monsters are near,and that they must follow the path to the river. A stone-path is marked on the floor of the room and the childrenquickly assume that is the path they should follow. Theroom provides encouraging narration if they don’t do so,

3There is no speech recognition capability, with only the volumeof sound being measured. In no run of the KidsRoom did thechildren ever yell anything but the magic word.

and instructs them to to stay in a group and to remain onthat path. If they deviate from the path "hints" (discussedbelow) are given to induce the behavior. Later, the childrenmust hide behind the bed from roaring monsters.

After a short walk, the children reach the river world. Oneview shows the river progressing forward and the other viewshows the moving riverbank. In this world the children geton the bed and control it to avoid obstacles in the river bymaking rowing motions, recognized by a motion detectionalgorithm(Bobick et al. 1996) as shown in Figure 2-a. Thecontext of the story is used to control the interaction. Forinstance, if someone gets off the boat, a splashing soundis heard and the narrator yells, "Passenger overboard" andencourages the child to get back on the bed.

Eventually, the children reach the monster world. In thisworld they land the "boat" by pushing the bed and thenstill-frame animated creatures teach the children four dancemoves. The monsters, shown in Figure 2b, are larger thanthe children and have a friendly, goofy, cartoon look. Whenthe appear, the kids must yell to quiet them. The kids learnthe dance moves as they stand on rugs in the room. Com-puter vision action recognition algorithms(Davis & Bobick1997; Bobick et al. 1996) are used to recognize when chil-dren do the moves, and the characters give feedback like"Yo! Kid on the red rug, you dance like a pro!" Eventually,if the children do one of the four moves they learned, themonsters will copy (see Figure 2-b). The story ends with themother once again breaking in and telling the kids to go tobed, at which time the room transforms back to a bedroom(see Figure 2-c).

Achieving project goals

We review the goals presented in the first section of this pa-per, considering not only how well the goals were achievedbut also the implication those goals had on the develop-ment of perceptual algorithms and the overall success of theproject.

Real action, real objects

One of our primary goals was to construct an environmentwhere action and attention was focused primarily in the realphysical environment, not on the virtual screen environment.We wanted the room to be a physical space, where childrencould play much as they do in real life, only within a richenvironment that would watch what they do and respond innatural ways.

We believe the KidsRoom achieves this goal. Childrenare typically active when they are in the space, running fromplace to place, dancing, and acting out rowing and exploringfantasies. The kids are interacting with each other as muchas they are interacting with the virtual objects, and theirexploration of the real space and the transformation of realobjects (e.g. the bed) adds to the excitement of the play.When action on the screens captures the attention of thechildren, the children are still focused partly on the realspace as they perform large body actions and observe andcommunicate with other children.

(a) (b) (c)

Figure 2: (a) A child and mother row the boat together. (b) Children dancing with monsters. (c) Children "going to

We were only partially successful in the use of real objectsto enhance the experience. The often, but only, manipulatedobject in the KidsRoom is the bed; it can be rolled around,jumped over, and hid behind, and is a critical part of thenarrative. However, two major obstacles ~ tracking andnarrative control -- prevented us from incorporating moreobjects into the room.

The KidsRoom tracking algorithm(Intille, Davis, & Bo-bick 1997) sets a limit of four people and one bed in thespace because when more participants or large objects arepresent the space becomes visually cluttered. Small objects,conversely, are difficult to track because they are easily oc-cluded and often have small visual cues. Better techniquesor even paradigms are needed to track multiple people andobjects in small enclosed spaces such as typical householdand office environments where there can be hundreds ofmanipulatable objects in just a few cubic feet of space.

The second and perhaps more serious obstacle is narrativecontrol. As more objects are added to a space, the behav-ior of the people in the space will become less predictablesince the number of ways in which objects may be used (ormisused) increases. Further, objects become more likelyto interfere with perceptual routines. For the room to ade-quately model and respond to all of these scenarios it willrequire both especially clever story design and tremendousamounts of narration and control code. The more person-object interactions that the system fails to handle in a naturalway, the less immersive and sentient the entire system feels.

Remote visual sensingIn the KidsRoom there are no incumbrances or requirementsof people who enter the space except that they must enterone at a time. Cameras and a single, stationary micro-phone are the only forms of input - and all are completelynon-intrusive. Remote sensing facilitates the activity re-maining in the physical space, not requiring users to wearsensors, head-mounted displays, earphones, microphones,or specially-colored clothing.

Why not use sensors embedded in objects in the space?For example, we might have used touch sensors on furnitureand in the floor. First, these types of sensors do not allowthe system to observe and interpret the physical interactionof and between the people in the space. Second, physicallywiring every object in a complex space with several differ-ent devices makes modification to the space and objects a

hardware procedure, increasing experiment cycle time andexpense. A single camera with the appropriate visual pro-cessing software may be able to do the job of thousandsof distributed sensors. Finally, even if every device in aroom is wired with sensing devices, the difficult problem ofunderstanding what is happening in the space has not beensolved. Vision sensing can provide both the global and localinformation required to interpret activities.

Further, occupants in the room (particularly young chil-dren) are not even aware of how the room is sensing theirbehavior. There are no obvious sensors in the room embed-ded in any objects or the floor. The cameras are positionedhigh above the space, well out of the line-of sight and visi-ble only if someone is looking for them. This increases themagical nature of the room for all visitors, especially forchildren. They are not pushing buttons or sensors, they arejust being themselves, and the room is responding.

As a practical implication, using visual sensing meansthat adding a new element to the story that requires a newaction to be sensed is not a hardware operation. One in-stalled, the sensing infrastructure remained stable eliminat-ing potentially combinatorially difficult networking issues.

Multiple, collaborating peoplePrevious work in computer vision and fully-automated inter-active spaces has primarily considered environments con-taining only one and sometimes two people(Maes et al.1994). However, immersive spaces for living environmentsor office environments will require systems that understandthe interaction of multiple people. Further, the focus of at-tention in a room is more likely to remain in the physicalspace if there are multiple people in the environment since, ifunencumbered by head-mounted displays, people will nat-urally communicate with each other about the experience asit takes place, and they will watch and mimic one another’sbehavior. This is clearly the case in the KidsRoom.

Since self-consciousness seems to decrease as group sizeincreases, the kind of role-playing encouraged by the Kids-Room is most natural and fun with a group. Much of thefun is being in the room with other people and "playing"together. For instance, during the rowing scene, childrenshout to one another about what to do, how fast to row,and where to row and play-act together as they hit virtualobstacles.

However, being responsive to the actions and interactions

of multiple people required greater use of context specifictechniques. The more people that participate in an activity,the less likely one can accurately describe the appearanceof that action. However, cues that are indicative of behaviorcan be safely used within a context. An example in theKidsRoom is rowing, when it is possible to use context toenforce constraints needed to recognize a complex groupactivity (i.e. rowing) with a simple global motion detectionalgorithm. Without using the context of the story to en-courage certain behavior (e.g. getting everyone on the bed)and allow the use of context-selected perceptual methodsthe multi-person activity would be much more difficult torecognize.

Multiple people also increases the complexity of visualsensing because of the presence of visual occlusions. Asmore people enter the space, some will block the view ofothers from particular cameras. In the monster world of theKidsRoom, a large space was required to get a non-occludedview of just two children. The KidsRoom does all of itsrecognition of group activity using positional informationfrom the overhead camera view in which occlusion doesnot occur. However, there are currently no mechanismsthat allow recognition of large-body group activity such astwo children dancing holding hands and two people shakinghands.

In addition to complicating visual tasks, the presence ofmultiple people increases the narrative complexity. To par-ticipants, the room appears interested in group activity: "Iseveryone on the bed? .... Is everyone in a group? .... Is ev-eryone rowing on the correct side? .... Is everyone on thepath?" However, the more people present the greater thelikelihood one person will engage in an unpredictable orinappropriate manner. The narrative control needs to antic-ipate such situations and be able to respond non-repetitivelywhen necessary.

Exploiting and controlling context

Computer vision techniques are still brittle. Given the diffi-culty of designing robust perceptual systems for recognizingaction in complex environments, we strove to use narrativeto provide context for the vision algorithms. Most of thevision algorithms are dependent upon the story to provideconstraint, for instance the boat rowing mentioned in theprevious section.

Another example is the monster dance scene where thestory ensured that each camera has a non-occluded view ofa child performing actions. Potentially interfering childrenare cajoled by the monsters to stand in locations (on coloredrugs) that do not interfere with the sensing. The advantageof an active system over that of a monitoring situation is theopportunity to not only know the context but to control it aswell (see (Bobick et al. 1996) for other examples).

Immersion, story, and consistencyWe wanted an environment that was truly immersive, com-pletely perceptually and cognitively engaging, and whereparticipants did not need to ask for outside help. The ex-perience should be compelling in the sense that the users

should be more concerned with their own actions and be-haviors than how the interactive system works. Finally, itwas desired that the system not break suspension of disbeliefby doing something either inappropriate (e.g. encouragingusers to stand still even though they are not moving) orovertly computer-like (repeating identical warnings or ad-monishments many times).

The immersive power of a compelling storyline cannotbe overstated when constructing a space like the KidsRoomthat integrates technology and narrative. The presence ofa story seems to make people, particularly children, morelikely to cooperate with the room than resist it and test itslimits. In the KidsRoom a well-crafted story was used tomake participants more likely to suspend disbelief and morecurious and less apprehensive about what will happen next.The story ties the physical space, the participant’s actions,and the different output media together into a coherent, rich,and therefore immersive experience.

An immersive space is most engaging when participantsbelieve their actions are having an effect upon the environ-ment by influencing the story. The KidsRoom uses com-puter vision to achieve this goal by making the room respon-sive to the position, movements, and actions of the children.The room responds when children are near particular piecesof furniture. The room transforms the instant that everyonejumps on the bed, and the kids can move and steer their"magic boat" using natural body gestures. Monsters fol-low the dance moves of the children,; monsters growlingstop when children hide behind the bed. Monsters reactwhen children scream loudly, but the narrator reacts whenthey don’t. Immediately upon their first interaction with theroom the children realize that what they do makes a differ-ence in how the room responds. This perceptual sensingenhances and energizes the narrative.

A goal that was critical to obtaining the immersive feelof the KidsRoom was to naturally embed the perceptualconstraints into the storyline. For example, in the monsterdance scene, the vision systems require that there is onlyone child per rug and that all children are on some rug.One way to impose this constraint would have been to havethe narrator say, "Only one kid per rug. Everyone must beon a rug." A less obtrusive choice, however, is to embedthe constraint within the behavior naturally assumed of thecharacters. The monsters tell the kids to stay on the rugsand that there can be only one kid per rug "so’s we cansee what’s goin’ on." That the monsters have some visualconstraints seems perfectly natural and makes children lesslikely to feel restrained or to question why they need toengage in some particular behavior.

No matter how well an interactive storyline is designed,participants, especially children, will do the unexpected --especially when there are up to four of them interactingtogether. This unpredictable behavior can cause the percep-tual system to perform poorly. Therefore, we designed thestory so that such errors would not completely destroy theentire immersive experience. When perceptual algorithmsfail the behavior of the entire room degrades gracefully.

One example of this principle in the KidsRoom is in

10

the way the vision system provides feedback during themonster dance. If a child is ignoring the instructions of thecharacters and is too close to another child on a rug, therecognition of the movements of the child on the rug willbe poor. Therefore, when actions are not recognized withhigh confidence, the monsters on the screen will animate,doing the low-confidence action, but the monster will notsay anything. To the child, this just appears as though themonster is doing its own thing; it does not appear that theroom is any way confused. This choice was preferred overthe possibility that the monster says "Great crouch" whilethe kid is actually spinning.

Similarly, we tried to minimize the number of story seg-ments that required a specific, singular action on the part ofall the participants. For instance, to get a particular pieceof furniture to talk, only one child needs to be near it. Ifall children were required to do it they might never discoverhow to make the furniture talk. When a particular actionis required, the story components are designed so that thesystem can handle a problematic participant like a grouchychild or a mischievous graduate student. In such cases, thesystem tries to acknowledge that someone is doing some-thing that they shouldn’t, but if the person continues tocause trouble, the system will continue to progress throughthe story anyway.

Finally, to give the KidsRoom narrative a cohesive, im-mersive feel, there are thematic threads that run throughoutthe story. For example, stuffed animals on the walls in thechildren’s bedroom are similar to the monsters that appearlater in the story. Some of the furniture characters in thefirst world have the same voices as the monsters in mon-ster world. The artwork has the same storybook motif inall four scenes. Some objects in the room on the shelvesbecome part of the forest world backdrop during the trans-formation. Although we are uncertain if kids and adultsconsciously notice these minor story points, attention tosuch detail minimizes the chance that some awkward storyplot will jar some participant out of the immersion that thespace is continually trying to generate.

Children as subjects

Davenport and Friedlander (Davenport & Friedlander 1995)and Druin and Perlin (Druin & Perlin 1994) both observedthat adults visiting their interactive installations sometimeshad difficulty immersing themselves in the narrative. Dav-enport and Friedlander found that it was difficult to makesome people comfortable so they would "play along" with-out getting embarrassed. Children already like to play witheach other in real spaces engaging in rigorous physical ac-tivity, and we thought that they would be more likely thanadults to be motivated by supplementary imagery, soundeffects, music, and lighting.

Building a space for children was both wonderful andproblematic. The positives include the tremendous enthu-siasm with which the children participate, their willingnessto play with peers they do not know, the delight they expe-rience at being complimented by virtual monsters, and theircomplete disregard of small technical embarrassments that

arose during development.Children did provide unique challenges as well. The

behavior of children, particularly their group behavior, isdifficult to predict. Further, children have short attentionspans and often move about with explosive energy, leavingthe longer playing narrations behind. Young children aresmall compared to adults, which can create problems fordeveloping vision algorithms. Having children as subjectsmade it especially difficult to iteratively develop and testeach component of the system on the primary users. Kidswould tire quickly of any sort of repetitive testing of a par-ticular component of the room. But, there is no questionthat the magic of the children’s play is what made the magicof the KidsRoom come to life.

Visual and audio feedback

To further achieve the goal of the keeping the action in thereal, as opposed to virtual, space, we designed the visualand audio feedback to only minimally focus the attentionof the children. Typically, virtual reality systems use semi-realistic three-dimensional rendered scenes and video as theprimary form of system feedback. We decided, however,that in order to give the room a magical, theatrical feel and inorder to keep the emphasis of the space in the room and noton the screens, images would have a two-dimensional sto-rybook look and video would consist of simple, still-frameanimations of those images. Although the animations aresimple, most consisting of fewer than five frames, smalleffects can make the characters more interesting. For ex-ample, when the monsters are standing still, they blink,giving them a more life-like appearance. During much ofthe KidsRoom experience, the video screens are used asmood-setting backdrops and are not always the center of theparticipants’ attention.

Audio is the main form of feedback in the room and hasseveral advantages over video. Realistic or semi-realisticthree-dimensional video animation, even of cartoon mon-sters, is difficult to do well. It is much easier to obtain richand realistic sound effects and narration. More importantly,sound does not require participants to focus on any particularpart of the room. Children are free to listen to music, soundeffects, and narration as they play, run about the space, andtalk to one another. During the scenes where sound is theprimary output mechanism, such as the bedroom and forestworlds, the children are focussed on their own activity inthe space. Finally, combining ambient sound effects withappropriate music can set a tone for the entire space. Audiofeedback can be further enhanced by using spatial localiza-tion. Even with just four speakers, the KidsRoom monstergrowls sound like they are coming from the forest side of theroom, and when the furniture speaks the sound originatesfrom approximately the correct part of the room.

One of our goals was to create an "imagination space" -a place that stimulates creative play by creating a rich envi-ronment but one that leaves much to the child’s imagination.Audio feedback is a powerful tool because it can be used toinform participants of story events while leaving gaps thatchildren can fill in as they like, much as they do in their

everyday play.

Physical constraintsA major challenge when designing the KidsRoom was tominimize the impact of our physical constraints on the de-velopment of an interesting story. This section expands on afew constraints that impact the design of perceptually-basedimmersive environments.

LightingLighting is often used effectively in theater to set a moodand highlight an important event. In fact, lighting has beenused extensively in much of the previous work on immersiveenvironments(see (Bobick et al. 1996) for a review). How-ever, systems that use visual input have been constrained tooperating in relatively bright, constantly lit environments.

Unfortunately, bright white lighting, ideal for computervision, requires a serious compromise on the part of the de-signer of an immersive space. Controlled lighting can be anespecially effective way to influence the physical world in anatural, but powerful way. In the KidsRoom, the only ma-jor lighting effect, the blackout and colored lighting duringthe transformation from bedroom to forest worlds, clearlycaptures the attention of both children and adults. Prior tothe blackout, after running around the room "talking" to fur-niture, children will often be rather hyper. Only momentslater, however, after passing a few seconds in the dark onthe bed, they are typically more calm.

Large projected displays appear dim when placed inbright spaces. Although we obtained satisfactory bright-ness using light blinders, the forest, river, and monster worldwould seem more mysterious if the room were darker andthe forest screens brighter. Unfortunately, bright screensreflect more light back into the room which sometimes de-grade performance of color-based vision algorithms.

Projection screens are light sourcesOne serious obstacle we encountered when developing thespace was the problematic interaction between video dis-plays and computer vision algorithms. Many real-timecomputer vision algorithms require the use of backgroundsubtraction, as do all of the algorithms used by the Kids-Room. However, background subtraction assumes that thebackground is static. Large video screens with changingimages violate this assumption. We were forced, therefore,to choose camera and rug positions so that people in theroom would never appear in front of a screen in the imageviews - a serious limitation for a space with two walls thatare screens.

One problem for those developing immersive environ-ments is that large screens are an effective way to providefeedback. Ideally they should be distributed around theroom. However, in an environment like the cave (Cruz-Neira, Sandin, & DeFanti 1993), where all four walls arescreens, few camera angles would be suitable for our visionalgorithms and those that were would not show useful pro-file views of people near the screens. This limitation mustbe overcome before vision algorithms can be used to their

full potential in such immersive spaces. One solution be-ing studied is using real-time stereo vision for backgroundsegmentation (Ivanov, Bobick, & Lui 1998).

Physical layout

Significant thought and testing was required to physicallylayout the KidsRoom space, and perception-based require-ments had to be carefully considered when developing thenarrative. The problem was finding a way to mount thevideo cameras so that the action that needed to be recog-nized could be interpreted by the vision systems. Even in aspace as large as 24 feet by 18 feet with a high ceiling, cam-era placement was tricky. The top-down camera makes itpossible for the tracking algorithm and motion-energy algo-rithm to operate without explicit reasoning about occlusion.However, the trade-off is that our algorithms were devel-oped with the assumption that a camera could be placedover 20 feet above the room, which is out of the question formost normal indoor, household spaces. Further, we cannoteasily hang an object, lighting fixtures for example, abovethe space since it will block the camera’s view.

The other two cameras used for recognition on the redand green rugs severely constrain the layout of the room.In fact, keeping the rugs away from the screen, away fromeach other, and within the camera view meant there wasonly one configuration in which they could be placed. Thebed and the back two rugs were also under tight positionalconstraints.

Every space-related decision required careful consider-ation of imaging requirements. All objects other than thebed had to be carefully fastened. Rugs and carpet had to beshort-haired, not the least bit shaggy, to prevent the back-ground from changing as people moved around. Objectswere painted in fiat paint to minimize specularity from thebright lighting and transfer of color from objects to peopleby reflection. Any object in the line-of-sight of any cameracould not have specular metal parts. Even the sheets on thebed were tightly fastened to the frame so its image wouldnot change.

The use of narrative in the KidsRoom made it possiblefor us to overcome these limitations by carefully integratingvisual constraints into the story constraints. However, thesetype of stringent requirements are clearly limiting for theinteractive story designer.

Unencumbering audioConstraints on audio processing currently restrict the type ofstories that could be developed for the KidsRoom. The firstconstraint is lack of effective sound localization. The fourspeakers in the KidsRoom provide reasonable localizationwhen listeners are near the center of the room. However,when a participant is near a loudspeaker playing a sound,that single speaker tends to dominate the positional perceptand the spatial illusion breaks down. Sound localization isimportant if real objects in a space are to be given "person-alities" using sound effects, because the effect is destroyedif the sound is not perceived to come from the object. Onesolution to this problem is to populate the room with many

small speakers concealed carefully in walls and some largedevices so that sound can really can come directly from a"talking" object.

The second audio issue left for future systems is the use ofspeech recognition. Unlike some other immersive environ-ments that require portable microphones (Coen 1997), theKidsRoom has no mechanisms for understanding speech.Although large-vocabulary speaker-dependent and small-vocabulary speaker-independent continuous speech recog-nition systems are available, they generally require the userto wear a noise-canceling headset microphone. Adding un-encumbering speech recognition capability to a space likethe KidsRoom with loud music, loud sound effects, andloud children remains a significant challenge. Clearly, theability to recognize some speech would expand the typesof interactive rooms that can be constructed, since somefeedback from the participants to the room is accomplishedmost naturally using speech not visual gesturing.

Observations and failures

There were some important issues that we failed to considerin the design phase of the KidsRoom that are important fordeveloping other immersive spaces, particularly those forchildren. We present several in an effort to prevent othersfrom repeating our mistakes.

Group vs. individual activity

The interaction in the KidsRoom changes significantly de-pending upon the number of people in the space. First, asmentioned previously, all system timings differ dependingupon the number of people in the room. There is only a smallwindow of time outside of which each unit of the experiencebecomes too short or too long- and the ideal timing changesbased on the number of people around. Since automaticallysensing when people are getting bored is well beyond ourcurrent perception capability, the KidsRoom uses an ad hocprocedure to adjust the duration of many activities depend-ing upon the number of people in the space.

In general, the more children that are in the space, themore fast paced the room appears to be, since as soon asingle child figures out the cause and effect relationshipbetween some activity and response the other children willtbllow. A single child is more hesitant and therefore needsmore time to explore before the room interjects. Also, alone child often requires more intervention from the systemto guide him or her through the experience.

A final consideration when developing for group activityis the importance of participants being able to understandcause and effect relationships. If too much is happening inthe room and there is not a reasonable expectation within thechild’s mind of strong correlation between some action and areasonable response, the child will not understand that he orshe has caused the action to happen. Our failure to considerthe importance of clear cause and effect relationships isdiscussed in the next section.

Exploratory vs. linear spaces

In our initial design of the KidsRoom, we had plannedto create a mostly exploratory space, modeled somewhaton popular non-linear computer games like Myst (Broder-bund Software 1994). We designed and built prototypes forthe first and second worlds using this model. In the firstworld, there was no talking furniture. Instead, when chil-dren walked near objects they made distinctive sounds. Forexample, stepping on a rug with a frog on it would make a"ribbit" sound and moving near a real or virtual window andputting an arm towards the window generated a loud crashof glass.

Our hope was that children would enter, one at a time,figure out that they could make such sounds, and then ex-plore the room, gradually creating a frenzy of sounds andactivity. It didn’t work. When we brought in some childrenfor testing we found that they did not understand that theywere causing the sounds - there was too much going on.Even when only one child was in the space, the cause andeffect relationship was not explicit enough for the childrento grasp. The same was true for some adults.

Part of the problem was that there is a slight (less than 500ms) lag in the perceptual system. The larger problem, how-ever, was spatial accuracy. We could only tell when a childwas close to the window, so the system could not distinguishbetween a kid standing next to the window making smash-ing gestures and a kid that just ran right by the window.Even if our sensing had been more precise, however, webelieve that with multiple children there would be too muchhappening in the environment to make the space engaging.We completely revised the first world, using the more lineartalking furniture model. We had similar problems with thesecond (forest) world.

Clearly, it is easier for an immersive environment to re-spond to action taking place in a linear story because thenumber of action possibilities at any moment is more con-strained. Moreover, other authors have observed that ex-ploratory, puzzle-solving spaces can sometimes make itdifficult for adults to immerse themselves in an interac-tive world (Druin & Perlin 1994; Davenport & Friedlander1995). When a story is added to the physical environment,a theatrical-like experience is created. Once the theatricalnature of the system is apparent, it is easier for people toimagine their roles and, if they are not too self-conscious,act them out.

Anticipating children’s behaviorFrom testing with children, we learned that there are threeaspects of children’s behavior in the space that we had notadequately considered during narrative development.

First, the story must take into account the children’s be-havioral "momentum." The KidsRoom is capable of makingchildren somewhat hyperactive. By the end of the first bed-room world when the furniture all start loudly chanting themagic word, the kids are often running energetically aroundthe room. Next the kids end up on the bed, scream themagic word loudly, and the transformation occurs. Thenthe kids are in the forest world, and they are expected to

13

explore. However, the transformation typically calms themdown and their tendency is often to stay right where they areon the bed. We found through testing that it takes a fairlydirect instruction (e.g. "Follow the path..."), sometimes re-peated several times, to get them to start moving again. Wefailed to anticipate this behavior and had to later modifythe storyline to improve the interaction. When designingfor a space where physical action is the focus, behavioralmomentum needs to be considered.

A related problem was the need for attention-grabbingcues. Particularly when kids are in a state of high physicalactivity, they almost never hear the first thing that the roomsays to them. Since we did not anticipate this problem,sometimes children missed important instructions. Again,we had to modify the narrative so that it repeats some crit-ical instructions more than once. Ideally, we should havebuilt attention-grabbing narrative into the storyline for ev-ery critical narration. Since most children have not been inany type of "active room" before, preparing them briefly be-fore they enter by telling them to be attentive to the room’sinstructions has proven effective.

Finally, children need to clearly understand the currenttask. The less certain they are of what to do, the moreunpredictable their behavior becomes. The kids seem moreinclined to wait for things to happen than to explore and tryto make things happen, so when the room didn’t provide agoal, they would tend to sit on the bed and talk or just lookaround. The kids seemed to enjoy the experience the mostonce the system was modified so that there was always aclear task and when those tasks changed quickly.

Avoiding repetitiveness

One way to break the immersive experience is for the systemto repeat a single narration as it tries to encourage somebehavior. Unfortunately, in a space like the KidsRoomespecially built to encourage children to physically movearound, instructions do need be repeated because sometimesthey are not heard or are ignored. For example, in thedance segment of the monster world, the control programcontinually checks if someone has stepped off their rug orif two people are on the same rug. If someone drifts offa rug more than once, narration is needed, but repeating anarration just played moments before imparts a mechanicalsense to the responses and causes the entire experience tofeel less immersive.

One solution we developed was to use two different nar-rators. The main narrator has a deep, male voice. Thesecond narrator, with a soft, whispered, female voice, de-livers "hints." The first time someone gets off a rug, themonsters will tell the person to get back on. After that,however, a voice whispers a hint, "Stay on your rug." Thistype of feedback is easily understood by room participantsbut does not break the flow of the story and primary nar-ration. Hints are used throughout the KidsRoom and wereeffective at - in fact, required for - successfully improvingour narrative where our original storyline was problematic.For example, when the kids don’t put the bed back in itsproper position after the river world, they will hear several

hints about putting it back and a hint telling them where toput it.

Hints do not solve the repetition problems, however. MostKidsRoom hints have several variations so that if they mustbe repeated they sound slightly different each time. Mostnarration, though, due to time constraints, does not. It be-came clear that practically every narrative instruction andcertainly every hint needs at least three different varia-tions so the control program can repeat instructions with-out breaking the immersive feel. Ensuring that the systemavoids repetitive narration adds some additional complexityto the control code.

Sensitivity to timing

People are sensitive to small timing problems and the controlarchitecture proved to be sensitive to small changes in timingvalues. Multiple people in a space increases the number ofpossible situations and responses that are required, therebymaking the problem more difficult.

A typical example is as follows: Several kids are on thebed rowing down the river. A narration of a four secondduration begins to play. During the first second, someonegets off the bed. However, the man-overboard narration(which lasts four seconds as well) cannot play because thefirst narration has not yet completed, and the system does notrespond. If the person gets back on the bed quickly, he orshemay not realize the room recognized what happened. Justas problematic is if the person stays off the bed for severalseconds. Then the overboard narration eventually plays butthree seconds too late. The system feels sluggish, eventhough the recognition systems could detect the overboardevent almost instantaneously; the room just had no way torespond. This problem is encountered in many situations,particularly those where several different tests are activesimultaneously.

In general, our narrations are far too long, often takingseveral seconds to read. Since we have no way to interruptnarration, it is difficult to adjust the control timing to avoidsluggish response. Shorter narrations and short sound ef-fects would help, but not solve the problem. Even if wecould cut the sound of a narration mid-phrase, such unnat-ural cutoffs are likely to break the suspension of disbelief.

Another difficulty encountered was the tuning of timerparameters used by the control program (see (Bobick et al.1996) for a discussion of the control architecture). Althoughthe timer controls are fairly straightforward to program, itis time-consuming to tune the narrative control system sothat the room responds naturally throughout each stage ofthe story. For example, in a given segment of the story, itis possible to program timers so that the interaction feels"right" when there is one person in the room following theroom’s instructions. However, when three people are inthe room and they are not cooperating, additional narrationsmay be required since room participants may perform ac-tions differently and take longer or shorter amounts of time.Consequently, the timing can be thrown off and the roomcan start to feel unresponsive. Further, the more people thatare in the room, the more situations the control program

14

must handle. Before long, the number of different situa-tions has forced the programmer to add many special timingconditions that increase the complexity of the program andmake it time-consuming and tricky to adjust. Handling suchproblems while maintaining a feel of quick responsivenesswithout generating repetitive narration has required a largeamount of effort and experimentation. Some of the authorsare working on methods that represent temporal events morenaturally (Pinhanez & Bobick 1997) that might potentiallyhelp identify tricky timing situations automatically or semi-automatically.

The lack of a systematic method of checking for timinginconsistencies was most painfully apparent as we testedthe nearly-completed system. The room provides a 10-12minute experience, and thorough testing of every timingscenario is out of the question due to the large number ofpossible timing situations. Further, once the room is tuned,any small change to any timing-related code requires havingseveral people around to interact in the space.

Wasting sensing knowledgeSometimes the system has the capability to detect somesituation but no way of informing the participants. Giventhe impoverished sensing technology, no knowledge shouldbe wasted - all information should be used to enhance thefeeling of responsiveness.

For example, we encountered this problem when the chil-dren shout. The room asks for the kids to shout the magicword. The kids shout but not loudly. The room then re-sponds with, "Try that shout one more time." Initially wefelt that kids like to scream, so we would encourage themdo it twice. The problem was that the room responded withambiguous narration. Are they doing it again because itwasn’t good enough or for some other reason? Worse, ifnobody shouted anything, the narration mildly suggestedthat the room actually heard a shout. As we improved thesystem, we added hints and new narration to try and ensurethat when the system knows something (e.g. how loud scream is) it lets the participants know that it knows. Theeffect is a room that feels more responsive. The downside,however, is that the amount of narration required is at leastthree times what it was before.

Similarly, there are times when the system is dealing withuncooperative participants and, despite several attempts, hasnot elicited the desired activity. This occurs, for example, ifsomeone refuses to get on the bed in the bedroom or mon-ster worlds even after the mother is heard telling them todo so three times (with three different narrations). In thiscase, we decided to have the narration move on anyway,ignoring the trouble-maker. However, it is important toexplicitly acknowledge that the system understands some-thing is wrong but is ignoring the problem. One examplein the KidsRoom is when nobody shouts "Be quiet" to themonster. The narrator says, "Well, I can’t say that was avery loud shout, but perhaps the monsters will figure it out."However, currently the KidsRoom does not have a similarnarration if someone refuses to get on the bed. Hence, thestory moves on without the room explicitly indicating that

it knows someone is still off the bed, and an opportunity todemonstrate responsiveness is wasted.

Perceptual expectation

Given the technological limitations on unencumbered sens-ing in immersive environments like the KidsRoom, the nar-rative must be carefully designed so that it does not setupany perceptual expectations that cannot be satisfied. Forinstance, use of some speech recognition in the KidsRoommight prove problematic. If a child or adult sees that theroom can respond to one sentence, the expectation may beestablished that the characters can understand any utterance.Any immersive environment which encourages or requirespeople to test the limits of the perception system is morelikely to feel more broken and unresponsive than immer-sive.

The KidsRoom is not entirely immune to this problem, butwe believe we have minimized the "responsiveness testing"that people do by making the system flexible to the typeof input it receives (e.g. in the boat scene most any largebody movement will be interpreted as rowing) and by havingcharacters in the story essentially teach the participants whatthey can and cannot recognize (i.e. the allowed dance movesin the monster land).

Summary and contributionsThe KidsRoom went from whiteboard sketches to a fullinstallation in eight weeks. We believe the KidsRoom isthe first perceptually-based, multi-person, fully-automatedinteractive, narrative playspace ever constructed, and theexperience we acquired designing and building the spacehas allowed us to identify some major questions and topropose a few solutions that should simplify construction ofmore complex spaces in the future.

We believe the KidsRoom provides several fundamen-tal contributions. First, is the demonstration that non-encumbering sensors can be used for the measurement andrecognition of individual and group action in a complex in-teractive narrative. Second, unlike most previous interactivesystems, the KidsRoom does not require the user to wearspecial clothing, gloves or vests, does not require embeddingsensors in objects, and has been explicitly designed to allowmultiple simultaneous users. Third, the KidsRoom movesbeyond just measurement of position towards recognition ofaction using measurement and context - particularly con-text generated by the system itself. Finally, we believethe KidsRoom is a unique and fun children’s environmentthat merges the mystery and fantasy of children’s storiesand theater with the spontaneity and collaborative nature ofreal-world physical play.4

4The KidsRoom owes much of its success to the fact that itsconcept and design was developed, collaboratively, by all ofthe authors. Aaron Bobick acted as advisor and coordinator.Stephen Intille (interaction control; production issues) and JimDavis (vision systems) were the chief architects of the sys-tem. Freedom Baird, with help from Arjan Schtitte, wrote theoriginal script and handled most narrative recording. Claudio

15

ReferencesBobick, A.; Intille, S.; Davis, J.; Baird, E; Camp-bell, L.; Ivanov, Y.; Pinhanez, C.; Schutte, A.; andWilson, A. 1996. The KidsRoom: A perceptu-ally-based interactive and immersive story environment.M.I.T. Media Laboratory Perceptual Computing Sec-tion 398, M.I.T. Media Laboratory Perceptual Com-puting Section. Revised September 1997, see alsohttp://vismod.www.media.mit.edu/vismod/demos/kidsroom.Broderbund Software. 1994. Myst. An interactive CD-ROM.

Coen, M. 1997. Building brains for rooms: designingdistributed software agents. In Proc. of the Conf. on In-novative Applications of Artificial Intelligence, 971-977.AAAI Press.

Cruz-Neira, C.; Sandin, D.; and DeFanti, T. 1993.Surround-screen projection-based virtual reality: The de-sign and implementation of the CAVE. In Proceedingsof SIGGRAPH Computer Graphics Conference, 135-142.ACM SIGGRAPH.

Davenport, G., and Friedlander, G. 1995. Interactivetransformational enviromnents: Wheel of life. In Barrett,E., and Redmond, M., eds., Contextual Media: Multime-dia and Interpretation. Cambridge, MA USA: MIT Press.chapter 1, 1-25.Davis, J., and Bobick, A. 1997. The representation andrecognition of action using temporal templates. In Proc.Computer Vision and Pattern Recognition, 928-934. IEEEComputer Society Press.

Druin, A., and Perlin, K. 1994. Immersive environments:a physical approach to the computer interface. In HumanFactors in Computing Systems ( CHI), 325-326.

Intiile, S.; Davis, J.; and Bobick, A. 1997. Real-timeclosed-world tracking. In Proc. Computer Vision and Pat-tern Recognition, 697-703. IEEE Computer Society Press.Ivanov, Y.; Bobick, A.; and Lui, J. 1998. Fast lightingindependent background subtraction. In IEEE Workshopon Visual Surveillance - VS’98, 49-55.Maes, P.; Pentland, A.; Blumberg, B.; Darrell, T.; Brown,J.; and Yoon, J. 1994. Alive: Artificial life interactivevideo environment. Intercommunication 7:48--49.

Pinhanez, C., and Bobick, A. 1997. Human action de-tection using PNF propagation of temporal constraints.Technical Report 423, M.I.T. Media Laboratory Percep-tual Computing Section, 20 Ames Street, Cambridge, MA02139.

Pinhanez was in charge of music and light design and control.Lee Campbell worked on sound control and related issues. Yurilvanov and Andrew Wilson wrote the animation control code.Composer John Klein wrote the original KidsRoom score, andartist Alex Weissman created the computer illustrations.

16

design decisions for interactive environments: evaluating ...€¦ · ers designing such...

Documents