
VLAS: A Vision-Language-Action Integrated System for Mobile Social Service Robots

Kyung-Wha Park1, Jin-Young Choi2, Beom-Jin Lee3, Chung-Yeon Lee3, Injune Hwang3, Byoung-Tak Zhang1,2,3,4

1 Brain Science Program, 2 Cognitive Science Program, 3 Department of Computer Science and Engineering, Seoul National University, 4 Surromind Robotics

{kwpark, jychoi, bjlee, cylee}@bi.snu.ac.kr, [email protected], [email protected]

Abstract

Due to their mobility and attractive appearance, recent mobile social service robots are well suited to the human-robot interactive roles expected of domestic service robots, such as a waiter in a restaurant or an elderly companion at home. However, commercialized service robots are not equipped with high-end devices and are hard to retrofit, so they are more limited in performance and device extensibility than the custom robots built in laboratories. We propose the Vision-Language-Action integrated System (VLAS), which alleviates these hardware weaknesses by applying advanced machine learning techniques to leverage integrated high-level information, much as humans make good use of imperfect physical components through multi-sensory perception. We set up a pseudo-home environment and a home party scenario to evaluate VLAS quantitatively and qualitatively. The experimental results show that VLAS robustly performed the various sub-tasks of the home party scenario on a social service commercial robot.

1 Introduction

Social service commercial robots (SSCR), which were once available only in laboratories or through broadcast media, are now being released by many companies every year as finished products. They already serve customers in a variety of environments, such as airports, restaurants, and hotels. The advantage of a commercial robot is that artificial intelligence researchers who are not experts in hardware and control (manipulation) can still easily experiment with robots in a real-world environment. In particular, Pepper, a child-sized mobile SSCR depicted in Figure 1, is well suited to HRI experiments because it is user-friendly in appearance and offers natural, animated gestures through the API provided by the manufacturer. Robot actions that rely on hardware actuation can also be accomplished with the API. However, it is still difficult to operate robustly in a real environment because of problems such as speech recognition failure, vibration during robot actions, and dynamic obstacles such as humans or complex maps.

Figure 1: Overview of the VLAS system.

An SSCR deployed in the real world typically behaves like a vending machine, just standing there and focusing on its own tasks. The problems of SSCR can be divided into three areas: vision, language, and action. The action problem, represented chiefly by navigation, is usually solved with high-performance sensors, but that option is unavailable for a commercial robot because it is difficult to retrofit. Therefore, navigation-related parameters must be searched for and tuned for each robot platform and operating environment. It should be noted that these parameters interact with the other problems. For example, in vision-related tasks such as object detection, when the robot moves to find a target, the camera shakes because of the vibration generated by the body, and detection performance drops rapidly. To overcome this, a vision model with a very high frame rate and excellent detection performance is needed, together with a system seamlessly connected to the action module. Likewise, speech recognition (language) requires not only an algorithm with superior performance but also actions, such as social gestures, which are necessary for natural interaction with the robot. Moreover, many tasks require more than one module, and because there are many exceptions that cannot be verified without experiments in a real environment, we need to identify scenarios in which social robots are deployed in a social space and interact with people.

In this paper, we propose VLAS¹ for SSCR to handle multiple social-related tasks.

¹ Demonstration videos: https://youtu.be/Ay4LunyZVLk


Figure 2: Flowchart of the cocktail party experiment. The three striped rectangles describe the three major tasks of the scenario. The four colored shapes represent the four modules. Confirmation and exhaustive looping exist for each module but are omitted for readability.

To accomplish these tasks, we applied state-of-the-art machine learning methodologies that are well suited to the machine specification and that deliver optimal performance. The system proved its robustness and speed through experiments in an actual home situation. We have released this system's source code and sub-module dataset as open source to contribute to SSCR research. Using this system, we succeeded in solving very complex scenarios in the pseudo-home environment, and we scored high points in a robot competition that runs similar scenarios under a time limit.

2 Related Works

Developing integrated systems that combine methodologies from many different research areas is especially crucial for multi-module robotic systems [Brooks, 1986; Maes and Brooks, 1990]. Nesnas et al. proposed CLARAty (the coupled layered architecture for robotic autonomy), a domain-specific robotic architecture that focuses on reducing the overhead of developing custom robotic infrastructure, e.g., middleware, and on integrating components from different hardware platforms [Nesnas et al., 2003]. Zender et al. introduced the "CoSy Explorer" system: cross-modal integration, ontology-based mediation, and multiple levels of abstraction of perception [Zender et al., 2007]. They studied combining semantic knowledge and description logics to provide general reasoning capabilities for a custom robot.

Today, robotic systems gain even more abilities through the combination of different methodologies from different areas [Siepmann et al., 2014]. Siepmann et al. achieved success in the domestic service robot league using their proposed building-block style integrated system. The system is characterized by a reusable design of the multiple modules required for each skill, such as path planning and obstacle avoidance.

Previously, research focused on systems that combine low-level sensor data and decision-making perception for self-produced or custom robots, so-called domestic service robots [Wachsmuth et al., 2015]. The commercial platform robots that emerged as a result of these studies have relieved the difficulty of low-level manipulation and shifted the emphasis to human-robot interaction and socialization. Using small social robots designed for children, Gordon et al. developed a system that learns emotional reactions, obtained as rewards from social robot interaction, in the education of preschool children [Gordon et al., 2015]. Recently, a study using a child-sized SSCR with mobile capability has been conducted [Lee et al., 2018], but it is difficult to find others, because it is hard to perform various tasks with a robot whose sensors cannot be modified. For example, if navigation performance is poor, a better laser sensor could solve the problem, but an SSCR is vulnerable to such a problem, which must be solved by the system alone. In general, the SSCR carries no onboard machine for intensive computation, so there is a limit to applying the latest deep learning models required for the vision and aural processing that is essential for social-related tasks and for combining the sensors' outputs. Since this paper assumes a home environment, we solve these problems using a remote GPU server. We developed the system by choosing the deep learning models that best suit the tasks to be solved.

3 System Overview

In the subsections below, we describe the proposed VLAS at the modular level. Figure 1 shows the overall structure of the system, and Figure 2 is a flowchart of a cocktail party scenario. The four colored shapes in Figure 2 represent the four modules (vision, language, action, and perception) in the system; they are depicted to show that the modules are used in combination. The figure also shows three main tasks, each consisting of multiple sub-tasks. A sub-task refers to one possible task on a module-by-module basis. For example, the "taking the order" task in the first rectangle next to the start arrow performs about 12 sub-tasks, starting with person detection. A sub-task depicted with a grey, rounded rectangle, such as "rotating," is performed simultaneously with the other tasks. The output of each module is integrated in the perception module to enable decision making and coordination among the modules. In this system, all the low-level information is stored in the perception module, and the robot uses it to become aware of the current situation. In the following subsections, we categorize the sub-tasks and explain the required capabilities of each module, and then the rationale for the machine learning methodologies we apply to the system.

3.1 Vision Module

The sub-tasks to be performed by the vision module in an indoor social interaction environment are as follows: i) object/person detection and description, ii) person identification, iii) object/person counting, and iv) human pose detection. Person detection is the most basic sub-task of a social robot; it classifies the appearance of a person in images taken from the monocular camera. We need a visual detection model that simultaneously classifies multiple targets, including people and objects, in the perceived scene. The model should also show robust performance across human postures.

Various objects, as well as people, need to be recognized, and a large number of labels should be usable in classification. The typical labels we considered were person, table, chair, sofa, shelf, bottle, cup, TV, and gender. There is also a need for a machine learning model that produces human-readable descriptions of classified objects and people, for example, "The man wearing a black shirt and white shorts is standing next to the sofa." This is useful when the robot needs to explain a situation to a person or describe the appearance of a guest who ordered a meal at a restaurant, and also when it needs to find that customer again for the person identification sub-task.

Assuming a cocktail party at home, the robot must perform posture recognition to find a guest with a raised hand or a seated guest. In this system, we implemented a deterministic algorithm that recognizes joint positions and classifies poses according to those positions, so that these two poses can be distinguished.
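As a rough illustration of such a deterministic rule set, the following sketch classifies a pose from OpenPose-style 2D keypoints given as pixel coordinates (origin at the top left, y increasing downward); the joint names, thresholds, and rules are illustrative assumptions rather than the exact rules used in the system.

```python
# Minimal sketch of a deterministic pose classifier (illustrative, not the
# system's exact rules). Assumes OpenPose-style 2D keypoints in pixel
# coordinates with the origin at the top-left corner (y grows downward).

def classify_pose(kp, sit_ratio=0.35):
    """kp: dict mapping joint name -> (x, y), or None if the joint is missing."""
    def y(name):
        return kp[name][1] if kp.get(name) else None

    # "Waving": a wrist detected above (smaller y than) the corresponding shoulder.
    for side in ("left", "right"):
        wrist, shoulder = y(f"{side}_wrist"), y(f"{side}_shoulder")
        if wrist is not None and shoulder is not None and wrist < shoulder:
            return "waving"

    # "Sitting": hip and knee nearly at the same height relative to torso length.
    hip, knee, neck = y("right_hip"), y("right_knee"), y("neck")
    if None not in (hip, knee, neck):
        torso = abs(hip - neck) + 1e-6
        if abs(knee - hip) / torso < sit_ratio:
            return "sitting"

    return "standing"


print(classify_pose({"neck": (200, 100), "right_hip": (200, 220),
                     "right_knee": (205, 250), "left_wrist": (150, 80),
                     "left_shoulder": (170, 120)}))   # -> "waving"
```

In practice such thresholds would be tuned to the robot's camera height and viewing angle.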

Since there is an approaching task that requires both vision and action to identify and locate a specific person, it is necessary to estimate the distance to the recognized person on the screen. On a robot platform with an RGB-D camera, this step is easily achieved by direct distance measurement.
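A minimal sketch of this distance estimate, assuming a depth image in millimeters aligned with the color frame and a bounding box in pixel coordinates (both assumptions for illustration):

```python
import numpy as np

def person_distance(depth_mm, bbox):
    """Median depth (in meters) inside the central region of a person's bounding box.

    depth_mm: HxW numpy array of depth values in millimeters (0 = invalid reading).
    bbox: (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    x0, y0, x1, y1 = bbox
    # Use the central third of the box to avoid background pixels at the edges.
    cx0, cx1 = x0 + (x1 - x0) // 3, x1 - (x1 - x0) // 3
    cy0, cy1 = y0 + (y1 - y0) // 3, y1 - (y1 - y0) // 3
    patch = depth_mm[cy0:cy1, cx0:cx1].astype(float)
    valid = patch[patch > 0]
    if valid.size == 0:
        return None  # no usable depth readings
    return float(np.median(valid)) / 1000.0  # millimeters -> meters

depth = np.full((240, 320), 1800, dtype=np.uint16)   # toy frame: everything at 1.8 m
print(person_distance(depth, (100, 40, 180, 200)))    # -> 1.8
```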

Based on the above description, the outputs that the vision module passes to the perception module are as follows: i) all recognized labels on the screen, ii) the recognized person's description, identification, and joint position information, and iii) the on-screen position of a sitting or standing person and the distance measured by RGB-D. Multiple labels should be checked in real time with bounding boxes.

The typical objects mentioned above are not enough for a robot operating in a real environment, so we implemented a histogram-based image matching process (HIP) to give the visual module the ability to add labels. For example, there is a label "bottle" but no label such as "coke can," so a coke can is recognized under the "bottle" label by the visual module. To solve this problem, the coke can is photographed by the robot at various angles.

Algorithm 1: Sub-modular function in the visual module.

function FindPeople(P, C ≡ [attr1 : val1, attr2 : val2, ...], distance)
    Input: P: data from the perception module; C: conditions regarding attributes;
           distance: range limit
    Output: result
    for p in P do
        if p.attr == 'person' AND [p.attr == C.attr for attr in C] AND
           C.location_x − distance ≤ p.location_x ≤ C.location_x + distance AND
           C.location_y − distance ≤ p.location_y ≤ C.location_y + distance then
            result.append(p)
        end
    end
    return result

A histogram is extracted from these images, and the average values are learned. An image that the visual module first recognizes as "bottle" is then assigned the post-processing label ("coke can" or "orange juice") whose average histogram value is closest to that of the image.
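A minimal sketch of this histogram-based post-processing, assuming OpenCV HSV histograms and correlation-based comparison; the restriction to the "bottle" label and the reference-building step are illustrative assumptions, not the exact HIP implementation.

```python
import cv2
import numpy as np

def hsv_hist(bgr_img, bins=32):
    """Normalized hue-saturation histogram of an image crop."""
    hsv = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def build_references(samples):
    """samples: {fine_label: [crop, crop, ...]} photographed at various angles.
    Returns the average histogram per fine-grained label."""
    return {label: np.mean([hsv_hist(c) for c in crops], axis=0)
            for label, crops in samples.items()}

def refine_label(crop, coarse_label, references):
    """Replace a coarse detector label (e.g. 'bottle') with the fine-grained
    label whose average histogram is closest to the crop."""
    if coarse_label != "bottle":        # only refine labels we have references for
        return coarse_label
    h = hsv_hist(crop)
    scores = {label: cv2.compareHist(h.astype(np.float32),
                                     ref.astype(np.float32),
                                     cv2.HISTCMP_CORREL)
              for label, ref in references.items()}
    return max(scores, key=scores.get)  # e.g. 'coke can' or 'orange juice'
```

Reference histograms would be built once from the crops collected while photographing each object, and refine_label would then be applied to every detection whose coarse label has registered references.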

Algorithm 1 describes how the vision module cooperates with the perception module to solve a sub-task such as finding a person. All information entered into the perception module has the form (attribute : value). For example, the output of the posture classifier is passed to the perception module as ('pose' : 'sitting'). Description or positional information depicting the human figure on the screen is also stored in the memory module, under the name P in the algorithm. The FindPeople function finds the entries that match the condition in the data stored in P. Here, the condition C has a form similar to P and acts as a keyword to recall memories in P. In this scenario, for example, if we are looking for a person named James who has ordered a drink, C would be {'name' : 'James', 'location_x' : '0.5', 'location_y' : '0.3'} (where 'location' is the coordinate at which the previous order was received). The distance condition is sometimes used to prevent the search range from becoming too wide. A Count function counts the number of times C is recognized in P. This pattern is representative of the algorithms used throughout the system.
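A minimal runnable version of the FindPeople routine in Algorithm 1, assuming each percept in P is a Python dictionary of attribute-value pairs; field names such as 'location_x' follow the example above and are otherwise illustrative.

```python
def find_people(P, C, distance=None):
    """Return percepts in P that describe a person and match the conditions C.

    P: list of dicts, e.g. {'attr': 'person', 'name': 'James',
                            'location_x': 0.5, 'location_y': 0.3, ...}
    C: dict of required attribute-value pairs; 'location_x'/'location_y' in C
       act as the centre of the search range when distance is given.
    distance: optional range limit around the remembered location.
    """
    result = []
    for p in P:
        if p.get('attr') != 'person':
            continue
        non_spatial = {k: v for k, v in C.items()
                       if k not in ('location_x', 'location_y')}
        if any(p.get(k) != v for k, v in non_spatial.items()):
            continue
        if distance is not None:
            if abs(p.get('location_x', float('inf')) - C['location_x']) > distance:
                continue
            if abs(p.get('location_y', float('inf')) - C['location_y']) > distance:
                continue
        result.append(p)
    return result

# Find James near the place where his order was taken.
memory = [{'attr': 'person', 'name': 'James', 'location_x': 0.52, 'location_y': 0.28},
          {'attr': 'person', 'name': 'Anna', 'location_x': 0.9, 'location_y': 0.1}]
print(find_people(memory, {'name': 'James', 'location_x': 0.5, 'location_y': 0.3},
                  distance=0.1))
```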

YOLO9000 [Redmon and Farhadi, 2016] was applied to person/object detection, and DenseCap [Johnson et al., 2016] was applied for post-processing and for obtaining descriptions of recognized persons/objects. For person identification, we refer to Liao's study [Liao et al., 2015]. OpenPose [Cao et al., 2017] was applied to obtain the joint information required for waving/sitting detection.

3.2 Language Module

The language module is mainly responsible for communication with people, and speech recognition is its main sub-task. Since the robot operates autonomously after a scenario starts, user intervention is minimized and the task is performed by listening to command statements. In some scenarios the command statement is complicated, for example, "Go to the kitchen, find John and tell him your order is not available." It is important for the robot to recognize speech quickly, accurately, and versatilely. A short response time matters not only because of the time limit in some competitions, but also for the satisfaction of the human-robot interaction. The language module is not limited to the cocktail party scenarios and thus requires general-purpose handling of robot commands. To achieve this versatility, we implemented an encoder-decoder model as a sub-module that converts colloquialisms into robot commands; the results are shown in a later section.

Autonomous speech recognition based on dynamic contextual words

When operating in a place with many people, speech recognition often fails due to ambient noise. In our system, not only is a threshold applied as a de-noising filter, but the context is also considered. For example, when the robot takes an order in a restaurant, almost all of the words to be recognized belong to the menu. Therefore, by providing the menu list as context words, the search space can be reduced to improve performance. This can be expressed as P(x | c1, c2, ..., cn), where x is the target sentence to be recognized and c1, ..., cn are the context words, which can be words or places. Currently, as initial research, we implement this as a deterministic method that directly supplies context words. Context words consist of names, objects, and menu items frequently used in the scenarios, words for calls and interrupts (e.g., "Pepper!" and "Stop!"), and words for social conversation (e.g., common questions). The experiment section compares giving the context words all at once (static) with changing them dynamically depending on the situation.
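A minimal sketch of the dynamic-context-word idea, assuming the recognizer returns an n-best list of (sentence, score) hypotheses that can be re-scored; the situation labels, word lists, and bonus weight are illustrative assumptions rather than the deployed de-noising and context mechanism.

```python
# Illustrative re-scoring of an n-best speech-recognition list with
# situation-dependent context words (not the deployed implementation).

CONTEXT_WORDS = {
    "taking_order": {"coke", "orange juice", "green tea", "water"},
    "interrupt":    {"pepper", "stop"},
    "small_talk":   {"name", "weather", "party"},
}

def rescore(nbest, situation):
    """nbest: list of (sentence, acoustic_score) pairs, higher score = better.
    Boost hypotheses that contain words expected in the current situation."""
    context = CONTEXT_WORDS.get(situation, set())
    def boosted(item):
        sentence, score = item
        hits = sum(1 for w in context if w in sentence.lower())
        return score + 0.5 * hits          # simple additive bonus per context hit
    return max(nbest, key=boosted)[0]

nbest = [("I would like a coconut", 0.61),
         ("I would like a coke can", 0.58)]
print(rescore(nbest, "taking_order"))      # -> "I would like a coke can"
```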

We wanted to apply machine learning models to all the sub-tasks mentioned above, as far as the machine's capabilities allow. Since the vision module's GPU usage is dynamic and heavy, speech recognition uses an external API so that it does not burden the machine.

The information passed from the language module to the perception module is a text sentence, the speech-to-text result. This is parsed by the perception module to determine whether it is a command statement or social dialogue. According to the result, the robot executes the command or carries out a predetermined social conversation.
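A minimal sketch of such parsing, assuming a fixed set of command verbs; the verb list and the two-way split are illustrative assumptions, not the exact parser in the perception module.

```python
# Illustrative rule-based split between robot commands and social dialogue.

COMMAND_VERBS = {"go", "find", "bring", "take", "follow", "tell", "approach"}

def classify_utterance(text):
    tokens = text.lower().replace(",", " ").split()
    if any(t in COMMAND_VERBS for t in tokens):
        return "command"
    return "social_dialogue"

print(classify_utterance("Go to the kitchen, find John and tell him your order is not available."))
# -> "command"
print(classify_utterance("How are you today?"))
# -> "social_dialogue"
```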

Colloquial command translating sub-module

Before the recognized text is passed to the perception module, we apply an additional sub-module that is robust to the vocabulary used in real life. In general, when we issue a command to a home robot, we must say a sentence that matches a predefined command set and syntax; such commands are far from colloquial speech. To increase the vocabulary and the number of sentences the home robot can handle, we applied encoder-decoder based machine translation. This neural network model, named Colloquialisms-to-Command (Coll2Comm), is based on Seq2seq [Sutskever et al., 2014] and is trained on 20k collected pairs that translate the statement forms used in general-purpose robot commands into everyday colloquialisms. We collected data that "translates" an explicit robot command statement ("go to the kitchen, find fruit, and bring it to me") into multiple colloquial sentences ("get me a fruit from the kitchen," etc.). The dataset, as well as the source code of this system, will be open to the public.
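A minimal PyTorch sketch of a Seq2seq-style encoder-decoder of the kind Coll2Comm is based on; the vocabulary sizes, network dimensions, and greedy decoder are illustrative assumptions, and this is not the released Coll2Comm code.

```python
import torch
import torch.nn as nn

class Coll2Comm(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=128, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        """Teacher-forced training pass: returns logits over the target vocab."""
        _, h = self.encoder(self.src_emb(src_ids))        # h: (1, B, hidden)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_out)                           # (B, T, tgt_vocab)

    @torch.no_grad()
    def translate(self, src_ids, sos_id, eos_id, max_len=20):
        """Greedy decoding of a single colloquial sentence into a command."""
        _, h = self.encoder(self.src_emb(src_ids))
        tok = torch.tensor([[sos_id]])
        result = []
        for _ in range(max_len):
            dec_out, h = self.decoder(self.tgt_emb(tok), h)
            tok = self.out(dec_out[:, -1]).argmax(-1, keepdim=True)
            if tok.item() == eos_id:
                break
            result.append(tok.item())
        return result

# Toy usage with random weights; real use requires training on the ~20k
# colloquial/command sentence pairs described above.
model = Coll2Comm(src_vocab=1000, tgt_vocab=500)
print(model.translate(torch.tensor([[5, 42, 7]]), sos_id=1, eos_id=2))
```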

3.3 Action Module

The main role of this module is to navigate the robot indoors. We integrate the sensor data of the commercial robot based on ROS [Quigley et al., 2009] to perform navigation (mapping, localization, and planning). Because the hardware cannot be modified, the system becomes unstable unless the navigation-related parameters are fine-tuned. There are various navigation-related parameters, but our experimental results show that the most important ones are as follows. i) Map inflation: this parameter increases the size of obstacles on the map. Increasing it helps avoid obstacles, but if it is too large, the map is perceived as having no gaps and navigation planning fails. ii) Robot radius: this parameter specifies the radius of the robot; controlling the robot's size plays a role similar to map inflation. iii) Max velocity: this parameter determines the maximum speed of the robot. Increasing it shortens task execution time but destabilizes the localization that tracks the robot's position. iv) Planning weight: a parameter that balances the distance to the goal against the global plan when computing the local plan over short distances. If the environment is complicated, navigating along the shortest path fails to reach the target, because the robot cannot escape from between obstacles. v) Reflex speed: when the robot meets an obstacle, it stops navigating and backs up in its previous direction to avoid the obstacle; this parameter indicates the distance to back up. If it is too large, the robot cannot pass through narrow paths; if it is too small, obstacle avoidance fails. We determined the optimal values of these parameters heuristically. A detailed study of the navigation-related action module was made in Lee's study [Lee et al., 2018]. In the Experiments section that follows, we compare the value set used in this system with the default values of all these parameters. In addition, experiments integrating the vision module show that this system solves complex problems that occur while tasks are performed in a real environment.
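A minimal sketch of applying such a tuned parameter set through the ROS parameter server with rospy; the parameter keys and values below are illustrative assumptions that mirror the five conceptual parameters above, and the actual keys depend on the costmap and local-planner configuration in use.

```python
# Illustrative tuning of navigation-related parameters via the ROS parameter
# server. Keys/values mirror the conceptual parameters discussed above and
# are assumptions, not the values used in the paper.
import rospy

NAV_PARAMS = {
    "/move_base/local_costmap/inflation_radius":   0.45,  # i) map inflation
    "/move_base/local_costmap/robot_radius":       0.30,  # ii) robot radius
    "/move_base/local_planner/max_vel_x":          0.35,  # iii) max velocity
    "/move_base/local_planner/path_distance_bias": 0.8,   # iv) planning weight
    "/move_base/recovery/backup_distance":         0.20,  # v) "reflex" distance
}

def apply_nav_params(params=NAV_PARAMS):
    for key, value in params.items():
        rospy.set_param(key, value)
        rospy.loginfo("set %s = %s", key, value)

if __name__ == "__main__":
    rospy.init_node("vlas_nav_tuning")
    apply_nav_params()
```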

The sub-tasks implemented on top of navigation, as described above, are as follows: i) go to waypoint, ii) approach a person, and iii) social action/gesture.

In this study, we experimented with animating gestures and social actions during a conversation. We developed the language module to animate an appropriate gesture while the conversation is in progress. This was built as a rule-based model that animates a predefined gesture appropriate to the context of the sub-task. In the Experiments section, we show user satisfaction with the robot gestures as a quantitative evaluation.
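A minimal sketch of such a rule-based gesture selector, assuming a table that maps sub-task context to predefined animation names; the gesture names and the robot.play_animation / robot.say interface are hypothetical placeholders.

```python
# Illustrative rule-based gesture selection; the context-to-gesture table and
# the robot interface calls are hypothetical placeholders.
import random

GESTURE_RULES = {
    "greeting":     ["wave", "bow"],
    "taking_order": ["nod", "present_menu"],
    "confirmation": ["nod", "thumbs_up"],
    "small_talk":   ["shrug", "clap", "shake_hands"],
}

def select_gesture(context):
    """Pick a predefined animation appropriate to the current sub-task."""
    return random.choice(GESTURE_RULES.get(context, ["idle"]))

def speak_with_gesture(robot, text, context):
    robot.play_animation(select_gesture(context))   # hypothetical robot API
    robot.say(text)                                  # hypothetical robot API
```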

3.4 Perception Module

This module accepts the data processed by the vision, language, and action modules, performs decision making, and issues commands back to the modules.


Figure 3: A pseudo-home environment for SSCR experiments.

This is an initial study, in which data from multiple modules are integrated and decisions are made with a deterministic model and rule-based algorithms. For this module, we referred to Choi's study², which is disclosed as an open-source API.

3.5 Memory Module

Most of the data input to the perception module is stored in the memory module. The current system stores information repeatedly used in language-module tasks, person description information, context words, and the map information needed for robot actions.
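A minimal sketch of an attribute-value store of this kind, assuming entries are dictionaries tagged with their source module and a timestamp; the interface is illustrative, not the released implementation.

```python
# Illustrative attribute-value memory store shared by the modules.
import time

class Memory:
    def __init__(self):
        self.entries = []    # each entry: dict of attribute-value pairs

    def store(self, source, **attributes):
        entry = dict(attributes, source=source, timestamp=time.time())
        self.entries.append(entry)
        return entry

    def recall(self, **conditions):
        """Return every stored entry matching all given attribute-value pairs."""
        return [e for e in self.entries
                if all(e.get(k) == v for k, v in conditions.items())]

memory = Memory()
memory.store("vision", attr="person", name="James", pose="sitting")
memory.store("language", attr="order", name="James", item="coke can")
print(memory.recall(name="James", attr="order"))
```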

4 Experiments

In this section, the experiments are performed in a lab configured as a home environment, as shown in Figure 3. The environment measures 10 m × 7 m and was built for this study. In all experiments, the system ran on a remote GPU machine with an Intel i7-7700 CPU, 32 GB RAM, a Titan Xp with 12 GB VRAM, and Linux as the OS.

4.1 Vision/Action Experiments

Table 1: Performance results according to sub-tasks of the vision module

Sub-task                 | Avg. response time (sec) | Max. # of detected targets
Obj. detection           | 0.06                     | 20+ objects
Description              | 0.05                     | 5 objects
Waving/Sitting detection | 0.05                     | 5+ people

First, we conducted experiments on the vision module alone; the results are shown in Table 1. This is the result of testing three sub-tasks in the lab environment. The experimental platform, Pepper, is equipped with a camera with a maximum frame rate of 30 FPS. The resolution that could be kept stable was 320×240. However, considering the communication speed, we lowered the frame rate in this system: the vision module operated at 15 fps in the lab environment with a stable communication speed, but at 8 fps in a public environment where the communication speed was not stable.

The average response time in the table was measured by inputting a single image into the system and timing how long it took to obtain an output.

² https://github.com/gliese581gg/IPSRO

Figure 4: Results of the vision/action module integration.

The column named maximum number of detected targets shows the maximum output of each sub-task for the input image. It demonstrates that almost all of the detected targets captured at 320×240 resolution can be processed. The number of detected targets for the description sub-task is 5 because the sub-task is only used to describe the five boxes with the highest confidence among the bounding boxes classified as persons. The average response time results show that the model is fast enough to process frames within 15 fps (0.067 seconds per frame) in the lab environment.

To show the robustness of HIP toward various real-world objects, we prepared 26 objects for the vision module: bowl, plate, soup container, body cleanser, hair spray, moisturizer, shampoo, chopsticks, fork, spoon, Gatorade, Coke, canned coffee, green tea, apple, orange, bread, corn, onion, radish, candy, chewing gum, two kinds of cup noodles, potato chips, and jelly. About 85% of them formed bounding boxes; the four that did not are plate, bread, chewing gum, and corn, which were too small or too specific in shape. Moreover, most of the objects that successfully formed a bounding box were detected under broad labels that had to be changed. For example, radish was recognized as a cup by the vision module, but its label was modified and redirected to 'radish' by HIP. As a result, 22 objects could be recognized as individual labels, which shows the robustness of the vision module alone toward various real-world objects.

Second, we conducted experiments with the vision and action modules integrated. Detection sub-tasks are usually performed during a move, so the action module and the vision module are needed at the same time. i) Object detection sub-task with changing rotation speed (the action module): ten of the 26 objects mentioned above are placed on a desk, and Pepper rotates at a constant speed (degrees/sec) in front of the desk to see whether object recognition is maintained. The experiment was conducted by varying the rotation speed of the robot, as on the x-axis of Figure 4, which shows the average percentage of objects recognized per trial; 20 trials were performed per rotation-speed setting. ii) Navigating sub-task with changing rotation speed: we performed 20 navigation trials visiting five waypoints in order in the laboratory and counted the number of successful trials.

Figure 4 shows the object detection and navigation performance according to the robot's rotation speed. The slower the rotation speed, the higher the success rate of the detection/navigation sub-tasks, but this eventually slows the execution of the whole scenario.


Figure 5: Results of the language/action module integration.

Performance drops rapidly when the rotation speed exceeds a certain level. It is necessary to find the trade-off point between the vision module and the action module and then set the rotation speed. For example, if the SSCR is upgraded and the vision/action modules become faster or better, then increasing the rotation speed will reduce the overall time needed to solve the scenario.

4.2 Language/Action Experiments

Figure 5 shows the speech recognition and gesture-related results. In the lab environment, we asked subjects to read 20 sentences to Pepper and receive its response sentences with gestures. Speech recognition accuracy was calculated by counting the sentences successfully recognized within three attempts. The average time consumed is the average time taken when the same speech recognition is attempted without limiting the number of trials. At the end of each experimental trial, we surveyed the subjects' satisfaction with the robot interaction; a response above three points on a five-point scale questionnaire about the robot's speech recognition performance was regarded as satisfactory. The subjects were 20 males and 10 females, all Asian. We divided the subjects in half, conducted the interaction experiment with and without gestures, and surveyed their satisfaction independently. In the gesture condition, one of the predefined gestures in the Pepper API was randomly animated every time a speech was recognized; for example, gestures such as clapping and shaking hands were animated.

We set up four experimental conditions: i) default, with no context words; ii) static context words, provided all at once; iii) dynamic context words (DCW), changing the words dynamically depending on the situation; and iv) DCW with an added time delay of 2 seconds, for comparison with the DCW setting. We designed this setup on the assumption that differences in response time would be related to interaction satisfaction.

The results show that as the accuracy of speech recognition increases, the required time decreases and the satisfaction with the interaction increases. Even if the accuracy is high, satisfaction drops dramatically when the reaction time is prolonged. Satisfaction increases when gestures are included in the interaction, but this tendency disappears as the response time increases.

Figure 6: Cocktail party scenario execution time.

In the quantitative evaluation of the Coll2Comm sub-module, BLEU [Papineni et al., 2002], a standard quantitative metric in machine translation, improved from 66.3 for the baseline statistical machine translation to 87.3. Also, the vocabulary size of the executable command statements is 144 with the vanilla API, whereas this sub-module can cope with 616 word combinations, more than four times as many.
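A minimal sketch of how a BLEU comparison of this kind can be computed with NLTK's corpus_bleu; the token sequences are toy examples, and the resulting scores are unrelated to the 66.3 and 87.3 reported above.

```python
# Illustrative corpus-level BLEU computation with NLTK (toy data only).
from nltk.translate.bleu_score import corpus_bleu

references = [[["go", "to", "the", "kitchen", "and", "bring", "a", "fruit"]]]
baseline   = [["go", "to", "the", "kitchen", "bring", "fruit"]]
coll2comm  = [["go", "to", "the", "kitchen", "and", "bring", "a", "fruit"]]

print("baseline BLEU :", 100 * corpus_bleu(references, baseline))
print("coll2comm BLEU:", 100 * corpus_bleu(references, coll2comm))
```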

4.3 Module Integration Experiments

Figure 6 shows the time spent performing the cocktail party scenario as each module is added. Although all three modules are integrated to perform the scenario, the execution time differs greatly when modules in their default state are integrated. The default state of each module is as follows: i) vision module: it performs sub-tasks using the person detection function of the NaoQi API, the basic API provided by the manufacturer of Pepper; ii) language module: context words are not used; iii) action module: the five parameters mentioned above are set to their initial values. As a result, the execution time increases sharply in the experiments where the vision module is set to its initial state. The best result is obtained when all module settings are optimal. However, much research is still needed to catch up with human performance.

5 Conclusion

In this paper, we proposed an integrated system that overcomes the hardware weaknesses of a mobile SSCR in a home environment where socializing with people is important. The vision-language-action integrated system was proposed and its performance proved through various experiments. The system code was also released to contribute to the development of SSCR. However, many parts of the system are implemented as deterministic models to make the system operate in a real environment. As future work, we expect learning-based models to substitute for the deterministic models that cover most of the current system. It should also be possible to introduce a reinforcement learning framework into this system or to extend it to more complex robot action sequence planning using the Coll2Comm sub-module.

Acknowledgments

This work was partly supported by the Institute for Information Communications Technology Promotion (R0126-16-1072-SW.StarLab), the Korea Evaluation Institute of Industrial Technology (10044009-HRI.MESSI, 10060086-RISF), and the Agency for Defense Development (UD130070ID-BMRR) grants funded by the Korea government (MSIP, DAPA).


References

[Brooks, 1986] Rodney A. Brooks. A Robust Layered Control System for a Mobile Robot. IEEE Journal on Robotics and Automation, 2(1):14–23, 1986.

[Cao et al., 2017] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In CVPR, 2017.

[Gordon et al., 2015] Goren Gordon, Samuel Spaulding, Jacqueline Kory, and Jin Joo. Affective Personalization of a Social Robot Tutor for Children's Second Language Skills. In AAAI, 2015.

[Johnson et al., 2016] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.

[Lee et al., 2018] Beom-Jin Lee, Jinyoung Choi, Chung-Yeon Lee, Kyung-Wha Park, Sungjun Choi, Cheolho Han, Dong-Sig Han, Christina Baek, Patrick Emaase, and Byoung-Tak Zhang. Perception-Action-Learning System for Mobile Social-Service Robots Using Deep Learning. In AAAI, 2018.

[Liao et al., 2015] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. Person Re-identification by Local Maximal Occurrence Representation and Metric Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2197–2206, 2015.

[Maes and Brooks, 1990] Pattie Maes and Rodney A. Brooks. Learning to Coordinate Behaviors. In AAAI, volume 90, pages 796–802, 1990.

[Nesnas et al., 2003] Issa A. Nesnas, Anne Wright, Max Bajracharya, Reid Simmons, Tara Estlin, and Won Soo Kim. CLARAty: An Architecture for Reusable Robotic Software. In Unmanned Ground Vehicle Technology V, volume 5083, pages 253–264, 2003.

[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[Quigley et al., 2009] Morgan Quigley, Josh Faust, Tully Foote, and Jeremy Leibs. ROS: An Open-Source Robot Operating System. In ICRA Workshop on Open Source Software, 2009.

[Redmon and Farhadi, 2016] Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242, 2016.

[Siepmann et al., 2014] Frederic Siepmann, Leon Ziegler, Marco Kortkamp, and Sven Wachsmuth. Deploying a Modeling Framework for Reusable Robot Behavior to Enable Informed Strategies for Domestic Service Robots. Robotics and Autonomous Systems, 62(5):619–631, 2014.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[Wachsmuth et al., 2015] Sven Wachsmuth, Dirk Holz, and Javier Ruiz-del-Solar. RoboCup@Home – Benchmarking Domestic Service Robots. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 4328–4329, 2015.

[Zender et al., 2007] Hendrik Zender, Patric Jensfelt, Óscar Martínez Mozos, Geert-Jan M. Kruijff, and Wolfram Burgard. An Integrated Robotic System for Spatial Understanding and Situated Interaction in Indoor Environments. In AAAI, volume 7, pages 1584–1589, 2007.