
COP-MAN — Perception for Mobile Pick-and-Place in Human Living Environments

Michael Beetz, Nico Blodow, Ulrich Klank, Zoltan Csaba Marton, Dejan Pangercic, Radu Bogdan Rusu
Intelligent Autonomous Systems, Computer Science Department,

Technische Universität München
Boltzmannstr. 3, Garching bei München, 85748, Germany

{beetz, blodow, klank, marton, pangercic, rusu}@cs.tum.edu

Abstract— While many specific perception tasks have been addressed in the context of robot manipulation, the problem of how to design and realize comprehensive and integrated robot perception systems for manipulation tasks has received little attention so far. In this paper, we describe and discuss the design and realization of COP-MAN, a perception system that is tailored for personal robots performing pick-and-place tasks, such as setting the table, loading the dishwasher, and cleaning up, in human living environments. We describe our approach to decomposing and structuring the perception tasks into subtasks in order to make the overall perception system effective, reliable, and fast.

Distinctive characteristics and features of COP-MAN include semantic perception capabilities, passive perception, and a knowledge processing interface to perception. The semantic perception capabilities enable the robot to perceive the environment in terms of objects of given categories, to infer functional and affordance-based information about objects, and to geometrically reconstruct objects and their parts for grasping. Passive perception allows for real-time coarse-grained perception of the dynamic aspects, and the knowledge processing interface to perception enables the robot to query the information it needs, which is then automatically acquired through active perception routines.

I. INTRODUCTION

We investigate the realization of a household robot assistant, a mobile personal robot that can perform daily pick-and-place tasks in kitchen settings. The robot is to set the table, to load the dishwasher, and to clean up. We restrict ourselves to the performance of pick-and-place tasks for rigid objects of daily use including cups, bottles, plates, and bowls. The pick-and-place tasks include actions such as opening and closing cupboards and drawers.

Fig. 1. Mobile manipulation platform for the household assistant. The sensor head mounted on the pan-tilt unit is depicted in the middle.

Our primary research goal is the achievement of generality, flexibility, reliability, and adaptability in everyday manipulation tasks. These properties are tackled in various ways. First, robots are enabled to install themselves in new environments by automatically acquiring a model of the static objects and structures in the environments. Second, robots are equipped with means to use abstract information from the World Wide Web, such as models from the Google 3D Warehouse, images from search engines, and instructions from “how-to” web pages, as resources both for learning how to achieve new tasks and for optimizing old ones.

The realization of such task achievement competencies requires that we equip robots with the necessary perceptual capabilities. The robots have to detect, recognize, localize, and geometrically reconstruct the objects in their environments in order to manipulate them competently. They have to interpret the sensor data they receive in the context of the actions and activities they perform. For example, in order to get a cup out of the cupboard, the robot has to find the door handle to open the cupboard.

In this paper we outline the perception system COP-MAN (COgnitive Perception for MANipulation), which we are currently designing and implementing as the cornerstone of everyday pick-and-place tasks in kitchen settings.

COP-MAN performs two primary perceptual tasks. First, it enables the acquisition of a model representing the static part of the environment. This model contains structural components of the environment such as walls, floors, ceilings, doors, furniture candidates (e.g., cupboards, shelves, drawers), kitchen appliances, horizontal supporting planes (in particular, tables), etc. Second, COP-MAN perceives manipulable objects and dynamic scenes in the environment. This task includes scene interpretation, localization and recognition of task-relevant objects, inference of the possible object roles, and the reconstruction of objects from partial views into models suitable for pick-and-place tasks.

The main contribution of this paper is the design and implementation of a comprehensive perception system for robot pick-and-place tasks for everyday objects in human living environments. Using the perception system, the robot is roughly aware of the dynamic environment state, that is, of the things on tables and kitchen counters, without knowing exactly where they are, what they are, and what form they have. If needed, the robot can classify, localize, and geometrically reconstruct object hypotheses. In addition, the robot is able to interpret scenes and infer missing or misplaced items on a set table.

The remainder of the paper is organized as follows: an overview is given in the next section, followed by the description of the Semantic 3D Object Perception Kernel in Section III. The functional modules of COP-MAN are presented in Section IV, and their adaptation to the environment in Section V. We conclude and discuss our future goals in Section VI.

II. OVERVIEW OF THE PERCEPTION SYSTEM

Let us now give an overview of COP-MAN by first introducing its sensor components and then giving a functional view of the software system.

To perform its perception tasks, the robot is equipped with a suite of sensors (see Figure 1): (1) A tilting 2D laser scanner is mounted into the torso in order to provide 3D point cloud data for the parts of the environment in front of the robot. The point cloud data is acquired continuously while the robot is in operation, and a dynamic obstacle map is constructed [1] to account for changes in the world and avert collisions of the robot with the environment. The second main application is to provide the data basis for the interpretation of the environment state (see Subsection IV-A). (2) A sensor head mounted on a pan-tilt unit includes a pair of high-resolution color cameras, a Time-Of-Flight camera providing coarse-grained and rather inaccurate, but fast 3D depth information, and finally a stereo-on-a-chip camera system providing fast but low-resolution stereo image processing functionality.1

Fig. 2. Block diagram of the perception system. (The diagram shows the acquisition and update of static environment models feeding the static 3D semantic object model, the passive scene perception and task-directed object and scene perception components feeding the dynamic object knowledge base, the query component, and example object entries listing identifier, type, table, position, point source, and model type.)

Given this sensor equipment, the individual sensors take over the following roles. The tilting laser scanner is our primary passive sensor, which provides the robot with continual updates about the relevant regions of the kitchen, in particular the table tops. The active sensor head is used for examining particular task-relevant regions of interest and for perception-guided manipulation.

1 We only address the sensors that are needed for perceiving objects and scenes. We do not discuss the sensors for navigation or for grasping and holding objects.

The functional view of the perception system is depicted in Figure 2. We will first detail the main data structures and models used by the perception system and then describe the role of the functional modules operating on these models.

A. Data Structures and Models

The perception system feeds and uses two main model bases: first, the static 3D semantic object model of the environment, which is displayed in the upper part of Figure 2, and second, the dynamic object knowledge base, which contains geometric and appearance models and positional information about the objects of daily use.

a) Static 3D Semantic Object Model: The static 3D semantic object model of the environment contains the representation of rooms and doors, structural parts such as ceiling, walls, and floor, and the pieces of furniture including cupboards, table tops, and appliances and their parts. A cupboard, for example, is represented as a cuboid container with a front door, a hinge, and a fixture used for opening it (e.g., a handle).

The static environment model is generated automatically through the mapping module described in Section IV-A. The resultant model is then stored using an XML-based markup language for CAD models of human living environments. The XML specification can then be used to generate environment models for 3D physics-based robot simulators such as Gazebo [2].

The XML representation is encoded in a specific XML extension called OWL (Web Ontology Language). OWL is a description-logics-based knowledge representation language that enables us to define taxonomies of concepts. For example, we define that a cupboard is a container and has walls, floor, ceiling, and a door as its structural parts. Besides the door body, the door has a handle and a hinge. Using the knowledge stored in the taxonomy, the robot can infer that the purpose of the cupboard is to keep objects inside, because a cupboard is a container. By stating that a perceived entity is a cupboard, all assertions about cupboards and their generalizations apply to the perceived entity and can be queried by the robot.
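To illustrate the kind of subsumption-based inference described above, the following minimal Python sketch encodes a toy is-a taxonomy; it is not the OWL tooling used by COP-MAN, and the concept and purpose entries are illustrative only:

    # Toy taxonomy with is-a subsumption (illustrative, not the actual OWL model).
    SUBCLASS_OF = {"Cupboard": "Container", "Container": "FurniturePart"}
    PURPOSE = {"Container": "keep objects inside"}

    def ancestors(concept):
        """Yield the concept and all of its generalizations."""
        while concept is not None:
            yield concept
            concept = SUBCLASS_OF.get(concept)

    def inferred_purpose(concept):
        """Return the purpose asserted for the concept or its closest ancestor."""
        for c in ancestors(concept):
            if c in PURPOSE:
                return PURPOSE[c]
        return None

    # A perceived entity asserted to be a cupboard inherits the container purpose.
    assert inferred_purpose("Cupboard") == "keep objects inside"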

b) Dynamic Object Knowledge Base: The dynamic object knowledge base contains information about the objects that are to be manipulated by the robot: the objects on the table, the ones on the counter, and the ones in the cupboard where objects have to be picked up (see Figure 3). The information about the objects includes positional information, information about their shape, etc.

The information about objects is provided at different levels of abstraction. Objects can be represented as raw data such as clusters of points, or as abstract geometric descriptions such as a cylinder with a handle, for example.

Fig. 3. Dynamic table scenes (a and b) and different object representations: classified surface types (c), hybrid representation (d), and CAD model (e).

The object hypotheses represented with raw data are asserted automatically by the perception system, while abstract information is generated by actively applying sensor data interpretation routines to the corresponding raw data.

B. Functional Modules

There are four functional components that operate on the static environment model and the dynamic object model base. We will briefly outline their role in the perception system below, and detail their description in Section IV.

The mapping system (Subsection IV-A) for the static aspects of the environment takes laser range scans as its input and computes a semantic object model of the environment.

The passive scene perception (Subsection IV-B) uses the continual scanning mode of the tilting laser scanner to update the dynamic object model base with object hypotheses extracted from the scans. The object hypotheses are stored as raw point cloud regions, where each point in the cloud region is expected to correspond to the same object or object group.

Task-directed object and scene perception (Subsection IV-C) serves two purposes. First, it computes the information needed by the control routines for deciding on the right course of action (e.g., which grasp to take) and for inferring the appropriate action parameterizations (where to put the contact points of the fingers). The second purpose is to examine the object hypotheses generated by the passive scene perception in order to produce more informative and abstract object and scene descriptions. The result of task-directed object and scene perception is typically a refinement of the respective representations.

The last component of the perception system is the query component. Using the query component, we can send queries to the static environment and the dynamic object model using an interactive graphical user interface, as well as using interface routines that are provided by the perception system's API. We can also query object properties that are not yet available in the model bases. In this case, the respective queries trigger active perception processes as described in Section IV-D.

Fig. 4. Interactive query interface for the static environment model. The user asked for the cupboards in the environment and the red boxes are returned as the query result.

III. THE SEMANTIC 3D OBJECT PERCEPTION KERNEL

COP-MAN is implemented on top of the Semantic 3D Object Perception Kernel, which includes libraries of data structures and models for sensor data interpretation. Programmers can use the libraries in order to build their own domain- and task-specific perception systems as processing pipelines that make use of the data structures and functions provided by the library.

A. Data Structures and Representations

The main data structures and models provided by the perception kernel are points and point clouds and their representations, in particular Point Feature Histograms [3] and various surface and volume representations. The remainder of this section will sketch these models and explain their usage.

1) Point Clouds: represent the raw data structures produced by range sensing devices. In COP-MAN we consider point clouds to be unorganized sets of points p_i = (x_i, y_i, z_i) ∈ P, possibly including additional information such as intensity, r, g, b values, etc. Their 3D positions are computed with respect to the origin of a fixed coordinate system, and their values are sampled on or near a surface M present in the real world. The purpose of point cloud interpretation is to find models M′ that approximate or explain M and are informative for the robot control systems.

2) Representing Point Clouds: Our perception system uses specific representations of points in point clouds that enable and facilitate information extraction, surface reconstruction, and object recognition. The representation of points includes information about the local surface neighborhood, whether the point together with its neighborhood is characteristic for important surface categories (e.g., "point on plane"), the role of the point in its surface neighborhood (e.g., edge between two planes), the information content of the point (e.g., a point in the middle of a plane is not informative), whether points are distinctive with regard to finding them in the point cloud, etc.


Some important requirements and objectives for point representations are that they be view independent, robust against noise, and very fast to compute.

3) Point Feature Histograms: The geometry of the point clouds can be described locally by analyzing the different configurations of surface normals in a surface patch. By estimating the surface normals at each measurement point based on the neighborhood it forms with nearby points, four values can be measured between each two point-normal pairs (the point-to-point distance, and three angles measured between the two normals and the direction vector), and in each neighborhood, these values can be combined into a histogram as detailed in [4]. We call these descriptions of 3D features Point Feature Histograms.
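For illustration, the following Python sketch computes the distance and the three angular values for a single pair of oriented points in a local Darboux-like frame; it follows one common formulation of these features, omits the histogram binning over the neighborhood, and is not the kernel's actual implementation:

    import numpy as np

    def point_pair_features(p_s, n_s, p_t, n_t):
        """Distance and three angles between two point-normal pairs,
        measured in a frame (u, v, w) attached to the source point.
        Sketch only: assumes unit normals and that the connecting line
        is not parallel to n_s; the histogram binning is omitted."""
        delta = p_t - p_s
        d = np.linalg.norm(delta)        # point-to-point distance
        u = n_s                          # first frame axis: source normal
        v = np.cross(delta / d, u)       # second axis: orthogonal to u and the direction vector
        v /= np.linalg.norm(v)
        w = np.cross(u, v)               # third axis completes the frame
        alpha = np.dot(v, n_t)                               # angle between v and the target normal
        phi = np.dot(u, delta / d)                           # angle between u and the direction vector
        theta = np.arctan2(np.dot(w, n_t), np.dot(u, n_t))   # in-plane angle of the target normal
        return d, alpha, phi, theta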

This method of building histograms is a language that we can use to describe, learn, and compare surface types, and that adapts to different means of acquisition. By comparing the histograms using different distance metrics, we are able to detect the most dominant and most specific surface types, and use this information for segmentation and key feature detection. Also, by comparing them to signatures obtained from known shapes, we can classify the underlying surfaces at each point.

To reduce the computational complexity of determining Point Feature Histograms (PFH), we have developed simplified [5] and Fast Point Feature Histograms (FPFHs) [3].

4) Surface and Volume Representations: The Semantic 3D Object Perception Kernel is hybrid in the sense that it provides a variety of alternative surface and volume representations (see Figure 5): points, triangle meshes, geometric shape coefficients, and 2D general polygons.

The different representations are used for the following purposes:

• PCDs are the raw data that the robot acquires, either from laser or time-of-flight cameras, and are used for building the higher-level representations of the environment;

• voxel/triangular mesh representations are used for collision and visibility checks, since they also capture the connections between the different measurement points;

• labeled and segmented PCDs group the raw points into regions that represent parts of different objects;

• polygonal structures (affordance representation):
  – handles and knobs are detected in the vicinity of vertical planar structures that are furniture candidates, and are approximated by linear/cylindrical and disk/spherical models, respectively;
  – cuboids are formed from the furniture candidate faces by approximating their depth using their projection onto the closest parallel wall;
  – planar polygons (tables, walls, ceiling, and doors) are formed by connecting the points on the boundaries of these regions;

• geometric primitives are used to approximate the different objects located on horizontal planar structures;

• partial and completed models are needed for planning a grasp of the objects for manipulation.

5) CAD and Appearance Models for Object Recognition and Localization: To facilitate visual object recognition and pose estimation, the Semantic 3D Object Perception Kernel uses CAD models of objects for predicting the appearance of the geometric shape in an RGB camera image. Given accurate CAD models of objects, we can recognize and localize the objects by their shape in images. High accuracy is particularly important for sharp geometric edges.

Because the generation of tailored CAD models by humans is tedious and their automatic generation difficult, COP-MAN has mechanisms to retrieve CAD models from CAD libraries, such as the Google 3D Warehouse, and to adapt them as needed through model morphing [6]. Another method we are currently investigating is the learning of 3D appearance models of objects, which are to include color models, sets of point features, or even complete visual reconstructions. The stored information consists of point descriptors that are variations of [7] or [8]. Given several good RGB views annotated with 3D information, enough information can be reconstructed to render the object. For this we need color or texture information for every face of an underlying triangulated mesh. This can be extracted even without perfect registration of 3D and 2D by an optimization process.

B. Interpretation/Abstraction Mechanisms

Besides data structures and models, the Semantic 3D Object Perception Kernel provides a number of functions that take COP-MAN representations as their input and transform them into other representations which are often more abstract and informative than the original ones.

Examples of such functions that are provided by our perception kernel are the following:

• Planar decomposition: returns the planar segments that are perpendicular to a given direction;

• Region segmentation: performs region growing on a planar segment, stopping at sudden curvature or intensity changes;

• Boundary detection: returns the thick boundaries of regions;

• Rectangle matching: four pairwise perpendicular lines are fit to the boundary points;

• Clusters on planar areas: connected clusters with footprints on the planar regions;

• Fitting shape primitives: fixtures and objects on the tables are decomposed into shape primitives;

• Functional reasoning: the segmentation is refined based on the number and position of fixtures; additional splitting lines are fit if necessary.

Horizontal planar substructures are interpreted as supporting planes for dynamic objects. This significantly reduces the space of possible positions for unlocalized dynamic objects.
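As a rough illustration of the planar decomposition idea (not the kernel's octree-based implementation), the sketch below keeps points whose estimated normals are aligned with a given direction and groups them by their offset along that direction; for horizontal supporting planes the direction would be the vertical axis, and all thresholds are illustrative:

    import numpy as np

    def planar_decomposition(points, normals, direction, angle_tol_deg=10.0, dist_bin=0.02):
        """Group points into candidate planes perpendicular to `direction`.
        Simplified sketch: select points whose normals are (anti)parallel to the
        direction, then bin them by their signed offset along it."""
        d = direction / np.linalg.norm(direction)
        cos_tol = np.cos(np.radians(angle_tol_deg))
        aligned = np.abs(normals @ d) > cos_tol          # normal roughly parallel to direction
        offsets = points[aligned] @ d                    # signed distance along the direction
        bins = np.round(offsets / dist_bin).astype(int)  # coarse grouping into plane candidates
        planes = {}
        for idx, b in zip(np.flatnonzero(aligned), bins):
            planes.setdefault(b, []).append(idx)
        return {b: np.array(ix) for b, ix in planes.items()}  # offset bin -> point indices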

C. Using the Perception Kernel: Building Task-specific Processing Pipelines

Using the Semantic 3D Object Perception Kernel, one can select the appropriate data structures, models, and functions, in order to combine them into task-specific perception pipelines. An example of such a perception pipeline is COP-MAN's pipeline for building static environment models, which is sketched in the next section.

Fig. 5. Pipeline used to build the static 3D semantic object model: acquire scans (a), integrate point clouds (b), extract vertical (c) and horizontal (d) planes, identify region boundary points (e) and detect fixtures for furniture face candidates (f), search for connected clusters with footprints on tables (g) and fit shape primitives to them (h), refine furniture doors (i), and classify the furniture candidates (j).

IV. DESCRIPTION OF THE FUNCTIONAL MODULES

Let us now present the four functional modules of COP-MAN in greater detail.

A. Acquisition of Static Environment Models

Figure 5 presents COP-MAN's processing pipeline for the acquisition of static environment models. The first step, namely the integration of individual point cloud scans into the hybrid model, follows the geometrical processing pipeline described in [9], [10], and includes: statistical gross outlier removal, feature estimation for each point in the dataset, a two-step coarse-to-fine registration [5], and finally a local re-sampling of the overlapping areas between scans [9]. The result is an improved point data model with uniformly re-sampled 3D coordinates and greatly reduced noise. This constitutes the input to the Semantic Mapping component. These general geometric mapping topics are described in [5], [9], [10].
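The statistical outlier removal step can be pictured with the following Python sketch, which discards points whose mean distance to their k nearest neighbors is unusually large; the neighborhood size and threshold used here are illustrative, not the values from the actual pipeline:

    import numpy as np
    from scipy.spatial import cKDTree

    def statistical_outlier_removal(points, k=20, std_mul=1.0):
        """Drop points whose mean k-nearest-neighbor distance exceeds the
        global mean by more than std_mul standard deviations (sketch)."""
        tree = cKDTree(points)
        # each point is returned as its own nearest neighbor, so query k + 1
        dists, _ = tree.query(points, k=k + 1)
        mean_d = dists[:, 1:].mean(axis=1)
        threshold = mean_d.mean() + std_mul * mean_d.std()
        return points[mean_d < threshold]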

The key functions employed in the semantic mapping pipeline include the following:

• a highly optimized major planar decomposition step, using multiple levels of detail (LOD) and localized sampling with octrees;

• a region growing step for splitting the planar components into separate regions, using the region segmentation, boundary detection, rectangle matching, and functional reasoning kernel functions;

• a model fitting step for fixture decomposition, using the shape primitive fitting function;

• finally, a two-level feature extraction and classification step.

Additional information about the pipeline and its components can be found in [2].

B. Passive Perception System

When people go into the kitchen in order to get a glass of water, they can, most of the time, answer questions like “was the window open?” or “was the table already set for dinner?”. People subconsciously perceive their surroundings in order to be aware of the environment state. Having this kind of information, they do not need to look for everything they are asked about, but can rather recall the information in retrospect. Today's autonomous robots typically lack this powerful mechanism as a knowledge resource for performing their activities.

COP-MAN offers passive perception modules as a service that provides the robot with a continual information flow about specified dynamic aspects of the environment.

For example, a household robot might need to be aware of the dynamic aspects of the kitchen it operates in: the things on the table and counter, without necessarily knowing exactly where things are, what they are, and what their shape is. More detailed information has to be acquired on demand using active and task-directed perception mechanisms (see Section IV-C).

To achieve this task, we equip the robot with a passive perception module consisting of four parts: i) the PCD acquisition component, ii) a pipeline for processing and interpreting PCDs and generating representations, iii) a dynamic object store with an update mechanism, and iv) a logging mechanism for the dynamic object store.

For our example, this could mean the following. The PCD acquisition component makes a sweep with the laser scanner every n seconds, where a sweep usually takes 1-2 seconds. The resulting point clouds are then processed by the interpretation and abstraction pipeline. In the interpretation step we use the planar surfaces in the static environment model as our regions of interest. We then cluster points above the table into hypotheses of objects and object groups. Figure 6 shows the steps of this process.

Fig. 6. Left: detected table in a scan shown in brown. Right: highlighted hypotheses and table in the raw scan.
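A minimal version of the clustering step described above could look as follows in Python; it assumes a horizontal table plane at a known height and groups the points above it by fixed-radius connectivity, with all thresholds being illustrative rather than the system's actual parameters:

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def cluster_above_table(points, table_height, z_margin=0.01, max_height=0.5, radius=0.03):
        """Group points above a horizontal supporting plane into object or
        object-group hypotheses via fixed-radius connected components (sketch)."""
        above = points[(points[:, 2] > table_height + z_margin) &
                       (points[:, 2] < table_height + max_height)]
        if len(above) == 0:
            return []
        pairs = np.array(list(cKDTree(above).query_pairs(radius)))
        n = len(above)
        if len(pairs) == 0:
            adjacency = csr_matrix((n, n))
        else:
            adjacency = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n))
        n_clusters, labels = connected_components(adjacency, directed=False)
        return [above[labels == c] for c in range(n_clusters)]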

The object store then stores the hypotheses that the robot believes to be on the table. In general, maintaining a belief state about the dynamic objects in the environment is a very complex task which requires probabilistic tracking with object identity management and diagnostic reasoning. We start with a naive approach where we simply delete all the hypotheses from the object store that should be in the acquired laser scan and then add all hypotheses that were extracted from the new laser scan.2

The matching between saved and new hypotheses is done by volume intersection, where the voxelized representation can be exploited to find overlapping areas.

When a hypothesis is located in a previously unoccupied position, its points are saved along with its 2D coordinates relative to the supporting plane. In the case of previously seen clusters, the points in the new voxels are added, along with points from sparse voxels in the original representation.

To filter the clusters correctly, and to minimize the effect of occlusions, moving objects, and non-overlapping scans, the information that the current viewpoint gives has to be incorporated, by checking whether the voxels of the hypotheses lie in free, occupied, or occluded space.
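The volume-intersection matching of stored and newly extracted hypotheses can be sketched as an overlap test on occupied voxel sets; the voxel size and overlap threshold below are illustrative, and the viewpoint-dependent free/occupied/occluded check mentioned above is not included:

    import numpy as np

    def voxel_set(points, voxel_size=0.02):
        """Set of occupied voxel indices for a point cluster (sketch)."""
        return set(map(tuple, np.floor(points / voxel_size).astype(int)))

    def same_hypothesis(points_old, points_new, voxel_size=0.02, min_overlap=0.3):
        """Match two clusters by the overlap of their occupied voxels,
        measured relative to the smaller cluster (sketch)."""
        a = voxel_set(points_old, voxel_size)
        b = voxel_set(points_new, voxel_size)
        overlap = len(a & b) / max(1, min(len(a), len(b)))
        return overlap >= min_overlap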

C. Task-directed Perception

Task-directed perception is the part of perception that is needed by the robot in order to perform its primary tasks, such as setting the table and cleaning up. Typical task-directed perception actions include: the detection of the object to be manipulated, the task of recognizing and localizing it, building a geometric or an appearance model for it, etc.

COP-MAN uses a suite of task-directed perception pipelines implemented using the Semantic 3D Object Perception Kernel.

Scene perception for table setting. One task-directed perception pipeline [11] uses abstract web instructions imported from websites such as wikihow.com or ehow.com to interpret a table scene with respect to a given everyday activity. Using the specification, the pipeline can sometimes infer the regions of interest and a set of relevant object categories. These are then fed to a 3D CAD-based visual object detection algorithm [12], which returns 6D object poses. The instances of found objects and their poses are asserted to a factual OWL-like knowledge base.

2 More powerful belief update mechanisms are on our agenda for future research. Aspects of improvement are: handling occlusions and partial views due to changes in the environment; dealing with object identities if objects on the plane are moved; and refining hypotheses into objects and additional information given available computational resources.

Localizing known objects using a combination of Time-Of-Flight and color camera techniques. Another pipeline robustly fits CAD models in cluttered table setting scenes in real time for the purpose of grasping with a mobile manipulator. Our approach uses a combination of two different camera technologies, Time-Of-Flight (TOF) and RGB, to robustly segment the scene (e.g., supporting planes) and extract object clusters. Using an a-priori database of object models, we then again perform a CAD matching in 2D camera images.

Affordance-based perception of objects of daily use. The FPFHs presented in Section III-A.3 can be used to efficiently classify surfaces as concave, convex, and edge [3], as shown in the left part of Figure 7. Combinations of a concave and a convex part hint at the presence of a container, while edges are typically formed by handles or stems. This information can then be used to adjust the manipulation strategy.

Fig. 7. Classified surface types on the left: concave parts shown in red, convex ones in green, and stems/edges in blue. Classified object types on the right: mugs shown with blue background, bowls with pink, glasses with white, and wine/champagne glasses with green backgrounds.

Reconstructing objects. To facilitate efficient grasp planning, we use hybrid representations for objects, decomposing them into parts approximated by shape primitives where possible, and by triangular meshes elsewhere. To avoid the problems produced by triangular approximations of noisy data, these parts can be broken down and approximated by a collection of boxes or cylindrical parts. We assume that scenes are physically stable, thus there is a strong bias on object orientation when dealing with objects that are on supporting planes. We exploited this notion of standing/lying objects, together with an assumption of a vertical symmetry axis or plane, in our recent work [13], and we want to extend this to non-symmetric objects as well.
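As a toy illustration of the vertical-symmetry assumption (not the reconstruction and verification method of [13]), the following sketch completes a partial scan of a standing, rotationally symmetric object by replicating the measured points around an estimated vertical axis; taking the XY centroid as the axis is a crude, illustrative choice:

    import numpy as np

    def complete_rotational(points, n_steps=36):
        """Replicate a partial point cluster around an assumed vertical
        symmetry axis through its XY centroid (toy sketch)."""
        axis_xy = points[:, :2].mean(axis=0)
        offset = np.array([axis_xy[0], axis_xy[1], 0.0])
        centered = points - offset
        completed = []
        for angle in np.linspace(0.0, 2.0 * np.pi, n_steps, endpoint=False):
            c, s = np.cos(angle), np.sin(angle)
            rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
            completed.append(centered @ rotation.T)   # rotate about the vertical axis
        return np.vstack(completed) + offset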

Learning appearance models of objects. A localization in 3D makes it possible to extract the current appearance of the object from the camera view. With a larger set of appearances, the foreground can be extracted more accurately by a global optimization over the measured poses and the extracted appearance. The resulting appearance makes it possible to distinguish objects that were already used.

Object classification. We employ machine learning classifiers to label the furniture pieces in the static 3D semantic object model. By combining FPFHs into a Global FPFH that describes a complete cluster, we can classify different object types [14], as presented in the right part of Figure 7.

Another approach we take is to combine features from 3D data and images to classify objects in situations where reconstruction is not possible [15]. These approaches can be extended to take into account the relative positions between objects, and thus classify an arrangement of objects, improving the object classification at the same time by adding a bias for more probable arrangements.

D. Perception as a “Virtual” Knowledge Base

COP-MAN also allows the robot control system to view the perceived world as a knowledge base that contains information about objects and scenes. In this operation mode the control system can query COP-MAN for information. If the information is already available, then the answer is immediately returned. Otherwise COP-MAN initiates the necessary perception steps in order to compute the information from already sensed data on demand or to acquire it anew through active sensing.

In this operation mode, the dynamic object store of the passive perception system is automatically asserted to the knowledge base. Thus, for each object/object group hypothesis generated by the passive perception system, COP-MAN generates a unique identifier id for the object hypothesis and asserts the following facts: hypothesis(id), pcd-representation(id,r), position(id,[x,y,z]), and on(id,table-id).

Let us consider the following example. The robot control system needs to know the position of the yellow cup on the table. To check whether or not a yellow cup is there, and to extract its position, the following PROLOG-like query can be formulated:

?- type(Obj,cup), on(Obj,Tab), type(Tab,table), color(Obj,yellow), position(Obj,[X,Y,Z]).

Suppose we evaluate the query on a knowledge base that only contains the assertions made by the passive perception component. By evaluating the query on the knowledge base, we get instantiations of each predicate for each object hypothesis on the table. The only predicates that are not satisfied are the color and the type predicates. In order to check the validity of these statements, COP-MAN calls task-directed perception mechanisms on the point cloud data representation of each object hypothesis.

If the query is already satisfied by the dynamic knowledge base, then the perception system returns yes and binds the respective variables. No active perception is required in this case.
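The fallback from stored facts to active perception can be pictured with the following Python sketch; it is not the PROLOG interface of COP-MAN, and classify_cluster and estimate_color are hypothetical stand-ins for task-directed perception routines:

    facts = {}  # (predicate, object id) -> value, filled by passive perception and earlier queries

    def classify_cluster(pcd):
        """Hypothetical stand-in for a task-directed classification routine."""
        return "cup"

    def estimate_color(pcd):
        """Hypothetical stand-in for a color estimation routine."""
        return "yellow"

    def query(predicate, obj_id, pcd):
        """Answer a predicate for an object hypothesis, triggering active
        perception only when the fact is not yet in the knowledge base."""
        if (predicate, obj_id) in facts:          # already known: answer immediately
            return facts[(predicate, obj_id)]
        if predicate == "type":                   # otherwise compute it from the raw data
            value = classify_cluster(pcd)
        elif predicate == "color":
            value = estimate_color(pcd)
        else:
            return None
        facts[(predicate, obj_id)] = value        # assert the new fact for later queries
        return value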

Viewing the perceived world as a first-order knowledge base opens up exciting possibilities for making the robot more cognitive. Below we will give very brief examples for reasoning about scenes, and for combining perceptual information with background knowledge.

1) Querying scenes. Providing perceptual routines for spatial relations such as “left of”, “next to”, etc., we can assess complete scenes by comparing the spatial arrangements of objects on the table with the specified arrangement in web instructions. This way the robot can potentially infer missing objects on the table or objects that are not correctly arranged.

2) Combining perceived information with background knowledge. Asserting in a knowledge base that cups are drinking vessels and drinking vessels are containers, which are meant to contain fluids, the robot is able to recognize the objects that it can fill coffee into.

V. ENVIRONMENT ADAPTATION

The perception system should have initial capabilities but should also be able to adapt to an environment [16]. By knowing the specifics of the objects and the environment the robot is operating in, its perception methods can be made more precise, robust, and efficient. We want to point out four possible techniques to improve the available environment and object models.

A. Environment Adaptation

Designers often improve the performance of robots by specializing them. This is particularly important for human living environments, where the world as well as the objects in it are designed to be easy to manipulate and operate.

Thus, for our kitchen setting we can make the following assertions:

• All task-relevant objects are either in the hand of somebody or lying on top of supporting planes, that is, the state is physically stable [17], [18]. Also, there is a strong bias on object orientations.

• Supporting planes are horizontal planar surfaces above the floor and below the ceiling (tables, shelves), which are the relevant objects for pick-and-place tasks. Later we will include container objects such as cupboards, but also pots, boxes, rigid bags, and trays.

To simplify but also to enforce these assertions, one could, for example, consult and use the guidelines for ADA-compliant homes.

B. Acquisition of new CAD models

The system is confronted with the problem of new objects appearing in its environment that should be recognized and manipulated. Regarding CAD models for object recognition and localization, we see two major possibilities for adapting to certain environment changes: i) use semantic knowledge about objects to acquire CAD models from Internet databases and verify them visually, or ii) use 3D data and transform it into a CAD model. For good results, this would require a robot to actively acquire a large number of good-quality views of the target object.

C. Adapting CAD models by Morphing

Given several CAD models from external sources that approximate a real object well but not perfectly, we can try to improve them by interpolating between two such models using a technique called morphing, which we introduced for this purpose in [6] and described in more detail in [19]. Whether the new morphed model is better than the two original models can be verified visually by searching for the object using all three models in several views while observing the score of the match. The score is defined as the number of corresponding edge pixels between the expected edge projections and the measured image pixels, in relation to the number of expected edge pixels.
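The match score described above can be written down as a small Python sketch; the inputs are boolean edge maps of equal size, and the dilation tolerance is an illustrative detail, not part of the original definition:

    import numpy as np
    from scipy.ndimage import binary_dilation

    def edge_match_score(expected_edges, measured_edges, tol=1):
        """Fraction of expected (projected CAD) edge pixels that have a measured
        image edge pixel within a small tolerance (sketch of the score above)."""
        if tol > 0:
            # grow the measured edge map by `tol` iterations of binary dilation
            measured_edges = binary_dilation(measured_edges, iterations=tol)
        expected_count = expected_edges.sum()
        matched_count = np.logical_and(expected_edges, measured_edges).sum()
        return matched_count / max(1, expected_count)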

D. Specializing from Shape to Appearance

Given the occurrence of objects with the same or similar shape, the system adds an appearance description to the current object model. This requires that we already have information about one or all of the objects we want to distinguish. For example, two instances can sometimes be distinguished by a simple description like a global color model. Given similar lighting conditions for two objects, the color can be recognized as different enough by comparing the color histograms of the segmented regions containing the objects. The same can be inferred for an appearance model by trying to match an a priori learned point descriptor model to another object. The worse it matches, the more useful the newly learned model is.
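A global color model of the kind mentioned above might be as simple as a normalized hue histogram over the segmented region, compared with an L1 distance; the following Python sketch is illustrative only, and the bin count and threshold are assumptions:

    import colorsys
    import numpy as np

    def hue_histogram(rgb_pixels, bins=16):
        """Normalized hue histogram of a segmented region (rgb_pixels: Nx3 uint8).
        Using hue only is an illustrative choice to reduce lighting sensitivity."""
        hues = np.array([colorsys.rgb_to_hsv(*px)[0] for px in rgb_pixels / 255.0])
        hist, _ = np.histogram(hues, bins=bins, range=(0.0, 1.0))
        return hist / max(1, hist.sum())

    def histograms_differ(h1, h2, threshold=0.5):
        """Treat two objects as distinguishable when the L1 distance between
        their color histograms exceeds a threshold (illustrative value)."""
        return np.abs(h1 - h2).sum() > threshold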

VI. CONCLUSIONS AND OUTLOOK

The presented system provides the means for solving interesting perception-for-manipulation tasks through adaptation to the environment. COP-MAN can draw on the background knowledge provided by the model of the static part of the environment, and provides up-to-date information about the dynamic aspects necessary to perform pick-and-place tasks successfully. As a knowledge base, COP-MAN can be used to answer queries from the task executive about object locations, arrangements, and state changes, which trigger the execution of additional perception pipelines to fill in missing information.

While the presented pipelines work robustly for different kinds of sensors, the problem of sensor capabilities remains an open question.

It might be necessary, for example, to pick up an object based on a rougher model and move it closer to one of the sensors in order to obtain a good model of it. This will require an even deeper integration between the different systems.

ACKNOWLEDGEMENTS

This work is supported by the CoTeSys (Cognition for Technical Systems) cluster of excellence.

REFERENCES

[1] R. B. Rusu, I. A. Sucan, B. Gerkey, S. Chitta, M. Beetz, and L. E. Kavraki, "Real-time Perception-Guided Motion Planning for a Personal Robot," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), St. Louis, MO, USA, October 11-15, 2009.
[2] R. B. Rusu, Z. C. Marton, N. Blodow, A. Holzbach, and M. Beetz, "Model-based and Learned Semantic Object Labeling in 3D Point Cloud Maps of Kitchen Environments," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), St. Louis, MO, USA, October 11-15, 2009.
[3] R. B. Rusu, A. Holzbach, N. Blodow, and M. Beetz, "Fast Geometric Point Labeling using Conditional Random Fields," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), St. Louis, MO, USA, October 11-15, 2009.
[4] R. B. Rusu, Z. C. Marton, N. Blodow, and M. Beetz, "Learning Informative Point Classes for the Acquisition of Object Model Maps," in Proceedings of the 10th International Conference on Control, Automation, Robotics and Vision (ICARCV), Hanoi, Vietnam, December 17-20, 2008.
[5] R. B. Rusu, N. Blodow, and M. Beetz, "Fast Point Feature Histograms (FPFH) for 3D Registration," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Kobe, Japan, May 12-17, 2009.
[6] U. Klank, M. Z. Zia, and M. Beetz, "3D Model Selection from an Internet Database for Robotic Vision," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Kobe, Japan, May 12-17, 2009.
[7] V. Lepetit and P. Fua, "Keypoint recognition using randomized trees," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1465–1479, Sept. 2006.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[9] R. B. Rusu, Z. C. Marton, N. Blodow, M. Dolha, and M. Beetz, "Towards 3D Point Cloud Based Object Maps for Household Environments," Robotics and Autonomous Systems Journal (Special Issue on Semantic Knowledge), 2008.
[10] R. B. Rusu, Z. C. Marton, N. Blodow, M. E. Dolha, and M. Beetz, "Functional Object Mapping of Kitchen Environments," in Proceedings of the 21st IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nice, France, September 22-26, 2008.
[11] D. Pangercic, R. Tavcar, M. Tenorth, and M. Beetz, "Visual scene detection and interpretation using encyclopedic knowledge and formal description logic," in Proceedings of the International Conference on Advanced Robotics (ICAR), 2009.
[12] M. Ulrich, C. Wiedemann, and C. Steger, "CAD-based recognition of 3D objects in monocular images," in International Conference on Robotics and Automation, 2009, pp. 1191–1198.
[13] Z. C. Marton, L. Goron, R. B. Rusu, and M. Beetz, "Reconstruction and Verification of 3D Object Models for Grasping," in Proceedings of the 14th International Symposium on Robotics Research (ISRR), Lucerne, Switzerland, August 31 - September 3, 2009.
[14] R. B. Rusu, A. Holzbach, G. Bradski, and M. Beetz, "Detecting and Segmenting Objects for Mobile Manipulation," in Proceedings of the IEEE Workshop on Search in 3D and Video (S3DV), held in conjunction with the 12th IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan, September 27, 2009.
[15] Z. C. Marton, R. B. Rusu, D. Jain, U. Klank, and M. Beetz, "Probabilistic Categorization of Kitchen Objects in Table Settings with a Composite Sensor," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), St. Louis, MO, USA, October 11-15, 2009.
[16] I. Horswill, "Analysis of adaptation and environment," Artificial Intelligence, vol. 73, pp. 1–30, 1995.
[17] J. M. Siskind, "Reconstructing force-dynamic models from video sequences," Artificial Intelligence, vol. 151, no. 1-2, pp. 91–154, 2003.
[18] R. Mann, A. Jepson, and J. M. Siskind, "Computational perception of scene dynamics," in Computer Vision and Image Understanding, 1996, pp. 528–539.
[19] M. Z. Zia, U. Klank, and M. Beetz, "Acquisition of a Dense 3D Model Database for Robotic Vision," in Proceedings of the International Conference on Advanced Robotics (ICAR), Munich, Germany, June 22-26, 2009.