
Vol. 1 No. SIG 1(CVIM 1) IPSJ Transactions on Computer Vision and Image Media June 2008

Regular Paper

Synthesis of Dance Performance Based on Analyses of Human Motion and Music

TAKAAKI SHIRATORI†1 and KATSUSHI IKEUCHI†2

Recent progress in robotics holds great potential, and we are working toward a dancing humanoid robot for entertainment. In this paper, we propose three fundamental methods for a dancing robot aimed at a sound feedback system, in which a robot listens to music and automatically synthesizes dance motion based on musical features. The first study analyzes the relationship between motion and musical rhythm and extracts important features for imitating human dance motion. The second study models the modification of upper body motion based on the speed of the played music so that a humanoid robot can dance to a faster musical speed. The third study automatically synthesizes dance performance that reflects both the rhythm and the mood of the input music.

1. Introduction

Since technology regarding humanoid robots is advancing rapidly, many research projects related to these robots have been conducted. To add to this research, we are working to bring human dance motion to entertainment robots, aiming to mimic dance performance with a biped humanoid robot.

We developed an algorithm to enable a humanoid robot to represent dance performance17). This algorithm is based on the Learning-from-Observation (LFO) paradigm, in which a robot directly acquires the knowledge of what to do and how to do it by observing a human demonstration. This paradigm is necessary because the difference in body structure between a robot and a human performer makes it impossible to directly map human motion to a robot that needs to maintain its balance. We designed task models for the leg motion of dance performance based on contact states, and we have used these to automatically modify human motion for use by a robot.

However, there still remains the problem that this algorithm runs offline: the robot cannot listen to music and respond to it while performing a dance. In this paper, we propose three fundamental techniques aimed at achieving this dancing-to-music ability. This ability, which we call the sound feedback system, reflects the fact that people can synchronize their dance motion with various musical features such as rhythm, speed, and mood, even if they are novices. The ultimate goal is a robot that listens to the currently played music and automatically synchronizes or composes dance motion in time with it.

†1 Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo

†2 Interfaculty Initiative in Information Studies, The University of Tokyo

2. Overview of Proposed Methods

Considering characteristics of actual dance performances, there are various musical features affecting dance motion. We mainly focused on the following correspondences between music and motion:
Rhythm: Rhythm is one of the most important features of dance performance. Even novices can recognize musical rhythm and easily clap or wave their hands and legs in response to it.
Speed: Dance motion should be synchronized with the speed of the musical rhythm. Usually, when music gets faster, dance motion is modified to follow the musical speed.
Mood: Some dance performances are strongly affected by musical moods such as happiness and sadness. Even if we do not dance to music, we feel quiet and relaxed when listening to relaxing music such as a ballad, and we feel excited when listening to intense music such as hard rock.

Based on these factors, we developed three methods to analyze and synthesize dance motion with musical features based on human perceptions.

The first study, described in Section 3, is to analyze the relationship between motion and musical rhythm and to extract important stop features in order to mimic human dance motion. The goal of this study is to distinguish which features should be preserved for dance motion imitation. According to observation of human dance motion, motion rhythm is represented with a stop motion called a keypose, at which dancers clearly stop their movements, and the motion rhythm is synchronized with the musical rhythm when performing a dance. The proposed method aims to reveal this relationship.

The second study, described in Section 4, is to model how to modify upper body motion based on the speed of the played music. When we observed structured dance motion performed at a normal music playback speed and motion performed at a faster music playback speed, we found that the details of each motion are slightly different, while the whole of the dance motion is similar in both cases. To prove this, we analyzed the motion differences in the frequency domain, and obtained two insights on the omission of motion details.

The third study, described in Section 5, is to synthesize dance performance that is well matched to the mood of the input music. We mainly focus on intensity in dance performance as a mood feature. We designed an algorithm to synthesize new dance performance by assuming the relationship between motion and musical rhythm mentioned in the first study, and the relationship between motion and music intensity. However, dance motion with high intensity is difficult to reproduce with a biped robot due to balance maintenance, so our target in this study is CG character animation. We believe that this method will be applicable to a robot in the future.

3. Keypose Extraction for Imitating Dance Motion26)

Mapping human motion to a humanoid robot is a difficult problem, and understanding which features in human motion are important and using those features for reproduction with robots has been well studied. Some previous methods have extracted abstract models by recognizing what to do and how to do it, and have generated motion for robots from the models7),8),16),20),28). However, we found that the traditional techniques tend to extract too many models from dance motion19), and it is nearly impossible to distinguish what is truly important for dance performance imitation. This section describes a novel method to analyze the relationship between important postures in dance motion and musical rhythms in order to understand the essential features of dance motion. We refer to these important postures as keyposes.

According to Flash and Hogan3), every human motion consists of several motion primitives, which denote fundamental elements of human motion, and these primitives are segmented by detecting instances when hands and feet stop their movements. For whole body motion, there are many methods that segment human motion by detecting the local minima of end-effector speed and classify the motion segments into several clusters by calculating co-occurrence21), by using Hidden Markov Models (HMM)7), or by applying a spatio-temporal Isomap for dimensionality reduction8). Kahol et al.9),10) proposed a motion segmentation method using approximated physical parameters such as force, momentum, and kinetic energy. We decided to follow this biomechanical concept, and thus we defined keyposes in dance motion as stopping postures.

In addition, we focused on the rhythm of dance performance. The motion rhythm of most dance performances corresponds to the music rhythm, and some prior work on character animation uses this property for animated motion synthesis1),12),13). Our method therefore consists of a motion analysis step that extracts stopping postures from motion and a music analysis step that extracts rhythm from music. Combining motion and musical information allows the motion's keyposes to be established.

3.1 Rhythm Tracking from Music Sequence
To estimate musical rhythm, we use the following known principles:
Principle 1: A sound is likely to be produced consistent with the timing of the rhythm.
Principle 2: The interval of the onset components is likely to be equal to that of the rhythm.
An onset component represents the spectral power increase from the previous temporal frame, and we use Goto's method5) to extract onset components. By applying an autocorrelation function to the time series of onset components, we can estimate the rhythm of the music.
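As a rough illustration of this rhythm-tracking step, the sketch below computes an onset envelope from frame-to-frame spectral power increases and picks the dominant autocorrelation lag as the beat period. The function names, STFT parameters, and tempo range are our own assumptions for illustration, not the authors' implementation (which follows Goto's beat tracker5)).

```python
# A minimal sketch of onset-based rhythm estimation in the spirit of Section 3.1.
# All names and parameters here are illustrative assumptions.
import numpy as np
from scipy.signal import stft

def onset_envelope(audio, sr, frame_len=1024, hop=512):
    """Sum of positive spectral-power increases per frame (Principles 1 and 2)."""
    _, _, spec = stft(audio, fs=sr, nperseg=frame_len, noverlap=frame_len - hop)
    power = np.abs(spec) ** 2                      # (freq_bins, frames)
    diff = np.diff(power, axis=1)                  # frame-to-frame power change
    onset = np.maximum(diff, 0.0).sum(axis=0)      # keep only power increases
    return onset, sr / hop                         # envelope and its frame rate

def estimate_beat_period(onset, frame_rate, min_bpm=60, max_bpm=180):
    """Pick the autocorrelation peak within a plausible tempo range."""
    onset = onset - onset.mean()
    ac = np.correlate(onset, onset, mode="full")[len(onset) - 1:]
    lo = int(frame_rate * 60.0 / max_bpm)          # shortest allowed beat period (frames)
    hi = int(frame_rate * 60.0 / min_bpm)          # longest allowed beat period (frames)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return lag / frame_rate                        # beat period in seconds
```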

3.2 Keypose Candidate Extraction from Motion Sequence
Our motion analysis method is based on the speed of a performer's hands, feet, and center of mass (CM). In many forms of dance, including Japanese traditional dance, the movements of the hands and feet have a strong relationship with the intended expression of the whole body. Therefore, the speed of the hands and feet is useful for extracting stop motions. However, this is not sufficient for keypose extraction, because sometimes the dancer makes rhythm errors, or dances vary with the preferences or genders of performers, etc. So, in addition to the motion of the hands and feet, our algorithm uses the motion of the body's CM calculated from a standard mass distribution. The motion of the CM represents the motion of the whole body; thus, the effects of missteps and individual differences are smaller. Through this step, we extract motion keypose candidates that satisfy the following criteria:
( 1 ) Dancers clearly stop their movements.
( 2 ) Dancers clearly move their body parts between neighboring keyposes.


Fig. 1 Keypose candidate extraction for hand and CM motions.

Fig. 2 Keypose candidate extraction for foot motions.

The second criterion is necessary to avoid very small spikes in the speed sequence produced by noise.

3.2.1 Hand and CM Motions
The speed of hand and CM motions approaches zero during stopping movements, as shown in Figure 1. To extract keypose candidates for hand and CM motions, we modify the criteria described above as:
( 1 ) Each candidate is a local minimum in the speed sequence, and the local minimum is less than a minimum speed threshold.
( 2 ) The local maximum between two successive candidates is larger than a maximum speed threshold.
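A minimal sketch of these two criteria is given below, assuming the speed sequence is a 1-D array sampled at the motion-capture frame rate; the threshold names and values are illustrative, not the authors'.

```python
# Sketch of the hand/CM candidate criteria of Section 3.2.1: local speed minima
# below v_min, with a clear movement (local maximum above v_max) between
# successive candidates. Thresholds are assumed parameters.
import numpy as np

def hand_cm_candidates(speed, v_min, v_max):
    """Return frame indices of candidate stops in a 1-D speed sequence."""
    speed = np.asarray(speed, dtype=float)
    # criterion (1): local minima that are sufficiently slow
    minima = [i for i in range(1, len(speed) - 1)
              if speed[i] <= speed[i - 1] and speed[i] <= speed[i + 1]
              and speed[i] < v_min]
    # criterion (2): require a clear movement between neighboring candidates
    candidates = []
    for idx in minima:
        if not candidates or speed[candidates[-1]:idx + 1].max() > v_max:
            candidates.append(idx)
    return candidates
```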

3.2.2 Foot Motions
In foot motions, one leg often functions as a supporting leg while the other functions as a swing leg. Thus, the speed sequence of foot motions often consists of a series of bell-shaped curves, as shown in Figure 2. To extract keypose candidates from foot motions, we first extract the rise and decay of each foot speed sequence. Then, the area between the rise and decay, which shows how far each foot moved while it was used as a swing leg, is calculated. If the area is larger than a trajectory-length threshold, the rise and decay become candidates.
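The following sketch illustrates the foot criterion under the simple assumption that a swing phase is delimited by threshold crossings of the foot speed; the thresholds and this segmentation strategy are ours, not necessarily the authors'.

```python
# Sketch of Section 3.2.2: find the rise and decay of each bell-shaped swing in a
# foot-speed sequence and keep the pair only if the enclosed area (approximate
# swing length) exceeds a threshold.
import numpy as np

def foot_candidates(speed, moving_thresh, min_swing_length, dt):
    """Return (rise, decay) frame pairs whose swing length is large enough."""
    speed = np.asarray(speed, dtype=float)
    moving = speed > moving_thresh                    # True while the foot swings
    pairs, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i                                 # rise of a bell-shaped curve
        elif not m and start is not None:
            length = np.trapz(speed[start:i], dx=dt)  # area = distance travelled
            if length > min_swing_length:
                pairs.append((start, i))              # keep rise and decay as candidates
            start = None
    return pairs
```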

3.3 Keypose Candidate Refinement Using Musical Rhythm
The next step is to refine the keypose candidates by considering musical rhythm. For each speed sequence, our method tests whether there are candidates around the musical rhythm inflection points detected from the onset components. If there is a keypose candidate, it is possible that there is a stop point in the musical rhythm, and if so, this keypose candidate is retained for the final step.

Fig. 3 Refinement of the keypose candidates using musical rhythm.

Figure 3 illustrates the keypose candidate refinement process. Musical rhythms are represented as vertical solid lines. There are keypose candidates around the second and fourth musical rhythms, represented in the figure by green vertical lines, and these candidates are preserved for the keypose extraction process.
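A small sketch of this refinement is given below, assuming candidates are given as frame indices and the musical rhythm as a list of beat times; the tolerance window is an assumed parameter.

```python
# Sketch of Section 3.3: a candidate survives only if it lies within a small
# window around an estimated musical beat time.
import numpy as np

def refine_by_rhythm(candidate_frames, beat_times, frame_rate, window=0.1):
    """Keep candidates that lie within `window` seconds of some beat."""
    beat_times = np.asarray(beat_times, dtype=float)
    kept = []
    for f in candidate_frames:
        t = f / frame_rate
        if np.min(np.abs(beat_times - t)) <= window:
            kept.append(f)
    return kept
```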

3.4 Keypose Extraction
In the next phase, keypose candidates of a dance performance are subjected to two further criteria:
( 1 ) Retained keypose candidates must include a match in time between more than two of the following: left hand, right hand, or feet.
( 2 ) Retained keypose candidates must include a CM keypose match.

For example, the first criterion would be satisfied by keypose candidates of the left hand, the right hand, and one foot if these match in time. It would not be satisfied by keypose candidate time-matches in only one foot and one hand. In other words, the first criterion can extract poses at which dancers stop the movements of their hands and feet even when the stopping instant of each body part is slightly different. These poses are likely to be stop motions.

But this first criterion may extract poses that are not considered to be keyposes. For example, consider walking motion. In this motion, a performer's hands nearly stop their movements when his or her feet are on the ground. However, the body keeps moving in the forward direction, and this pose cannot usefully be considered a stop motion. Such translations are common in dance, so we define the second criterion to help eliminate false positives (keypose candidates labeled as valid poses when in fact they are not); both criteria must be simultaneously satisfied to retain a keypose candidate.
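The decision rule of this section can be sketched as follows, assuming per-part candidate frames for both hands and both feet and a separate list of CM candidates; the matching tolerance is an assumed parameter.

```python
# Sketch of Section 3.4: declare a keypose when candidates from more than two of
# the dancer's hands/feet coincide in time (criterion 1) AND a CM candidate
# coincides as well (criterion 2).
def extract_keyposes(limb_candidates, cm_candidates, frame_rate, tol=0.15):
    """limb_candidates: dict mapping each hand/foot to its candidate frames."""
    tol_frames = int(tol * frame_rate)
    keyposes = []
    for cm in cm_candidates:                       # criterion (2): the CM must also stop
        parts_near = [part for part, frames in limb_candidates.items()
                      if any(abs(f - cm) <= tol_frames for f in frames)]
        if len(parts_near) > 2:                    # criterion (1): more than two body parts agree
            keyposes.append(cm)
    return keyposes
```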


Fig. 4 Subset of extracted keyposes from the Aizu-bandaisan dance (top) and dancer's understanding (bottom).

3.5 Experiments
Our proposed method was evaluated using motion capture data and music data of two dance sequences: the Aizu-bandaisan dance and the Jongara-bushi dance. To evaluate the effectiveness of our proposed method, we compared the results of our keypose extraction method with those from Nakazawa et al.'s method19), which uses only motion capture data to extract keyposes. Additionally, we compared the results of our method with the important postures manually extracted by dancers. We refer to the dancers' understanding as the ground truth of the keyposes.
Aizu-bandaisan Dance

A subset of our method's analysis of a female dance master performing the Aizu-bandaisan dance is shown in Figure 4. Our method correctly extracted all of these keyposes, with no false positives and no mis-detection errors. The previous method, which considers only motion capture data, extracted 8 of the 9 true keyposes correctly, but generated 4 false positives and 5 mis-detection errors.
Jongara-bushi Dance

A subset of the results of our extraction method for a dancer performing the Jongara-bushi dance is shown in Figure 5. This dance has 12 true keyposes. The previous method extracted 6 of these, and had no mis-detection errors. In contrast, our method extracted 9 correct keyposes, with no mis-detection errors. We believe that our method failed to detect 3 keyposes because of the high speed of this dance.

From these results, we concluded that our method was much more accurate than the previous method. We believe that the reason for our superior results is that our method considers not only motion capture data but also musical information, while the previous method considers only motion capture data.

Fig. 5 Subset of extracted keyposes from the Jongara-bushi dance (top) and dancer's understanding.

By incorporating an analysis of a dance's musical rhythm, we reduced the number of false positives that previous methods have generated due to the high degree of freedom of any articulated figure.

4. Synthesis of Temporally-Scaled Dance Motion Based on Observation25)

Temporal scaling of dance motion is very important for the development of the sound feedback system, since the speed of pre-recorded dance motion data may differ from that of a live music performance. If this occurs, a robot cannot collaborate with music performers without this ability. In this section, we propose a novel method to temporally scale the upper body motion involved in dance performance for synchronization with music.

Many researchers have attempted to efficiently edit and/or synthesize human motion from a single motion sequence through such procedures as editing motion capture data by signal processing techniques2), retargeting motion to new characters4), and modifying human motion to make it funny29). However, there are no methods to generate temporally scaled human motion except McCann et al.'s method15), which aimed at the temporal scaling of jumping motion by considering physical laws. Their method does not work well for non-jumping motion.

To achieve this goal, we first observe how dance motion is modified based on the played musical speed, and then we model the modification based on the acquired insights. When we observe structured dance motion performed by humans at a normal music playback speed versus motion performed using music that is 1.3 times faster, we find that the details of each motion sequence differ slightly, though the whole of the dance motion sequence is similar in both cases. An example of this type of motion modification, which is natural in humans, is shown in Figure 6. This phenomenon is derived from the fact that dancers omit details of a dance, but retain its essential aspects, when this is necessary to follow faster music.


Fig. 6 Comparison of hand trajectory differences depending on music playback speed. The green and yellow curves represent the hand trajectories at a normal musical speed and 1.3 times faster musical speed, respectively.

If we therefore observe motion differences in dances performed at different speeds in the frequency domain, we can obtain useful insights regarding motion detail omission. Based on these insights, we propose a new modeling method and develop some applications useful for humanoid robot motion generation.

4.1 Observation of Human Dance Motion
This section describes how we observed the relationship between human motion and musical rhythm speed.

4.1.1 Motion Observation Using Hierarchical B-spline
A hierarchical B-spline consists of a series of B-spline curves with different knot spacings; higher layers of a hierarchical B-spline are based on finer knot spacing and can preserve the higher frequency components of the original sequence. Each subject motion sequence has a different underlying musical rhythm, and a B-spline allows us to control frequency resolution simply by setting its control points at the desired temporal intervals. In our analysis, we used the musical rhythm for the knot spacing, and we normalized the temporal frames of the motion sequences into the knot space defined by the musical rhythm. Then we applied hierarchical B-spline decomposition to the joint angles of the dance motion via a least squares solution14).
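The decomposition can be sketched as a cascade of least-squares B-spline fits to residuals, with the knot spacing halved at each layer; the use of SciPy's least-squares spline routine and this specific layering scheme are our assumptions (the paper builds on multilevel B-splines14)).

```python
# Sketch of the hierarchical B-spline decomposition of Section 4.1.1: layer 1 is a
# least-squares cubic B-spline with knots at beat intervals, and each finer layer
# fits the residual with half the knot spacing.
import numpy as np
from scipy.interpolate import splrep, splev

def hierarchical_bspline(times, theta, beat_interval, n_layers=5, k=3):
    """Decompose a joint-angle trajectory theta(times) into n_layers B-spline layers."""
    layers = []
    residual = np.asarray(theta, dtype=float)
    spacing = float(beat_interval)
    for _ in range(n_layers):
        # interior knots at the current spacing (finer spacing in higher layers)
        knots = np.arange(times[0] + spacing, times[-1] - spacing, spacing)
        tck = splrep(times, residual, k=k, t=knots)   # least-squares fit to the residual
        layers.append(tck)
        residual = residual - splev(times, tck)       # pass what is left to the next layer
        spacing /= 2.0                                # halve the knot spacing
    return layers                                     # summing splev(times, tck) reconstructs theta
```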

4.1.2 Observation Using Hierarchical B-spline

Using an optical motion capture system, we captured the Aizu-bandaisan dance, a classical Japanese folk dance, at three musical speeds for observation: the original speed, 1.2 times faster, and 1.5 times faster. Motion sequences at each speed were captured five times in order to investigate motion variance, so a total of 15 datasets were considered in this experiment. We set the knot spacing to the musical rhythm, and then applied the hierarchical B-spline decomposition technique. We used up to five layers in our motion decomposition and observed the mean and variance of each reconstructed motion. Our choice of five layers was arbitrary, but it was empirically found to be enough to reconstruct the high-frequency components of human motion.

Figure 7 shows mean joint angle trajectories of the left shoulder. With regard to motion reconstructed from a single-layer B-spline (Figure 7 (a)), the motion at the 1.2 times faster musical speed is quite similar to the motion at the normal musical speed. The motion at the 1.5 times faster musical speed is also similar to the motion at the normal speed, but details such as curvature differ slightly. With regard to motion reconstructed from a two-layer hierarchical B-spline (Figure 7 (b)), the shape of the joint angle trajectory at the normal musical speed differs slightly from that at the 1.2 times faster musical speed, especially in the trajectory's sharpest curves. On the other hand, the shape of the joint angle trajectory at the 1.5 times faster musical speed appears to be a smoothed version of the normal-speed trajectory. With regard to motion reconstructed from a three-layer hierarchical B-spline (Figure 7 (c)), the differences among the joint angle trajectories become more noticeable. The shape of the joint angle trajectory at the 1.5 times faster musical speed is a much smoother version of the trajectory at the normal musical speed, whereas the shape at the 1.2 times faster musical speed is just a slightly smoother version of the normal-speed trajectory. For motion reconstructed from four-layer and five-layer hierarchical B-splines, these phenomena appear even more clearly.

Figure 8 shows a comparison of variance sequences; the green, yellow, and light blue lines represent the variance sequences of the left shoulder joint angle at the normal musical speed, 1.2 times faster musical speed, and 1.5 times faster musical speed, respectively, and the blue line represents the variance sequence calculated from all the motion sequences. The joint angles for the variance calculation were reconstructed with a five-layer hierarchical B-spline, and were normalized by adjusting the knots of the estimated control points. From these variance sequences, it is confirmed that there are some valleys where each variance sequence is locally minimal. This means that the postures at these valleys (the middle row of Figure 8) are preserved even if the musical speed gets faster and the high frequency components are attenuated.


Fig. 7 Comparison of mean joint angle trajectories at the original musical speed (green), 1.2 times faster speed (yellow), and 1.5 times faster speed (light blue). Top: whole trajectories; bottom: close-up of the top-left. (a) mean motion using a single-layer B-spline, (b) mean motion using a two-layer hierarchical B-spline, and (c) mean motion using a three-layer hierarchical B-spline.

Fig. 8 Comparison of variance sequences at the original musical speed (green), 1.2 times faster speed (yellow), and 1.5 times faster speed (light blue). The blue line represents the variance calculated from all the motion sequences.

We found that most valleys represent important stop motions specified by the dance masters (the bottom row of Figure 8), and that they are very close to the results of our keypose detection method described in Section 3.

From these observations, we obtained the following two insights:
Insight 1: High-frequency components of human motion will be attenuated when the music playback speed becomes faster.
Insight 2: Keyposes will be preserved even if high-frequency components are attenuated.

Based on these insights, we propose a method to model the temporal scaling of human dance motion.

4.2 Upper Body Motion Generation by Temporal Scaling
This section describes how to temporally scale human motion based on these insights.

4.2.1 Hierarchical Motion Decomposition Using Keypose Information
According to Insight 2, keypose information, including posture and velocity components, is preserved even if the musical speed is fast. Therefore, the low frequency components of a dance motion sequence must contain the keypose information. With this insight in mind, we improved the traditional motion decomposition method so that it considers the posture and velocity information of the keyposes.

To consider posture information, we densely sample the input motion sequence around keyposes, sparsely sample it in other parts, and then use these samples to form a linear system of equations. Figure 9 provides an illustration of our data-sampling method for motion decomposition.


Fig. 9 Illustration of our sampling method to consider keypose information for hierarchical motion decomposition.

All vertical lines in this illustration represent the originally sampled data, and our method uses only the solid lines among them.

With regard to velocity information, the movements of a dancer's arms and hands stop around keyposes: the velocity of the hands and arms is approximately zero at keyposes. We exploit this useful property of keyposes as velocity information in our motion decomposition method. From all the keyposes, we form a linear system of equations to satisfy the velocity constraints. For each layer of the hierarchical B-spline, we estimate the control points from the posture and velocity constraints using a pseudo-inverse matrix and decompose the input motion sequence.
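A sketch of this constrained fit for one layer is given below, assuming the B-spline basis matrix B and its time derivative dB (frames x control points) have already been evaluated; the sampling density and keypose window are assumed parameters, and the pseudo-inverse is realized with an ordinary least-squares solve.

```python
# Sketch of Section 4.2.1: sample the trajectory densely near keyposes and
# sparsely elsewhere (posture constraints), add zero-velocity rows at the
# keyposes (velocity constraints), and solve for the control points.
import numpy as np

def fit_layer(theta, B, dB, keyposes, frame_rate, sparse=8, radius=0.25):
    theta = np.asarray(theta, dtype=float)
    frames = np.arange(len(theta))
    near = np.zeros(len(theta), dtype=bool)
    for kp in keyposes:                               # dense window around each keypose
        near |= np.abs(frames - kp) <= int(radius * frame_rate)
    sample = near | (frames % sparse == 0)            # sparse sampling elsewhere
    A = np.vstack([B[sample], dB[keyposes]])          # posture rows + velocity rows
    b = np.concatenate([theta[sample], np.zeros(len(keyposes))])  # zero velocity at keyposes
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs                                     # control points of this layer
```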

4.2.2 Motion Generation Based on Mechanical Constraints
The final step is to generate temporally-scaled motion for a humanoid robot. Simple temporal scaling can be done by adjusting the temporal frames of the B-spline control points with the specified scaling ratio. However, the resulting motion may violate angular limitations such as the joint angular velocity. To solve this, we consider Insight 1 and the mechanical constraints of the humanoid robot, and we modify the upper body motion.

In this step, we first segment the motion sequence to correspond to music rhythm frames, and then we optimize weighting factors for each hierarchical B-spline layer in each motion segment so that the resulting motion sequence satisfies certain mechanical constraints. The resulting joint angle θopt is represented as

\theta_{\mathrm{opt}}(t) = \sum_{i=1}^{N} w_i f_i(2^{i-1} s t), \qquad (1)

where s represents a temporal scaling factor (i.e., the resulting motion is s times faster than the original motion), N represents the number of hierarchical B-spline layers, f_i represents the i-th layer of the constructed hierarchical B-spline, and w_i ∈ [0, 1] represents the weighting factor for the i-th layer, determined via the optimization process described below.

Fig. 10 Motion reconstruction considering mechanical constraints.

According to Insight 1, the high frequency components are attenuated when the motion goes beyond the joint angle limitations. Therefore, this optimization process is carried out by attenuating the weighting factors starting from the finest layer. When a weighting factor reaches zero and the resulting motion still does not satisfy the mechanical constraints, the weighting factor for the next coarser layer is then gradually attenuated. Finally, when the resulting motion consists of n layers, the weighting factors from the 1st to the (n − 1)-th layers are 1, the factor for the n-th layer is within the range (0, 1], and the factors from the (n + 1)-th to the N-th layers are 0. This is illustrated in Figure 10.
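The reconstruction of Eq. (1) and this coarse-to-fine attenuation can be sketched as follows, with each layer given as a callable and the mechanical constraint reduced to a single joint angular-velocity limit; these simplifications are our assumptions.

```python
# Sketch of Eq. (1) and the weight-attenuation loop of Section 4.2.2.
import numpy as np

def reconstruct(times, layers, weights, s):
    """theta_opt(t) = sum_i w_i * f_i(2^(i-1) * s * t), with f_i given as callables."""
    times = np.asarray(times, dtype=float)
    theta = np.zeros_like(times)
    for i, (f, w) in enumerate(zip(layers, weights), start=1):
        theta += w * f(2 ** (i - 1) * s * times)
    return theta

def fit_weights(times, layers, s, vel_limit, step=0.05):
    weights = np.ones(len(layers))
    for i in reversed(range(len(layers))):            # attenuate the finest layer first
        while weights[i] > 0.0:
            theta = reconstruct(times, layers, weights, s)
            vel = np.gradient(theta, times)
            if np.max(np.abs(vel)) <= vel_limit:      # mechanical constraint satisfied
                return weights
            weights[i] = max(0.0, weights[i] - step)
    return weights
```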

In this process, a discontinuity might develop between neighboring motion segments if there ends up being a difference in the weighting factors. So we simply apply motion blending around the discontinuities using spherical linear interpolation (SLERP) for the joint angles. This interpolation process may itself go beyond the mechanical limitations, so we iterate the optimization and interpolation procedures until the resulting motion no longer violates the mechanical constraints.

4.3 Experiments
In this section, we show the results of the experiments. We tested our algorithm by modifying the Aizu-bandaisan dance data. We applied the proposed method to the upper body motion, and applied Nakaoka et al.'s method to generate the leg motion16). Our experimental platform was HRP-2.

4.3.1 Result of Original-Speed Motion Generation
We first tested our algorithm by generating the dance motion for the HRP-2 at the normal speed. In this experiment, we compared our method with Pollard et al.'s method, which can adapt motion capture data for a humanoid robot using a PD controller22).


Fig. 11 Result of generating the Aizu-bandaisan dance motion at the original musical speed.

Fig. 12 Comparison of left shoulder yaw angle trajectories of the original motion (red), generated by Pollard et al.'s method (green), and generated by our method (blue). (a): joint angle trajectories, and (b): joint angular velocity. (b.1) and (b.2) represent zoomed-in graphs of parts (1) and (2) in (b), respectively.

Figure 11 shows the experimental result with the actual HRP-2. It is confirmed that the robot can stably imitate the human dance motion.

Figure 12 shows the resulting joint angle trajectories of the left shoulder yaw. As for the joint angular velocity (Figure 12 (b)), our method has two advantages. One is that our method can preserve more details than the trajectories resulting from Pollard et al.'s method. The trajectories resulting from Pollard et al.'s method often lack high frequency components due to the PD control. This phenomenon is shown in Figure 12 (b.1). The other is that, in the motion generated by Pollard et al.'s method, the speed around motion frames where the speed limitations are violated is fixed to a constant value. This phenomenon is shown in Figure 12 (b.2). This can create two problems. One is that the humanoid robot cannot clearly reproduce a keypose if the posture and angular speed around the keypose violate kinematic constraints. The other is that the humanoid robot may fall because of the rapid changes in acceleration.

4.3.2 Simulation Result of 1.2 Times Faster Motion Generation
Next, we tested our algorithm by generating, in simulation, dance motion whose speed was 1.2 times faster than the original speed. The upper body motion was generated by our proposed method; as for the leg motion, we first applied simple temporal scaling to the motion capture data and then applied Nakaoka et al.'s method. Figure 13 shows the simulation results, and the red sphere represents the Zero Moment Point. Our simulated motion satisfied the criterion for balance maintenance, and the humanoid robot successfully performed the dance.

Fig. 13 Simulation result of generating 1.2 times faster dance motion. The red sphere represents the ZMP of the resulting motion.

5. Dancing-to-Music Character Animation Based on Aspects of Mood27)

The previous section described a method to enable a robot to dance to musical rhythm and speed. However, some dance performances are improvised based on the mood of the music. The method described in this section focuses on this type of dance, and its purpose is to synthesize dance motion well matched to music mood features. Though this method currently targets only CG characters because of intense motions such as jumping, it should also be applicable to a humanoid robot in the future.

Dancers can compose a dance motion at the same time as they listen to the musical sounds. Although this ability may appear amazing, these performers do not actually create entirely new motions; rather, they combine appropriate motion segments from their knowledge database, with the music as their key, to perform their unique movements. Considering this ability, we are led to believe that dance motion has strong connections with music in the following two aspects: 1) the rhythm of dance motions is matched to that of music, and 2) the intensity of dance motions is synchronized to that of music. The first assumption is derived from our work described in Section 3, while the second assumption is derived from the fact that people tend to feel quiet and relaxed when listening to relaxing music such as a ballad, but may feel excited when listening to exciting music.

Kim et al.12) proposed a rhythmic motion synthesis method using the results of motion rhythm analysis. Alankus et al.1) and Lee et al.13) also proposed methods to synthesize dance motion by considering the rhythm of the input music. The drawback of these methods is that they consider only musical rhythm; because of this, it is very difficult to synthesize expressive dance motion.

Our approach consists of three steps: motion analysis, music analysis, and motion synthesis based on the extracted features. In the motion analysis step, we analyze the rhythm and intensity features of the input dance motions, and assign the features to each motion in a database. In the music analysis step, we first analyze the structure of the input music sequence, and extract music segments based on the structure analysis results. Next, musical rhythm and intensity features are extracted and assigned to each music segment. Finally, our method automatically synthesizes new dance motion by interpolating between the motion segments.

5.1 Motion Feature Analysis
Our motion analysis method strongly relies on Laban's weight effort component. In this section, we describe our definition of the weight effort component and how to extract the motion features.

5.1.1 Weight Effort
According to Laban's theory, the emotion of human motion comes from motion features consisting of effort and shape components. The effort component is defined by the movements of body portions, and the shape component is defined by the shape of body elements. More recently, Nakata et al.18) tested the validity of Laban's theory by using their small robot and user studies. Although they could not find a significant relationship between the shape component and any emotion, they found that the weight effort component, one of the effort components, is closely related to the excitement of the motion. Thus, we define the weight effort component W as the linear sum of approximated instantaneous momentum magnitudes calculated from the link and body directions:

W(f) = \sum_i \alpha_i \arccos\left( \frac{\mathbf{v}_i(f)}{|\mathbf{v}_i(f)|} \cdot \frac{\mathbf{v}_i(f+1)}{|\mathbf{v}_i(f+1)|} \right) + \sum_{j \in \{x,y,z\}} \arccos\left( \frac{\mathbf{r}_j(f)}{|\mathbf{r}_j(f)|} \cdot \frac{\mathbf{r}_j(f+1)}{|\mathbf{r}_j(f+1)|} \right), \qquad (2)

where α_i is a regularization parameter for the i-th link, v_i is a unit vector representing the direction of the i-th body link in the body center coordinate system, and r_x, r_y, and r_z represent the 3-dimensional orientation of the body.

5.1.2 Motion Rhythm Feature
Considering the characteristics of the weight effort component, the local minima of this component indicate stop motions, which are impressive instants in a dance performance. We recognize these local minima as the motion rhythm.

5.1.3 Motion Intensity Feature
It has been validated that motion intensity is related to momentum and forward translation. We obtain the instant motion intensity I from the momentum W and the speed in the forward direction:

I(f) = W(f) · (1.0 + k · r_y(f) · t(f)), \qquad (3)

where k is a regularization parameter between the weight effort and the speed, and r_y · t represents the speed of the body direction change. Finally, we calculate the average of the instant motion intensity from the previous motion rhythm point to the next one, and set it as the motion intensity.
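The two motion features can be sketched as below, assuming per-frame unit link-direction vectors and body-orientation axes are available from the skeleton, and abstracting the r_y · t term of Eq. (3) as a precomputed forward-speed sequence; these are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of the weight effort W(f) of Eq. (2) and the instant intensity I(f) of Eq. (3).
import numpy as np

def weight_effort(link_dirs, body_axes, alpha):
    """link_dirs: (F, L, 3) unit link directions; body_axes: (F, 3, 3) body x/y/z axes."""
    def angle(a, b):
        cos = np.clip(np.sum(a * b, axis=-1), -1.0, 1.0)
        return np.arccos(cos)                               # per-vector rotation angle
    W = np.zeros(len(link_dirs) - 1)
    for f in range(len(W)):
        W[f] = np.sum(alpha * angle(link_dirs[f], link_dirs[f + 1]))   # link term
        W[f] += np.sum(angle(body_axes[f], body_axes[f + 1]))          # body-orientation term
    return W

def motion_intensity(W, forward_speed, k):
    """Eq. (3) with r_y(f)*t(f) abstracted as a forward-speed sequence (assumption)."""
    return W * (1.0 + k * np.asarray(forward_speed)[:len(W)])
```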

5.2 Music Feature Analysis
When people listen or dance to music, they extract some musical features from the audio signal. The important features for dance performance are the music structure, rhythm, and intensity. Regarding the music structure, we focus on repeating patterns of the melody line; we employ the timbre-independent similarity measurement proposed by Wang et al.30), and we obtain music segments. We assign the extracted musical rhythm and intensity features, described below, to the music segments.


5.2.1 Music Rhythm Feature
To extract the music rhythm, we employ the onset component-based rhythm estimation described in Section 3.1. After the music rhythm estimation process, the musical rhythm feature at time t is set to 1 when t is a music rhythm time, and set to 0 otherwise.

5.2.2 Music Intensity Feature
To extract music intensity, we use the following:
Principle 3: The spectral power of a melody line is likely to increase during increasing intensity in the music.
Principle 4: A melody line is likely to be performed in a range higher than the C4 note.
Many surveys on auditory psychology24) say that our ears tend to recognize only the sound whose spectral power is the strongest among neighboring frequency sounds, a property often exploited in audio signal compression algorithms such as MP3. Accordingly, for each music segment, we calculate the temporally averaged spectral power of each musical note and extract the peaks to determine which note sounds are mainly produced in the segment. To extract the music intensity feature, we approximately calculate the Sound Pressure Level of the extracted musical notes' power, which takes human auditory properties into account and is related to both amplitude and frequency.
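One loose reading of this step is sketched below: average the spectral power of each semitone over a segment, pick the dominant note, and report its level in decibels. The semitone binning, the single dominant peak, and the plain dB conversion (without the full auditory weighting) are our assumptions.

```python
# Sketch of a per-segment music intensity feature in the spirit of Section 5.2.2.
import numpy as np
from scipy.signal import stft

def music_intensity(audio, sr, midi_lo=60, midi_hi=96):    # C4 and above (Principle 4)
    freqs, _, spec = stft(audio, fs=sr, nperseg=4096)
    power = np.mean(np.abs(spec) ** 2, axis=1)              # time-averaged power per bin
    note_power = []
    for midi in range(midi_lo, midi_hi + 1):
        f0 = 440.0 * 2 ** ((midi - 69) / 12.0)              # semitone center frequency
        lo, hi = f0 * 2 ** (-0.5 / 12), f0 * 2 ** (0.5 / 12)
        band = power[(freqs >= lo) & (freqs < hi)]
        note_power.append(band.mean() if band.size else 0.0)
    dominant = int(np.argmax(note_power))                   # note mainly produced in the segment
    return 10.0 * np.log10(note_power[dominant] + 1e-12)    # crude level in dB
```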

5.3 Motion Synthesis Considering Motion and Music Features
The final step of our approach is to synthesize new dance motions considering both the motion and music feature vectors. Figure 14 gives an overview of our motion synthesis algorithm. From the analysis steps, we have music segments with music rhythm and intensity features, and we have motion rhythm and intensity features for each motion sequence in the database. Since rhythm is one of the most important features in dance performance, we first evaluate the similarity of the rhythm components, and detect the candidate motion segments strongly corresponding to each music segment. Then, we apply connectivity analysis, which checks whether synthesized transition motion between neighboring motion segments looks natural, and we extract the possible sequences of motion segments. Finally, we analyze the similarity of the intensity components between the music segments and the selected motion segment sequences, and we synthesize new dance motions by connecting the motion segments with each other.

Fig. 14 Overview of our motion synthesis algorithm.

5.3.1 Similarity Measurement of Rhythm Components
Through music analysis, we obtain music segments and the music rhythm and intensity features for each segment. In addition, we obtain a motion rhythm feature and a motion intensity feature through motion analysis. The aim of this procedure is to extract candidate motion segments by evaluating rhythm features. For each motion sequence in the database, we calculate the cross-correlation between the motion and music rhythm features frame by frame, and we try to find partial motion sequences whose rhythm is well matched to the music rhythm via a thresholding process. These partial motion sequences can be used as candidate motion segments for each music segment.
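This matching step can be sketched as a sliding normalized cross-correlation between binary rhythm vectors (1 at rhythm frames, 0 otherwise), with candidates retained by thresholding; the threshold value is an assumed parameter.

```python
# Sketch of Section 5.3.1: find motion sub-sequences whose rhythm pattern
# correlates strongly with a music segment's rhythm pattern.
import numpy as np

def candidate_segments(motion_rhythm, music_rhythm, threshold=0.5):
    """Return start frames where the motion rhythm matches the music rhythm."""
    motion_rhythm = np.asarray(motion_rhythm, dtype=float)
    music_rhythm = np.asarray(music_rhythm, dtype=float)
    n = len(music_rhythm)
    music_norm = np.linalg.norm(music_rhythm)
    starts = []
    for s in range(len(motion_rhythm) - n + 1):
        window = motion_rhythm[s:s + n]
        denom = music_norm * np.linalg.norm(window)
        score = np.dot(window, music_rhythm) / denom if denom > 0 else 0.0
        if score >= threshold:
            starts.append(s)
    return starts
```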

5.3.2 Connectivity Analysis of Motion Segments
Whether or not a synthesized motion looks natural depends strongly on the connectivity analysis. In this step, we consider both posture similarity and movement similarity. Posture similarity between candidate motion segments is defined as the angular similarity of the body link directions, while movement similarity is calculated from the time derivatives of the body link directions. We use these measurements between the end frame of one motion segment and the beginning frame of the neighboring motion segment. From the results of the connectivity evaluation, we obtain the candidate sequences of motion segments that satisfy both the rhythm-feature similarity requirement and the naturalness of the synthesized motion.

5.3.3 Similarity Measurement of Intensity Components
Next, we evaluate the intensity components of the candidate sequences of motion segments and the input music. In order to find a semi-optimal solution, we consider the time series of the intensity features as a histogram, and calculate the Bhattacharyya coefficient11) to evaluate the relative similarity between the motion and music intensity histograms.


Fig. 15 The synthesis result for the dance music “Kansho-odori.”

The motion segment sequence maximizing the coefficient becomes the final result, which satisfies both the rhythm and intensity similarities.

Finally, the resulting motion sequence is acquired by connecting the motion segments via a spline-based interpolation technique.
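A minimal sketch of the intensity comparison is shown below, under the assumption that the two intensity sequences are time-aligned and treated directly as histograms (normalized to sum to one) before taking the Bhattacharyya coefficient11).

```python
# Sketch of Section 5.3.3: the candidate motion-segment sequence maximizing this
# coefficient against the music intensity profile is selected.
import numpy as np

def bhattacharyya(motion_intensity, music_intensity):
    """Treat two equal-length intensity sequences as histograms over time."""
    p = np.asarray(motion_intensity, dtype=float)
    q = np.asarray(music_intensity, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))               # 1.0 means identical shapes
```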

5.4 Experiments
We experimented with our proposed method using our motion database, which consists of several Japanese dance motion sequences.

Figure 15 shows the synthesized motion for music of the Japanese dance called Kansho-odori. Figure 16 shows the features of the synthesized motion and the input music. We can easily confirm that most of the musical rhythm is matched to the motion rhythm, and that the distributions of the intensity components are quite similar. Some artists working with the Japanese dance performance group Warabi-za told us that our technique would be quite useful for a performance group such as theirs to compose new Japanese dances, and that it would be a new way to re-use dance motion data.

Fig. 16 The feature matching result for the music "Kansho-odori." Yellow and light blue lines represent the motion and music rhythm components, and blue and red lines represent the motion and music intensity components.

Our proposed method is applicable not only to Japanese folk dance but also to different styles of dance such as break dancing.

6. Concluding Remarks

The ultimate goal of our research is to develop a dancing entertainment robot that achieves the sound feedback system. For this purpose, this paper proposed methods to apply human perceptual models to human motion synthesis. Our research is strongly motivated by the fact that most previous work did not consider the perceptions of performers, although human motion is in fact highly affected by these perceptions. We have developed three fundamental methods for this purpose.

The first aspect of our proposed method, as described in Section 3, is to analyze the keyposes in dance motion. We exploit the relationship between motion rhythm and musical rhythm by detecting the stop motions in the dance motion data and by estimating the musical rhythm itself, in the form of its onset components. Dancers themselves have corroborated the results of our methods.

The second aspect of our proposed method, as described in Section 4, is to model how upper body motion can be modified depending on musical speed. Research in this arena is motivated by the observation that, as music speeds up, dancers omit details of dances to keep up with the musical rhythm. Using the insights obtained through this observation, we modeled our proposed algorithm for modification of upper body motion based on music speed.

The third aspect of our proposed method, as described in Section 5, is to synthesize dance motion using mood features. Our algorithm is designed such that the motion rhythm is synchronized with the musical rhythm, and the motion intensity is synchronized with the musical intensity.

As future work, we will try to extend our proposed methods to a real-time sound feedback system for dancing CG characters and humanoid robots. In addition, we are considering devising an evaluation method for our work via subject tests. Perceptions of human motion depend upon many characteristics, such as geometric models6) and the naturalness of motion after editing, and such evaluation is a very difficult problem23). We plan to start with subject tests to verify our assumptions, and to improve the proposed method.

Acknowledgements This work was supported in part by the Japan Science and Technology Corporation (JST) under the CREST project and by Research Fellowships of the Japan Society for the Promotion of Science (JSPS).

References

1) Alankus, G., Bayazit, A.A. and Bayazit, O.B.: Automated Motion Synthesis for Dancing Characters, Computer Animation and Virtual Worlds, Vol.16, No.3-4, pp.259–271 (2005).

2) Bruderlin, A. and Williams, L.: Motion Signal Processing, Proc. ACM SIGGRAPH 95, pp.97–104 (1995).

3) Flash, T. and Hogan, N.: The Coordination of Arm Movements: An Experimentally Confirmed Mathematical Model, Journal of Neuroscience, Vol.5, pp.1688–1703 (1985).

4) Gleicher, M.: Retargetting Motion to New Characters, Proc. ACM SIGGRAPH 98, pp.33–42 (1998).

5) Goto, M.: An Audio-based Real-time Beat Tracking System for Music With or Without Drum-sounds, Journal of New Music Research, Vol.30, No.2, pp.159–171 (2001).

6) Hodgins, J.K., O'Brien, J.F. and Tumblin, J.: Perception of Human Motion with Different Geometric Models, IEEE Trans. on Visualization and Computer Graphics, Vol.4, No.4, pp.307–316 (1998).

7) Inamura, T., Toshima, I., Tanie, H. and Nakamura, Y.: Embodied Symbol Emergence Based on Mimesis Theory, Int'l Journal of Robotics Research, Vol.23, No.4, pp.363–377 (2004).

8) Jenkins, O.C. and Mataric, M.J.: Deriving Action and Behavior Primitives from Human Motion Data, Proc. IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems, pp.2551–2556 (2002).

9) Kahol, K., Tripathi, P. and Panchanathan, S.: Gesture Segmentation in Complex Motion Sequences, Proc. IEEE Int'l Conf. on Image Processing, Vol.2, pp.105–108 (2003).

10) Kahol, K., Tripathi, P. and Panchanathan, S.: Documenting Motion Sequences: Development of a Personalized Annotation System, IEEE Multimedia Magazine, Vol.13, No.1, pp.37–45 (2006).

11) Kailath, T.: The Divergence and Bhattacharyya Distance Measures in Signal Selection, IEEE Trans. on Communication Technology, Vol.COM-15, pp.52–60 (1967).

12) Kim, T., Park, S.I. and Shin, S.Y.: Rhythmic-motion Synthesis Based on Motion-beat Analysis, ACM Trans. on Graphics (Proc. ACM SIGGRAPH 2003), Vol.22, No.3, pp.392–401 (2003).

13) Lee, H.-C. and Lee, I.-K.: Automatic Synchronization of Background Music and Motion in Computer Animation, Computer Graphics Forum (Proc. Eurographics 2005), Vol.24, No.3, pp.353–361 (2005).

14) Lee, S., Wolberg, G. and Shin, S.Y.: Scattered Data Interpolation with Multilevel B-Splines, IEEE Trans. on Visualization and Computer Graphics, Vol.3, No.3, pp.228–244 (1997).

15) McCann, J., Pollard, N.S. and Srinivasa, S.: Physics-Based Motion Retiming, Proc. ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp.205–214 (2006).

16) Nakaoka, S., Nakazawa, A., Kanehiro, F., Kaneko, K., Morisawa, M. and Ikeuchi, K.: Task Model of Lower Body Motion for a Biped Humanoid Robot to Imitate Human Dances, Proc. IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems, pp.3157–3162 (2005).

17) Nakaoka, S., Nakazawa, A., Kanehiro, F., Kaneko, K., Morisawa, M., Hirukawa, H. and Ikeuchi, K.: Learning from Observation Paradigm: Leg Task Models for Enabling a Biped Humanoid Robot to Imitate Human Dances, Int'l Journal of Robotics Research, Vol.26, No.8, pp.829–844 (2007).

18) Nakata, T., Mori, T. and Sato, T.: Analysis of Impression of Robot Bodily Expression, Journal of Robotics and Mechatronics, Vol.14, No.1, pp.27–36 (2002).

19) Nakazawa, A., Nakaoka, S., Ikeuchi, K. and Yokoi, K.: Imitating Human Dance Motions through Motion Structure Analysis, Proc. IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems, pp.2539–2544 (2002).

20) Ogawara, K., Takamatsu, J., Iba, S., Tanuki, T., Kimura, H. and Ikeuchi, K.: Acquiring Hand-Action Models in Task and Behavior Levels by a Learning Robot through Observing Human Demonstrations, Proc. IEEE-RAS Int'l Conf. on Humanoid Robots (2000).

21) Osaki, R., Shimada, M. and Uehara, K.: Extraction of Primitive Motions by Using Clustering and Segmentation of Motion-Captured Data, Journal of Japanese Society for Artificial Intelligence, Vol.15, No.5, pp.878–886 (2000). [in Japanese]

22) Pollard, N.S., Hodgins, J.K., Riley, M.J. and Atkeson, C.G.: Adapting Human Motion for the Control of a Humanoid Robot, Proc. IEEE Int'l Conf. on Robotics and Automation, pp.1390–1397 (2002).

23) Ren, L., Patrick, A., Efros, A.A., Hodgins, J.K. and Rehg, J.M.: A Data-Driven Approach to Quantifying Natural Human Motion, ACM Trans. on Graphics (Proc. ACM SIGGRAPH 2005), Vol.24, No.3, pp.1090–1097 (2005).

24) Roads, C.: The Computer Music Tutorial, The MIT Press (1996).

25) Shiratori, T., Kudoh, S., Nakaoka, S. and Ikeuchi, K.: Temporal Scaling of Upper Body Motion for Sound Feedback System of a Dancing Humanoid Robot, Proc. IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems (2007).

26) Shiratori, T., Nakazawa, A. and Ikeuchi, K.: Detecting Dance Motion Structure through Music Analysis, Proc. IEEE Int'l Conf. on Automatic Face and Gesture Recognition, pp.857–862 (2004).

27) Shiratori, T., Nakazawa, A. and Ikeuchi, K.: Dancing-to-Music Character Animation, Computer Graphics Forum (Proc. Eurographics 2006), Vol.25, No.3, pp.449–458 (2006).

28) Takamatsu, J., Tominaga, H., Ogawara, K., Kimura, H. and Ikeuchi, K.: Symbolic Representation of Trajectories for Skill Generation, Proc. IEEE Int'l Conf. on Robotics and Automation, pp.4077–4082 (2000).

29) Wang, J., Drucker, S.M., Agrawala, M. and Cohen, M.F.: The Cartoon Animation Filter, ACM Trans. on Graphics (Proc. ACM SIGGRAPH 2006), Vol.25, No.3, pp.1169–1173 (2006).

30) Wang, M., Lu, L. and Zhang, H.-J.: Repeating Pattern Discovery from Acoustic Musical Signals, Proc. IEEE Int'l Conf. on Multimedia and Expo, pp.2019–2022 (2004).

(Received September 10, 2007)
(Accepted March 9, 2008)

(Editor in Charge: Nobutaka Shimada)

Takaaki Shiratori received a B.E. in information and communication technology from The University of Tokyo, and M.I.S.T. and Ph.D. degrees, both in information and communication technology, from The University of Tokyo in 2004 and 2007, respectively. He is currently a Postdoctoral Fellow in the School of Computer Science at Carnegie Mellon University. His research interests include human motion synthesis for computer animation and robotics. He is a member of the IEEE.

Katsushi Ikeuchi received a B.E. in mechanical engineering from Kyoto University, Kyoto, Japan, in 1973, and a Ph.D. in information engineering from The University of Tokyo, Tokyo, Japan, in 1978. He is a professor at the Interfaculty Initiative in Information Studies, The University of Tokyo, Tokyo, Japan. After working at the AI Laboratory at the Massachusetts Institute of Technology for three years, the Electrotechnical Laboratory for five years, and the School of Computer Science at Carnegie Mellon University for 10 years, he joined The University of Tokyo in 1996. He was selected as a Distinguished Lecturer of the IEEE Signal Processing Society for the period 2000–2001, and a Distinguished Lecturer of the IEEE Computer Society for the period 2004–2006. He has received several awards, including the David Marr Prize at ICCV 1990, the IEEE R&A K–S Fu Memorial Best Transaction Paper Award in 1998, and best paper awards at CVPR 1991, VSMM 2000, and VSMM 2004. In addition, in 1992, his paper, "Numerical Shape from Shading and Occluding Boundaries," was selected as one of the most influential papers to have appeared in the Artificial Intelligence Journal within the past 10 years. He is a fellow of the IEEE.