
DeepSSM: Deep State-Space Model for 3D Human Motion Prediction

Xiaoli Liu^1, Jianqin Yin^1, Huaping Liu^2 and Jun Liu^3

Abstract— Predicting future human motion plays a significant role in human-machine interaction for a variety of real-life applications. In this paper, we build a deep state-space model, DeepSSM, to predict future human motion. Specifically, we formulate the human motion system as the state-space model of a dynamic system and model the motion system with state-space theory, offering a unified formulation for diverse human motion systems. Moreover, a novel deep network is designed to build this system, enabling us to utilize the advantages of both deep networks and state-space models. The deep network jointly models the process of both the state-state transition and the state-observation transition of the human motion system, and multiple future poses can be generated recursively via the state-observation transition of the model. To improve the modeling ability of the system, a unique loss function, ATPL (Attention Temporal Prediction Loss), is introduced to optimize the model, encouraging the system to achieve more accurate predictions by paying increasing attention to the early time steps. Experiments on two benchmark datasets (i.e., Human3.6M and 3DPW) confirm that our method achieves state-of-the-art performance with improved effectiveness. The code will be available if the paper is accepted.

I. INTRODUCTION

Humans’ recognition of and interaction with the real world rely on the ability to predict surrounding changes over time [1]. Similarly, intelligent robots that interact with people must be able to predict the future dynamics of humans, enabling the robots to respond rapidly to human changes [2], [3], [4]. In this paper, as shown in Figure 1(a), we focus on predicting human motion from 3D joint position data, aiming to predict future poses based on the observed poses.

The human motion system is a typical dynamic system, and formulating it as a state-space model of the dynamic system brings a major advantage: the state-space model provides a uniform formalization for a series of diverse human motion systems [5], [6]. “State” and “observation” are two key concepts of the state-space model: the state reveals the internal state of the system, and the observation reflects the output of the system.

*This work was supported partly by the National Natural Science Foundation of China (Grant No. 61673192) and the Fundamental Research Funds for the Central Universities (Grant No. 2020XD-A04-1, 2019RC27).

1 Xiaoli Liu and Jianqin Yin are with the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, No. 10 Xitucheng Road, Haidian District, Beijing 100876, China. E-mail: [email protected], [email protected]

2 Huaping Liu is with the Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. E-mail: [email protected]

3 Jun Liu is with the Department of Mechanical Engineering, City University of Hong Kong, 83 Tat Chee Ave., Kowloon, Hong Kong SAR, China. E-mail: [email protected].

Jianqin Yin and Jun Liu are the corresponding authors.


Fig. 1. Human motion prediction. (a) Human motion prediction. (b) Joint trajectories along each coordinate (i.e., x, y, and z), where the horizontal axis denotes frames, the vertical axis denotes the coordinate value, one curve denotes one joint trajectory along one coordinate, and joint trajectories of the same body part are marked with the same color.

The process of the model includes the “state-state transition” and the “state-observation transition”. The state-state transition reflects the changes of the internal state of the system, and the state-observation transition reveals the correlations between the output and the internal state of the system. For the human motion prediction system, the positions and velocities of moving poses can be incorporated as the observation, and the future poses can be generated via the state-observation transition. Moreover, deep networks have shown strong representation ability in many computer vision tasks [7], [8], [9]. Therefore, in this paper, we develop a deep state-space model, DeepSSM, for predicting future human motion, making full use of the merits of both the deep network and the state-space model. Specifically, we initialize the state of the model by utilizing a deep network to extract the motion dynamic law of human motion at both the coordinate level and the joint level, as shown in Fig. 1(b). Then, multiple future poses can be predicted recursively via the state-observation transition. The proposed deep state-space model is also general and powerful, providing a theoretical basis for the analysis and interpretability of other existing human motion prediction models [10], [11], [12].

Multi-order information, including positions, velocities, etc., carries rich motion dynamics that are useful for predicting human motion. For example, using the current positions and velocities of moving poses, a person's pose at the next time step can be easily determined [13], [14]. Most of the existing literature focused on modeling the zero-order information (e.g., joint positions) of the pose and neglected the importance of learning high-order information such as velocities [15], [16]. Recently, some works noticed the importance of velocities and implicitly modeled the future velocities as the internal state of the human motion system via residual connections in the decoder [17], [18], [19]. Unlike prior work, we incorporate the positions and velocities as the observation of the state-space model, directly representing the output of the system. With this setting, the second-order information, i.e., acceleration, is also incorporated into the system.


Most of the prediction models were optimized simply using an L2 [15], [17], [20] or MPJPE (Mean Per Joint Position Error) [16], [21] loss over all frames or all joints of the future poses. Since the loss value at the early time steps is smaller than that at the later time steps, these models implicitly focused on the predictions of the later time steps, ignoring the relationship between the early and later time steps, i.e., that the early predictions tend to affect the predictions of the later time steps in a recursive model. Therefore, these models failed to achieve accurate predictions, especially in the recursive prediction setting. To address this problem, we propose to pay more attention to the predictions of the early time steps, encouraging the network to achieve more accurate predictions there.

Our main contributions can be summarized as follows. (1) We formulate the human motion system under the state-space theory of dynamic systems, providing a unified formulation for existing human motion systems. (2) We build a deep state-space model, DeepSSM, that uses a deep network to solve state-space-model-based motion prediction. The proposed model uses the deep network to jointly learn the two major transition functions of the state-space model, i.e., the state-state transition and the state-observation transition, combining the strong representation ability of deep networks with the merits of the uniform formulation and multi-order modeling of the state-space model. (3) An attention-guided loss, ATPL, is designed to optimize the proposed model with increasing weights on the early predictions, guiding the model toward more accurate predictions at the early time steps.

II. RELATED WORK

A. Human motion prediction

Human motion data is a typical type of time-series data, and RNNs (Recurrent Neural Networks) have shown strong ability in processing time-series data; therefore, a natural line of methods was proposed based on RNNs for predicting future human motion [20], [22], [23], [24]. For example, Fragkiadaki et al. [20] proposed an encoder-recurrent-decoder (ERD) model by incorporating a nonlinear encoder and decoder before and after LSTM cells, and future poses were recursively predicted via the state-state transitions of the inherent recurrent units of the LSTM cells. Due to the error accumulation inherent in RNNs, these models easily converged to the performance of mean poses [11], [25], [26]. Moreover, human movements are constrained by the physical structure of the human body, and traditional RNN models may not capture the correlations among the joints of the human body well. Therefore, other RNN models incorporated skeletal representations, such as the Lie algebra representation, to model the correlations among joints [24], [25]. Liu et al. [24] proposed a novel model, HMR (Hierarchical Motion Recurrent), to anticipate future motion sequences; the authors modeled the global and local motion contexts by using LSTMs hierarchically and captured the spatial correlations by representing the skeletal frames with the Lie algebra representation.

Another type of method for predicting future human motion is built with deep feedforward networks [16], [27]. For example, Butepage et al. [27] learned a generic representation from the input Cartesian skeletal data and predicted future 3D poses using feedforward neural networks. Mao et al. [16] proposed a feedforward model for predicting future 3D poses that also achieved state-of-the-art performance; the authors modeled the temporal dependencies of human motion using the DCT (Discrete Cosine Transform) and captured the spatial structure of the human body by representing its joints as a graph processed with a GCN (Graph Convolutional Network).

Positions and velocities jointly determine the state of the human body at the next time step to a great extent. However, most of the above-mentioned models focused on modeling the positions of the human body and ignored the modeling of velocities. Recent work noticed the importance of modeling the velocities of human motion [17], [18], [12]. Martinez et al. [17] and Li et al. [18], [12] introduced a residual connection between the input and output of the decoder to implicitly model the velocities as the internal state of their human motion system. Chiu et al. [28] also predicted future poses by modeling the velocities of human motion; however, the authors ignored the modeling of positions. Different from these prior works, we incorporate the positions and velocities as the observation and directly model the positions and velocities of the human body via the state-observation transition of our proposed deep state-space model.

B. State-space models for time-series problems

The state-space model offers a unified formulation for a series of time-series models, e.g., human motion systems [5], [6], [29], and it can also be applied to analyze existing sequence models [27], [22]. For example, Karl et al. [29] proposed Deep Variational Bayes Filters under the assumptions of a latent state-space model for reliable system identification; this model also potentially provides system theory for downstream tasks. [6] and [30] also proposed deep state-space models for sequence tasks, utilizing the advantages of both deep networks and state-space models. In this paper, we formulate the problem of human motion with a deep state-space model, providing a unified formulation for diverse human motion systems. In contrast to [6] and [30], our model focuses on predicting future human motion, while [30] focused on probabilistic time-series forecasting and [6] focused on recognizing the action labels of skeletal sequences.

III. METHODOLOGY

A. Skeletal Representation

Given an input sequence S_1 = {p(t_0)} (t_0 = −(T_1−1), ..., −1, 0) with length T_1, where p(t_0) denotes the pose of sequence S_1 at the t_0-th time step, the velocities of sequence S_1 are defined as V_1 = {v(t_0)}, where v(t_0) = p(t_0) − p(t_0−1) and v(−(T_1−1)) = {0}. In this paper, we introduce a skeletal representation that better captures the dynamic features by representing the input sequence in both position space and velocity space. In the position space, as shown in Figure 1(b), since motion trajectories along each coordinate vary greatly, the input sequence S_1 is represented by three 2D tensors, S_{1x}, S_{1y} and S_{1z}, to conveniently capture the coordinate-level features, representing the trajectories of the sequence along the x, y, and z axes, respectively. Similarly, in the velocity space, V_1 is represented by three 2D tensors, V_{1x}, V_{1y} and V_{1z}, representing the velocity information along the x, y, and z axes, respectively. In our skeletal representation, the width denotes frames, the height denotes joints, and the order of joints is consistent with [31] to conveniently capture the local characteristics of the human body [31], [32].
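For concreteness, the sketch below shows one way to build this representation with NumPy; the function name and exact array handling are illustrative assumptions, not the authors' released code, though the layout (height = joints, width = frames) follows the description above.

```python
import numpy as np

def skeletal_representation(poses):
    """Build the per-coordinate position and velocity tensors (sketch).

    poses: array of shape (T1, Nj, 3) with 3D joint positions.
    Returns (S1x, S1y, S1z) and (V1x, V1y, V1z), each of shape (Nj, T1).
    """
    velocities = np.zeros_like(poses)
    velocities[1:] = poses[1:] - poses[:-1]  # v(t) = p(t) - p(t-1); v at the first frame stays 0
    S = tuple(poses[:, :, c].T for c in range(3))       # trajectories along x, y, z
    V = tuple(velocities[:, :, c].T for c in range(3))  # velocity tensors along x, y, z
    return S, V
```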

B. Problem formulation

The human motion system is a typical dynamic system, which can be represented by the state-space model of the dynamic system as equations 1 and 2 [13], [14].

I(t+1) = f_1(I(t), t) + a(t)    (1)

O(t) = f_2(I(t), t) + b(t)    (2)

where I(t) and O(t) are the state and observation at time t, respectively; a(t) and b(t) are the process and measurement noise, respectively; and f_1(·) and f_2(·) denote the system functions.

In this paper, the positions and velocities of human motion are incorporated as the observation, the motion dynamic law of a series of historical poses is set as the state of the state-space model, and a(t) and b(t) are both initialized to 0. The future sequence corresponding to S_1 is defined as S_2 = {p(t)} with length T_2, and its velocities are {v(t)} (t = 1, 2, ..., T_2), where p(t) denotes the t-th pose of sequence S_2 and v(t) denotes the velocity at the t-th time step. Therefore, the state I(t) and observation O(t) can be formulated as equations 3 and 4, respectively.

I(t) = {p(t−1), v(t−1), F(t−1)}    (3)

O(t) = {p(t), v(t)}    (4)

where O(0) is initialized to {p(0), 0}, and F(t−1) denotes other multi-order information of the historical poses at the (t−1)-th time step.

The state-space model can be considered as a two-stage system, comprising state-state transition and state-observation transition. (1) State-state transition: this stage updates the system state through the system function f_1(·) by generating the multi-order information of future poses and updating other multi-order information of previous poses, and it is learned automatically by our proposed network. (2) State-observation transition: this stage calculates the observation from the current state of the system through the system function f_2(·), and it is learned automatically by our decoder. Moreover, the current positions and velocities of the human body largely determine the positions of the human body at the next time step. Therefore, the future poses can be calculated by equation 5.

p(t) = p(t−1) + v(t−1)    (5)
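To make the two-stage recursion concrete, here is a minimal sketch of how equations 1, 2, and 5 drive prediction; `f1` and `f2` are placeholders standing in for the learned transition functions, not the paper's actual network.

```python
def rollout(state, p_last, T2, f1, f2):
    """Recursively generate T2 future poses from the state-space form (sketch)."""
    poses, p = [], p_last
    for t in range(1, T2 + 1):
        v = f2(state, t)      # state-observation transition yields the velocity part of O(t)
        p = p + v             # integrate the predicted velocity into a position, as in equation 5
        state = f1(state, t)  # state-state transition: I(t+1) = f_1(I(t), t)
        poses.append(p)
    return poses
```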

C. Deep State-Space Model

The architecture of DeepSSM is shown in Figure 2 and includes state initialization, state transition, and the loss. State initialization initializes the state of the system as the motion dynamic law of the input poses, and the state transition updates the state of the system and generates future poses.

1) Backbone layer: Inspired by [33], as shown in Figure 3, we propose a new backbone layer, the Densely Connected Convolutional Module (DCCM), to maximize information flow propagation layer by layer; it mainly consists of 5 convolutional layers. At each convolutional layer, the input receives enhanced features obtained by fusing the concatenated feature maps from all preceding layers using a 1×1 convolution. Here, joint-level features can be learned by the 1×1 convolutions of the residual connections. Therefore, the dense residual connections in DCCM allow the network to gradually obtain the enhanced features of deep layers by fusing the joint-level features of shallow layers, which can be formulated as equation 6.

M_l = H_b([g_0(M_0), g_1(M_1), ..., g_{l−2}(M_{l−2}), M_{l−1}])    (6)

where M_l (l = 1, 2, ..., 5) denotes the output feature map of the l-th layer shown in Figure 3, H_b(·) denotes the fusion layer built with concatenation across channels followed by a 1×1 convolution, and g(·) denotes a 1×1 convolutional layer followed by an activation function (i.e., Leaky ReLU).
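A minimal Keras sketch of a DCCM consistent with equation 6 is given below; the channel widths, the placement of the fusion relative to the 3×3 convolution, and the omission of dropout (Figure 3 shows Dropout 0.75) are simplifying assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dccm(x, filters=64):
    """Densely Connected Convolutional Module (sketch of Eq. 6 / Fig. 3)."""
    maps = [x]  # M_0
    for l in range(1, 6):  # five convolutional layers
        if l == 1:
            h = maps[-1]
        else:
            # g_i: 1x1 conv + Leaky ReLU on each earlier output M_0 ... M_{l-2}
            g = [layers.Conv2D(filters, 1, activation=tf.nn.leaky_relu)(m) for m in maps[:-1]]
            h = layers.Concatenate()(g + [maps[-1]])  # concatenate with M_{l-1}
            h = layers.Conv2D(filters, 1)(h)          # fusion H_b via a 1x1 convolution
        h = layers.Conv2D(filters, 3, padding='same')(h)
        h = layers.LeakyReLU()(h)
        maps.append(h)  # M_l
    return maps[-1]
```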

2) State initialization: Based on the skeletal representation described above, as shown at the left of Figure 2, multi-branch networks built with DCCMs encode the motion dynamics of the input sequence at both the coordinate level and the joint level in position space and velocity space; they mainly consist of a pose branch, a velocity branch, and a fusion module. (1) In the pose branch, S_{1x}, S_{1y} and S_{1z} are fed into sub-branches built with 2 DCCMs each, enabling the network to capture coordinate-level features. Then, one DCCM is applied to obtain joint-level features by fusing the coordinate-level features. All sub-branches share weights to reduce model complexity and to model correlations among the x, y, and z coordinates. (2) In the velocity branch, similarly, V_{1x}, V_{1y} and V_{1z} are fed into sub-branches to gradually capture the multi-level features of the input sequence in the velocity space. (3) The fusion module fuses the motion dynamics captured by the pose branch and the velocity branch; it is built with concatenation along channels followed by a convolutional layer and Leaky ReLU. The output of this stage describes the motion dynamic features of the input sequence, which we mark as h(0) in Figure 2.


Fig. 2. The architecture of DeepSSM, where DCCM denotes our proposed building block shown in Figure 3, and (Conv, 3×3, N) denotes a 3×3 convolutional layer with N output channels.


Fig. 3. Densely Connected Convolutional Module (DCCM).

Therefore, F(0) is initialized to h(0), and the initial state ofthe system I(0) is set to {F(0)}.
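Building on the `dccm` sketch above, the state initialization could be wired up as follows. Note one caveat: in this functional form each `dccm` call creates fresh layers, whereas the paper shares weights across the x/y/z sub-branches, so a faithful implementation would reuse the same layer objects.

```python
def init_state(S_xyz, V_xyz, filters=64):
    """State initialization h(0) = F(0) from the two branches (sketch of Fig. 2).

    S_xyz / V_xyz: lists of the three per-coordinate inputs, each of shape
    (batch, Nj, T1, 1).
    """
    def branch(parts):
        coord = [dccm(dccm(p)) for p in parts]    # coordinate-level features: 2 DCCMs per sub-branch
        return dccm(layers.Concatenate()(coord))  # joint-level features via a fusion DCCM

    pose_feat = branch(S_xyz)                          # pose branch
    vel_feat = branch(V_xyz)                           # velocity branch
    h0 = layers.Concatenate()([pose_feat, vel_feat])   # fusion module: concat along channels,
    h0 = layers.Conv2D(filters, 3, padding='same')(h0) # then a convolutional layer
    return layers.LeakyReLU()(h0)                      # h(0), the initial state features
```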

3) State transition: As shown in Figure 2, a recursive decoder built with CNNs achieves the state transition, including both the state-state transition and the state-observation transition. In each decoder, because the historical information has a more complex structure while the current velocity is relatively simple, more operations are applied to the historical features and fewer to the current velocity; this information is then fused by element-wise summation. Finally, another convolutional layer and an FC (fully connected) layer are applied to predict the future velocity.

The state of the dynamic system I(t+1) and the observation O(t) are updated by equations 7 and 8, respectively.

I(t+1) = f_1(I(t), t)    (7)

O(t) = f_2(I(t), t)    (8)

where t = 1, 2, ..., T_2, and f_1(·) and f_2(·) denote two mappings learned by our proposed model. Here, F(t), contained in I(t+1), denotes the input state of the decoder at the t-th time step, capturing multi-order information of previous time steps. As shown in Figure 2, h(t) denotes the intermediate output of the decoder, capturing the historical motion dynamic features of previous poses. Due to the chain structure of the decoder, the historical features of earlier poses fade away over time. To memorize long-term dependencies for long-term prediction, F(t) is updated by sparsely aggregating the historical motion dynamic features of previous poses with h(t) according to equation 9.

F(t) = h(t), if t = 1 or t mod 2 = 1;
F(t) = H_m([h(1), h(3), ..., h(t−1), h(t) + h(0)]), otherwise.    (9)

where H_m(·) denotes a memory module built with concatenation across channels followed by two 3×3 convolutional layers.
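The sparse aggregation of equation 9 can be sketched as below; the kernel sizes and the two-convolution structure of the memory module H_m follow the description above, while the channel count and everything else are illustrative assumptions.

```python
def update_memory(h_hist, h0, filters=64):
    """Compute F(t) from decoder outputs h(1)..h(t) and h(0) (sketch of Eq. 9)."""
    t = len(h_hist)
    if t == 1 or t % 2 == 1:
        return h_hist[-1]                      # F(t) = h(t) for odd t
    kept = h_hist[0:-1:2] + [h_hist[-1] + h0]  # h(1), h(3), ..., h(t-1) and h(t) + h(0)
    x = layers.Concatenate()(kept)             # memory module H_m: concatenation
    x = layers.Conv2D(filters, 3, padding='same')(x)      # followed by
    return layers.Conv2D(filters, 3, padding='same')(x)   # two 3x3 convolutions
```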

4) Model optimization: To model multi-order information of future poses and achieve more accurate predictions, as shown in Figure 2, our loss L consists of two parts, L_v and L_p, combined as L = λ_1 L_v + λ_2 L_p, where λ_1 and λ_2 are two hyperparameters balancing L_v and L_p. (1) L_v guides the network to decode future velocities; (2) L_p encourages the network to restore future positions.

To mitigate error accumulation in a recursive model, we propose an attention temporal prediction loss (ATPL) that guides the model toward more accurate predictions by paying increasing attention to the early predictions. Taking L_v as an example, it is defined as equation 10.

L_v = (1/N_j) Σ_{t=1}^{T_2} α_t Σ_{j=1}^{N_j} ‖Ĵ_j^t − J_j^t‖_2    (10)

where N_j denotes the number of joints, T_2 is the number of future poses, and J_j^t and Ĵ_j^t denote the ground-truth and the predicted joint in the velocity space, respectively. α_t denotes the attention weight at the t-th time step, with α_t > α_{t+1}, which forces the network to pay more attention to the early predictions. Here, α_t is initialized to 2(T_2 − t + 1) and then normalized so that the weights sum to 1: α_t ← α_t / Σ_{k=1}^{T_2} α_k.

Similar to L_v, L_p can be calculated according to equation 10 in the position space.
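A direct TensorFlow sketch of equation 10 follows; the tensor layout (batch, T_2, N_j, 3) and a statically known horizon are assumptions for illustration.

```python
import tensorflow as tf

def atpl(y_true, y_pred):
    """Attention Temporal Prediction Loss (sketch of Eq. 10).

    y_true, y_pred: tensors of shape (batch, T2, Nj, 3), in velocity space
    for L_v or position space for L_p; T2 must be statically known.
    """
    T2 = y_true.shape[1]
    Nj = tf.cast(tf.shape(y_true)[2], tf.float32)
    alpha = tf.constant([2.0 * (T2 - t + 1) for t in range(1, T2 + 1)])
    alpha = alpha / tf.reduce_sum(alpha)           # decreasing weights, normalized to sum to 1
    per_joint = tf.norm(y_true - y_pred, axis=-1)  # L2 error per joint: (batch, T2, Nj)
    per_step = tf.reduce_sum(per_joint, axis=-1)   # sum over joints: (batch, T2)
    return tf.reduce_mean(tf.reduce_sum(alpha * per_step, axis=-1) / Nj)
```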

IV. EXPERIMENTS

A. Datasets and Implementation Details

Datasets: (1) Human3.6M (H3.6M) [21]: H3.6M is the most commonly used dataset for human motion prediction. The dataset consists of 15 actions performed by seven professional actors, such as walking, eating, smoking, and discussion. (2) 3D Poses in the Wild (3DPW) [34]: 3DPW is an in-the-wild dataset with accurate 3D poses covering various activities such as shopping and doing sports. The dataset includes 60 sequences with more than 51k frames.

Implementation Details: In all experiments, the experimental settings and data processing are consistent with the baselines [15], [16], [17]. Our model is implemented in TensorFlow.


MPJPE (Mean Per Joint Position Error), proposed in [21] and measured in millimeters, is used as the metric to evaluate the performance of our proposed method. All models are trained with the Adam optimizer, and the learning rate is initialized to 0.0001. λ_1 : λ_2 is set to 3:1.
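Putting the pieces together, the training objective and optimizer settings reported above could look like this; `atpl` refers to the loss sketch given earlier, and the function names are illustrative.

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def total_loss(v_true, v_pred, p_true, p_pred, lam1=3.0, lam2=1.0):
    # L = lambda_1 * L_v + lambda_2 * L_p with lambda_1 : lambda_2 = 3 : 1
    return lam1 * atpl(v_true, v_pred) + lam2 * atpl(p_true, p_pred)

def mpjpe(p_true, p_pred):
    # Mean Per Joint Position Error: average L2 distance (in mm) over joints and frames
    return tf.reduce_mean(tf.norm(p_true - p_pred, axis=-1))
```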

B. Comparison with the state of the art

Baselines: (1) RGRU [17] is built entirely on GRUs and uses residual connections to implicitly predict future velocities. (2) CS2S [15] is a CNN-based feedforward model that predicts multiple poses recursively. (3) DTraj [16] is the current state-of-the-art method for 3D human motion prediction, built with DCT and GCN.

Results on H3.6M: Table I reports the results for both short-term and long-term prediction on H3.6M. Our method outperforms all baselines for both short-term and long-term prediction at all time steps, showing the effectiveness of our proposed deep state-space model. Specifically, compared with the RNN baseline [17], the errors of our method decrease significantly. A possible reason is that our method directly models both the positions and velocities of the human body as the observation of our deep state-space model, while [17] implicitly modeled the velocities as the internal state of the system using residual connections in the decoder and, by relying on GRUs, ignored part of the spatial features among the joints of the human body. Compared with the other feedforward baselines [15], [16], our model also achieves the best results in most cases. This benefit is two-fold: (1) our method explicitly models the positions and velocities as the observation of the proposed state-space model, while the baselines [15], [16] suffered from a limited ability to model velocity via the residual connection between the input and output of their decoders; thus, our method can better capture the motion dynamics of human motion through velocity learning. (2) Our recursive decoder enables the network to achieve more accurate predictions by making full use of the predictions at the early time steps, while [16] predicted the future poses in a non-recursive manner.

Qualitative results on H3.6M are provided in Figures 4 and 5. Compared with [16], our method also achieves the best visual quality for both short-term and long-term predictions, again demonstrating its effectiveness. Specifically, for the left hand in Figure 4, the upper limbs in Figure 5(a), and the right hand in Figure 5(b), the results of our method are better than those of [16]. The main reasons are two-fold: (1) the superior performance greatly benefits from our proposed deep state-space model, which utilizes the merits of both deep networks and state-space models; (2) our model incorporates both positions and velocities as the observation and designs a novel temporal loss to encourage the model to achieve accurate predictions.

(a) Greeting

(b) Eating

Fig. 4. Qualitative results for short-term prediction on H3.6M. The black poses denote the ground truth, the blue poses denote the results of [16], and the red poses denote the results of our method (the same applies in Figure 5).

(a) SittingDown

(b) TakingPhoto

Fig. 5. Qualitative results for long-term prediction on H3.6M.

In contrast, [16] ignored the modeling of velocities and discarded the predictions at the early time steps. Consequently, our model can better capture motion dynamics for accurate predictions.

Results on 3DPW: Table II reports the results for short-term and long-term prediction on 3DPW. In general, our method consistently outperforms the baselines at all time steps for both short-term and long-term prediction, which further verifies the effectiveness of our proposed DeepSSM in modeling motion dynamics for accurate predictions.

C. Ablation analysis

In this section, we conduct ablation experiments to show the effectiveness of several components of DeepSSM, and the results are reported in Table III.

(1) Multi-order information: a) The experiments “#2”, “#3”, and “#9” show the effectiveness of modeling multi-order information of the input sequence. Compared with “#9”, the errors of “#2” and “#3” increase at all time steps, showing the effectiveness of capturing motion dynamics using multi-order information from both velocities and positions.

TABLE I
SHORT- AND LONG-TERM PREDICTION ON H3.6M, WHERE “MS” DENOTES “MILLISECONDS”.

Walking          80    160   320   400   560   1000
RGRU [17]       23.8  40.4  62.9  70.9  73.8  86.7
CS2S [15]       17.1  31.2  53.8  61.5  59.2  71.3
DTraj [16]       8.9  15.7  29.2  33.4  42.2  51.3
Ours             7.6  15.6  30.2  34.9  35.7  48.1

Eating           80    160   320   400   560   1000
RGRU [17]       17.6  34.7  71.9  87.7  101.3 119.7
CS2S [15]       13.7  25.9  52.5  63.3  66.5  85.4
DTraj [16]       8.8  18.9  39.4  47.2  56.5  68.6
Ours             7.8  15.9  33.9  42.5  58.5  71.4

Smoking          80    160   320   400   560   1000
RGRU [17]       19.7  36.6  61.8  73.9  85.0  118.5
CS2S [15]       11.1  21.0  33.4  38.3  42.0  67.9
DTraj [16]       7.8  14.9  25.3  28.7  32.3  60.5
Ours             6.4  13.1  24.2  29.6  33.0  57.3

Discussion       80    160   320   400   560   1000
RGRU [17]       31.7  61.3  96.0  103.5 120.7 147.6
CS2S [15]       18.9  39.3  67.7  75.7  84.1  116.9
DTraj [16]       9.8  22.1  39.6  44.1  70.4  103.5
Ours             8.6  21.3  37.9  43.4  70.6  109.2

Directions       80    160   320   400   560   1000
RGRU [17]       36.5  56.4  81.5  97.3  —     —
CS2S [15]       22.0  37.2  59.6  73.4  —     —
DTraj [16]      12.6  24.4  48.2  58.4  85.8  109.3
Ours             9.7  21.8  47.1  57.5  81.3  104.5

Greeting         80    160   320   400   560   1000
RGRU [17]       37.9  74.1  139.0 158.8 —     —
CS2S [15]       24.5  46.2  90.0  103.1 —     —
DTraj [16]      14.5  30.5  74.2  89.0  91.8  87.4
Ours            12.5  27.4  68.0  82.8  93.2  89.5

Phoning          80    160   320   400   560   1000
RGRU [17]       25.6  44.4  74.0  84.2  —     —
CS2S [15]       17.2  29.7  53.4  61.3  —     —
DTraj [16]      11.5  20.2  37.9  43.2  65.0  113.6
Ours            10.6  19.2  35.7  42.5  65.0  113.7

Posing           80    160   320   400   560   1000
RGRU [17]       27.9  54.7  131.3 160.8 —     —
CS2S [15]       16.1  35.6  86.2  105.6 —     —
DTraj [16]       9.4  23.9  66.2  82.9  113.4 220.6
Ours             7.3  21.3  63.9  80.1  115.7 210.3

Purchases        80    160   320   400   560   1000
RGRU [17]       40.8  71.8  104.2 109.8 —     —
CS2S [15]       29.4  54.9  82.2  93.0  —     —
DTraj [16]      19.6  38.5  64.4  72.2  94.3  130.4
Ours            17.9  39.2  64.8  75.8  85.9  120.5

Sitting          80    160   320   400   560   1000
RGRU [17]       34.5  69.9  126.3 141.6 —     —
CS2S [15]       19.8  42.4  77.0  88.4  —     —
DTraj [16]      10.7  24.6  50.6  62.0  79.6  114.9
Ours             9.7  23.3  48.2  61.6  82.6  116.4

SittingDown      80    160   320   400   560   1000
RGRU [17]       28.6  55.3  101.6 118.9 —     —
CS2S [15]       17.1  34.9  66.3  77.7  —     —
DTraj [16]      11.4  27.6  56.4  67.6  82.6  140.1
Ours            10.3  26.2  51.7  61.3  79.0  131.2

TakingPhoto      80    160   320   400   560   1000
RGRU [17]       23.6  47.4  94.0  112.7 —     —
CS2S [15]       14.0  27.2  53.8  66.2  —     —
DTraj [16]       6.8  15.2  38.2  49.6  68.9  87.1
Ours             5.2  14.2  38.9  49.9  68.7  86.8

Waiting          80    160   320   400   560   1000
RGRU [17]       29.5  60.5  119.9 140.6 —     —
CS2S [15]       17.9  36.5  74.9  90.7  —     —
DTraj [16]       9.5  22.0  57.5  73.9  100.9 167.6
Ours             8.1  20.3  52.3  67.0  90.2  162.7

WalkingDog       80    160   320   400   560   1000
RGRU [17]       60.5  101.9 160.8 188.3 —     —
CS2S [15]       40.6  74.7  116.6 138.7 —     —
DTraj [16]      32.2  58.0  102.2 122.7 136.6 174.3
Ours            21.9  48.9  89.8  105.4 139.8 191.8

WalkingTogether  80    160   320   400   560   1000
RGRU [17]       23.5  45.0  71.3  82.8  —     —
CS2S [15]       15.0  29.9  54.3  65.8  —     —
DTraj [16]       8.9  18.4  35.3  44.3  57.0  85.0
Ours             6.8  15.7  31.7  41.1  58.0  77.7

Average          80    160   320   400   560   1000
RGRU [17]       30.8  57.0  99.8  115.5 —     —
CS2S [15]       19.6  37.8  68.1  80.2  —     —
DTraj [16]      12.1  25.0  51.0  61.3  78.5  114.3
Ours            10.0  22.9  47.9  58.4  77.1  112.7

TABLE II
SHORT- AND LONG-TERM PREDICTION ON 3DPW.

Milliseconds   200    400    600    800    1000
RGRU [17]     113.9  173.1  191.9  201.1  210.7
CS2S [15]      71.6  124.9  155.4  174.7  187.5
DTraj [16]     35.6   67.8   90.6  106.9  117.8
Ours           30.3   62.1   88.4  102.1  110.0

Without modeling the velocities of the input poses, the errors of “#3” increase significantly at the early time steps, showing the importance of modeling velocities for short-term prediction. b) The experiments “#4”, “#5”, and “#9” show the effectiveness of modeling multi-order information of the future poses. Similarly, the errors of “#4” and “#5” are larger than those of “#9” at all time steps. Combining L_p and L_v guides the network to model multi-order information of future poses, leading to better results. Moreover, the errors of “#5” increase significantly, showing the importance of modeling the velocities of future poses. c) The experiments “#7”, “#8”, and “#9” further show the effectiveness of incorporating both positions and velocities as the observation of the deep state-space model. Removing the modeling of either positions or velocities reduces the performance of the system, especially velocities, showing the importance of velocity learning.

In conclusion, incorporating the positions and velocities of the human body enables the deep model to learn a robust motion dynamic law representing the internal state of the deep state-space model, and it further improves the system's ability to make accurate predictions.

(2) Multi-level features: comparing “#1” and “#9”, the errors of “#1” increase at all time steps, showing the importance of modeling motion dynamics at both the coordinate level and the joint level.

(3) ATPL: compared with “#6”, the errors of “#9” decrease significantly, especially at the later time steps.

TABLE III
ABLATION RESULTS ON H3.6M, WHERE “MS” DENOTES “MILLISECONDS”, “xyz” DENOTES “MODELING THE COORDINATE-LEVEL FEATURES OF POSES”, “PB” DENOTES “POSE BRANCH”, AND “VB” DENOTES “VELOCITY BRANCH”. “✗” MARKS THE COMPONENT REMOVED IN EACH VARIANT.

#   xyz  pb   vb   Lp   Lv   ATPL   80ms  160ms  320ms  400ms  Average
1   ✗                               10.1  23.2   49.2   60.1   35.7
2        ✗                           9.9  23.4   50.3   61.7   36.3
3             ✗                     10.7  23.9   49.0   59.4   35.8
4                  ✗                10.0  23.0   48.4   59.1   35.1
5                       ✗           10.2  23.5   49.3   60.1   35.8
6                            ✗      10.6  23.7   48.8   59.2   35.6
7        ✗         ✗                10.1  24.0   51.6   62.4   37.0
8             ✗         ✗           13.7  27.6   55.5   66.8   40.9
9                                   10.0  22.9   47.9   58.4   34.8

ATPL, with increasing attention to the early predictions, guides the network to predict more accurate results at the early time steps, which can potentially mitigate error accumulation and further enhance the overall performance.

V. CONCLUSIONS

In this paper, we reformulate the human motion system as a deep state-space model, utilizing the merits of both deep representations and the state-space model; this provides a unified formulation for various human motion systems and can also be used to analyze prior models. Furthermore, an end-to-end feedforward network is presented to build this model, jointly achieving the state initialization and state transition of the system. Moreover, the proposed ATPL effectively guides the recursive model to achieve more accurate predictions. Finally, we evaluate our model on two challenging datasets, where it achieves state-of-the-art performance. The experiments also show that the coordinate-level features of human motion can further improve the performance of the system.

REFERENCES

[1] Y. Kong and Y. Fu, “Human action recognition and prediction: A survey,” arXiv preprint arXiv:1806.11230, 2018.

[2] E. Aksan, M. Kaufmann, and O. Hilliges, “Structured prediction helps 3D human motion modelling,” in The IEEE International Conference on Computer Vision (ICCV), 2019, pp. 7144–7153.

[3] Y. T. Xu, Y. Li, and D. Meger, “Human motion prediction via pattern completion in latent representation space,” in 2019 16th Conference on Computer and Robot Vision (CRV). IEEE, 2019, pp. 57–64.

[4] X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin, Y. Chen, Z. Jie, J. Feng, and S. Yan, “Predicting scene parsing and motion dynamics in the future,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 6915–6924.

[5] R. Roesser, “A discrete state-space model for linear image processing,” IEEE Transactions on Automatic Control, vol. 20, no. 1, pp. 1–10, 1975.

[6] K. Kawamura, T. Matsubara, and K. Uehara, “Deep state-space model for noise tolerant skeleton-based action recognition,” IEICE Transactions on Information and Systems, vol. 103, no. 6, pp. 1217–1225, 2020.

[7] L.-Y. Gui, Y.-X. Wang, D. Ramanan, and J. M. Moura, “Few-shot human motion prediction via meta-learning,” in European Conference on Computer Vision (ECCV), 2018.

[8] Y. Li, B. Ji, X. Shi, J. Zhang, B. Kang, and L. Wang, “TEA: Temporal excitation and aggregation for action recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 909–918.

[9] P. Kratzer, M. Toussaint, and J. Mainprice, “Prediction of human full-body movements with motion optimization and recurrent neural networks,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 1792–1798.

[10] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-RNN: Deep learning on spatio-temporal graphs,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[11] A. Gopalakrishnan, A. Mali, D. Kifer, C. L. Giles, and A. G. Ororbia, “A neural temporal model for human motion prediction,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[12] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, “Symbiotic graph neural networks for 3D skeleton-based human action recognition and motion prediction,” arXiv preprint arXiv:1910.02212, 2019.

[13] X. R. Li and V. P. Jilkov, “Survey of maneuvering target tracking. Part I: Dynamic models,” IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 4, pp. 1333–1364, 2003.

[14] J. K. Aggarwal and Q. Cai, “Human motion analysis: A review,” Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428–440, 1999.

[15] C. Li, Z. Zhang, W. S. Lee, and G. H. Lee, “Convolutional sequence to sequence model for human dynamics,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[16] W. Mao, M. Liu, M. Salzmann, and H. Li, “Learning trajectory dependencies for human motion prediction,” in The IEEE International Conference on Computer Vision (ICCV), 2019, pp. 9489–9497.

[17] J. Martinez, M. J. Black, and J. Romero, “On human motion prediction using recurrent neural networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[18] M. Li, S. Chen, Y. Zhao, Y. Zhang, Y. Wang, and Q. Tian, “Dynamic multiscale graph neural networks for 3D skeleton based human motion prediction,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 214–223.

[19] H. Wang and J. Feng, “VRED: A position-velocity recurrent encoder-decoder for human motion prediction,” arXiv preprint arXiv:1906.06514, 2019.

[20] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent network models for human dynamics,” in The IEEE International Conference on Computer Vision (ICCV), 2015.

[21] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325–1339, Jul. 2014.

[22] X. Guo and J. Choi, “Human motion prediction via learning local structure representations and temporal dependencies,” in AAAI Conference on Artificial Intelligence (AAAI), 2019.

[23] A. Hernandez, J. Gall, and F. Moreno-Noguer, “Human motion prediction via spatio-temporal inpainting,” in The IEEE International Conference on Computer Vision (ICCV), 2019, pp. 7134–7143.

[24] Z. Liu, S. Wu, S. Jin, Q. Liu, S. Lu, R. Zimmermann, and L. Cheng, “Towards natural and accurate future motion prediction of humans and animals,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[25] L.-Y. Gui, Y.-X. Wang, X. Liang, and J. M. Moura, “Adversarial geometry-aware human motion prediction,” in European Conference on Computer Vision (ECCV), 2018.

[26] H. Wang and J. Feng, “VRED: A position-velocity recurrent encoder-decoder for human motion prediction,” arXiv preprint arXiv:1906.06514, 2019.

[27] J. Butepage, M. J. Black, D. Kragic, and H. Kjellstrom, “Deep representation learning for human motion prediction and classification,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[28] H.-k. Chiu, E. Adeli, B. Wang, D.-A. Huang, and J. C. Niebles, “Action-agnostic human pose forecasting,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2019.

[29] M. Karl, M. Soelch, J. Bayer, and P. Van der Smagt, “Deep variational Bayes filters: Unsupervised learning of state space models from raw data,” arXiv preprint arXiv:1605.06432, 2016.

[30] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski, “Deep state space models for time series forecasting,” in Advances in Neural Information Processing Systems, 2018, pp. 7785–7794.

[31] X. Liu, J. Yin, H. Liu, and Y. Yin, “PISEP2: Pseudo image sequence evolution based 3D pose prediction,” arXiv preprint arXiv:1909.01818, 2019.

[32] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1110–1118.

[33] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.

[34] T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll, “Recovering accurate 3D human pose in the wild using IMUs and a moving camera,” in The European Conference on Computer Vision (ECCV), 2018, pp. 601–617.