neural dynamics on complex networks...we ﬁrst introduce a differential equation system which mod-...

Neural Dynamics on Complex Networks

Chengxi Zang and Fei WangHealthcare Policy and Research, Weill Cornell [email protected], [email protected]

New York City, New York, USA

AbstractLearning dynamics on complex networks governed by dif-ferential equation systems is crucial for understanding, pre-dicting, and controlling complex systems in science and en-gineering. However, this task is very challenging due to theintrinsic complexities in the structures of the high dimensionalsystems, their elusive continuous-time nonlinear dynamics,and their structural-dynamic dependencies. To address thesechallenges, we propose a differential deep learning model tolearn continuous-time dynamics on complex networks in adata-driven way. We model differential equation systems bygraph neural networks. Instead of mapping through a discretenumber of hidden layers in the forward process, we solvethe initial value problem by integrating the neural differentialequation systems over time numerically. In the backward pro-cess, we learn the optimal parameters by back-propagatingagainst the forward integration. We validate our model bylearning and predicting various real-world dynamics on dif-ferent complex networks in both (continuous-time) networkdynamics learning setting and (regularly-sampled) structuredsequence learning setting, and then apply our model to graphsemi-supervised classification tasks (a one-snapshot case). Thepromising experimental results demonstrate our model’s capa-bility of jointly capturing the structure, dynamics, and seman-tics of complex systems in a unified framework.

1 IntroductionReal-world complex systems, such as brain (Gerstner et al.2014), ecological systems (Gao, Barzel, and Barabasi 2016),gene regulation (Alon 2006), human health (Bashan et al.2016), and social networks (Zang et al. 2018), etc., are usu-ally modeled as complex networks and their evolution aregoverned by some underlying nonlinear dynamics (Newman,Barabasi, and Watts 2011). Revealing such complex networkdynamics is crucial for understanding the complex systems innature. Effective analytical tools developed for this goal canfurther help us predict and control these complex systems.

Although the theory of (nonlinear) dynamical systems hasbeen widely studied in different fields including applied math(Strogatz 2018),statistical physics (Newman, Barabasi, andWatts 2011), engineering (Slotine, Li, and others 1991), ecol-ogy (Gao, Barzel, and Barabasi 2016) and biology (Bashanet al. 2016), these developed models are typically based on a

Copyright c© 2020, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.

clear knowledge of the network evolution mechanism whichare thus usually referred to as mechanistic models. Given thecomplexity of the real world, there is still a large number ofcomplex networks whose underlying dynamics are unknownyet (e.g., they can be too complex to be modeled by explicitmathematical functions). At the same time, massive data areusually generated during the evolution of these networks.Therefore, modern data-driven approaches are promising andhighly demanding in learning the elusive dynamics on com-plex networks.

The development of a successful data-driven approachfor modeling the dynamics on complex networks is verychallenging: the interaction structures of the nodes in thenetwork are complex and the number of nodes is large,which is referred to as the high-dimensionality of the com-plex systems; the rules governing the dynamic change of thenodes’ states in complex networks are nonlinear, continuous-time and elusive; the structural and dynamic dependencieswithin the system are difficult to model. Recently, there isan emerging trend in the data-driven discovery of ordinarydifferential equations (ODEs) or partial differential equations(PDEs), including sparse regression method (Kutz et al. 2017;Mangan et al. 2016; Rudy et al. 2017), residual network(Qin, Wu, and Xiu 2018), feedforward neural network (Raissi,Perdikaris, and Karniadakis 2018), etc. However, these meth-ods can only handle very small ODE systems or PDEs whichconsist of only a few terms. Effective learning of the dy-namics on large complex networks which consist of tens ofthousands of interactions is still largely unknown.

In this paper, we propose a differential deep learning ap-proach to learn continuous-time dynamics on complex net-works. We model (high-dimensional) differential equationsystems by graph neural networks to capture the instanta-neous change of network dynamics. Instead of mappingthrough a discrete number of layers in the forward process ofthe conventional neural network models (LeCun, Bengio, andHinton 2015), we integrate the dynamics on graphs modeledby a neural differential equation system over continuous time.This is like a deep neural network with an infinite number oflayers (Chen et al. 2018). In a dynamical system view, thecontinuous depth can be interpreted as continuous physicaltime, and the outputs of a hidden layer at time t are instan-taneous network dynamics at that moment. In the backwardlearning process, we back-propagate the gradients of the su-

pervised information w.r.t. the learnable parameters againstthe forward integration, leading to learning the differentialequation system in an end-to-end manner. Besides, we furtherenhance our algorithm by learning the dynamics in a hiddenspace learned from the original space of nodes’ states. Wename our model Neural Dynamics on Complex Networks(NDCN).

We validate our approach by three general tasks: 1) (Net-work dynamics learning): Can we learn the continuous-timedynamics on complex networks? 2) (Structured sequencelearning): Can we predict the regularly-sampled structuredsequence? (Seo et al. 2018) 3) (One-snapshot learning): Canwe infer the semantic labels of nodes at the terminal timemoment? The experimental results show that our model canaccurately learn and predict the real-world dynamics on vari-ous complex networks, in both continuous-time setting andregularly-sampled sequence setting. What’s more, our modellearns the semantic labels of nodes in the setting of graphsemi-supervised learning (Kipf and Welling 2017) with verycompetitive performance. Our framework potentially servesas a unified framework to jointly capture the structure, dy-namics, and semantics of complex systems in a data-drivenmanner. Our codes and datasets are open-sourced (Refer toAppendix A).

2 Related workDynamics of complex networks. Real-world complex sys-tems are usually modeled as complex networks and driven bynonlinear dynamics: the dynamics of brain and human micro-bial are examined in (Gerstner et al. 2014) and (Bashan et al.2016) respectively; (Gao, Barzel, and Barabasi 2016) investi-gated the resilience dynamics of complex systems. (Barzel,Liu, and Barabasi 2015) gave a pipeline to construct networkdynamics. To the best of our knowledge, our NDCN modelis the first neural network approach which learns continuous-time dynamics on complex networks in a data-driven manner.

Data-driven discovery of dynamics. Recently, somedata-driven approaches are proposed to learn ODEs or PDEs,including sparse regression (Kutz et al. 2017), residual net-work (Qin, Wu, and Xiu 2018), feedforward neural network(Raissi, Perdikaris, and Karniadakis 2018), coupled neuralnetworks (Raissi 2018) and so on. (Mangan et al. 2016) triesto learn biological networks dynamics by sparse regressionover a large library, which is not scalable to systems withmore than 10 nodes. In all, none of them can learn the dynam-ics on complex systems with more than hundreds of nodesand tens of thousands of interactions.

Neural ODEs. Inspired by residual network (He et al.2016) and ordinary differential equation (ODE) theory (Luet al. 2017; Ruthotto and Haber 2018), seminal work neuralODE model (Chen et al. 2018) was proposed to re-write resid-ual networks, normalizing flows, and recurrent neural net-work in a dynamical system way. However, our NDCN modeldeals with large and complex differential equations systems.Besides, our model solves different problems, namely learn-ing the dynamics on complex networks.

Optimal control. Relationships between back-propagation in deep learning and optimal controltheory are investigated in (Han, Li, and others 2018;

Benning et al. 2019). We formulate our loss function byleveraging the concept of running loss and terminal loss inoptimal control. We give novel constraints in optimal controlwhich is modeled by neural differential equations systemson graphs. Our model solves novel tasks, e.g. learning thedynamics on complex networks and refer to Sec.3.1.

Graph neural networks and temporal-GNNs. Graphneural networks (GNNs) (Wu et al. 2019b), e.g., Graphconvolution network (GCN) (Kipf and Welling 2017),attention-based GNN (AGNN) (Thekumparampil et al. 2018),graph attention networks (GAT) (Velickovic et al. 2017),etc., achieved state-of-the-art performance on graph semi-supervised learning tasks. However, existing GNNs usuallyhave 1 or 2 layers and can not go deep (Li, Han, and Wu 2018;Wu et al. 2019b). Our NDCN gives a dynamical system viewon GNNs: the continuous depth can be interpreted as con-tinuous physical time, and the outputs of a hidden layer attime t are instantaneous network dynamics at that moment.By capturing continuous-time network dynamics and theirtransient behaviors, our model gives very competitive andeven better results than above GNNs.

By combining RNNs or convolution operators with GNNs,temporal-GNNs (Yu, Yin, and Zhu 2017; Kazemi et al. 2019;Narayan and Roe 2018; Seo et al. 2018) try to predict theregularly-sampled structured sequences. However, these mod-els can not be applied to continuous-time dynamics (observedat arbitrary physical times with different time intervals). OurNDCN not only predicts the continuous-time network dynam-ics at an arbitrary time or semantic labels from one snapshotbut also predicts the structured sequences very well in a moresuccinct way with much fewer parameters.

3 Preliminaries3.1 Problem DefinitionWe first introduce a differential equation system which mod-els the dynamics on complex networks:

dX(t)

dt= f

(X(t), G,W (t), t

), (1)

where X(t) ∈ Rn×d represents the state (node feature val-ues) of a dynamic system consisting of n linked nodes attime t ∈ [0,∞), and each node is characterized by d di-mensional features. G = (V, E) is the network structurecapturing how the nodes are linked to each other. W (t) areparameters which control how the system evolves over time.X(0) = X0 is the initial states of this system at time t = 0.The function f : Rn×d → Rn×d is a function governingthe dynamics of the system, which could be either linearor nonlinear. In addition, nodes can have various semanticlabels Y (X,Θ, t) ∈ {0, 1}n×k at time t, and Θ representsthe parameters of this classification function. The problemswe are trying to solve in this paper are:• (Network dynamics learning) How to learn the

continuous-time dynamics dX(t)dt on complex net-

works from empirical data? Mathematically, given agraph G and the observations of the states of system{ ˆX(t1), ˆX(t2), ..., ˆX(tT )|0 ≤ t1 < ... < tT }, and t1to tT are arbitrary physical time stamps, can we learn

differential equation systems dX(t)dt = f(X,G,W, t) to

generate or predict continuous-time dynamics X(t) at ar-bitrary physical time t? The arbitrary physical time meansthat {t1, ..., tT } are possibly irregularly sampled with dif-ferent observational time intervals. When t > tT , wecall the task extrapolation prediction, while t < tT andt 6= {t1, ..., tT } for interpolation prediction.

• (Structured sequence learning). As a special case whent1, t2, ..., tT are sampled regularly with equal time inter-vals, the above problem degenerates to a structured se-quence learning task with an emphasis on sequential orderinstead of arbitrary physical time. The goal is to extrapolatenext m steps’ X[tT + 1], ..., X[tT +m] .

• (One-snapshot learning) How to learn the semantic la-bels of Y (X(tT )) at the moment t = tT for each node?As a special case of above problem with an emphasis on aspecific moment and without loss of generality, we focuson the moment at the terminal time tT . The function Ycan be a mapping from the nodes’ states (e.g. humidity) totheir labels (e.g. taking umbrella or not).

3.2 Network DynamicsWe investigate the following three real-world network dy-namics. Let

−−→xi(t) ∈ Rd×1 be d dimensional features of node

i at time t and thus X(t) = [. . . ,−−→xi(t), . . . ]

T ∈ Rn×d. Weshow their differential equation systems in vector form forclarity and implement them in matrix form:

• The heat diffusion dynamics d−−−→xi(t)dt =

−ki,j∑nj=1Ai,j(

−→xi − −→xj) governed by Newton’slaw of cooling (Luikov 2012), which states that the rateof heat change of node i is proportional to the differenceof the temperature between node i and its neighbors withheat capacity matrix A.

• The mutualistic interaction dynamics among species in

ecology, governed by equation d−−−→xi(t)dt = bi + −→xi(1 −

−→xi

ki)(−→xi

ci− 1) +

∑nj=1Ai,j

−→xi−→xj

di+ei−→xi+hj

−→xj(For brevity, the

operations between vectors are element-wise). The mu-tualistic differential equation systems (Gao, Barzel, andBarabasi 2016) capture the abundance ~xi(t) of species i,consisting of incoming migration term bi, logistic growthwith population capacity ki (Zang et al. 2018) and Alleeeffect (Allee et al. 1949) with cold-start threshold ci, andmutualistic interaction term with interaction network A.

• The gene regulatory dynamics governed by Michaelis-

Menten equation d−−−→xi(t)dt = −bi−→xif +

∑nj=1Ai,j

−→xjh

−→xjh+1

where the first term models degradation when f = 1 ordimerization when f = 2, and the second term capturesgenetic activation tuned by the Hill coefficient h (Alon2006; Gao, Barzel, and Barabasi 2016).

Complex Networks. We consider following networks: (a)Grid network, where each node is connected with 8 neigh-bors (as shown in Fig. 2(a)) ; (b) Random network, gen-erated by Erdos and Renyi model (Erdos and Renyi 1959)(as shown in Fig. 2(b)); (c) Power-law network, generatedby Albert-Barabasi model (Barabasi and Albert 1999) (as

shown in Fig. 2(c)); (d) Small-world network, generated byWatts-Strogatz model (Watts and Strogatz 1998) (as shown inFig. 2(d)); and (e) Community network, generated by randompartition model (Fortunato 2010) (as shown in Fig. 2(e)).

Visualization. To visualize dynamics on complex networksover time is not trivial. We first generate a network with nnodes by aforementioned network models. The nodes arere-ordered according to the community detection methodby Newman (Newman 2010) and each node has a uniquelabel from 1 to n. We layout these nodes on a 2-dimensional√n ×√n grid and each grid point (r, c) ∈ N2 represents

the ith node where i = r√n + c + 1. Thus, nodes’ states

X(t) ∈ Rn×d at time t when d = 1 can be visualized asa scalar field function X : N2 → R over the grid. Pleaserefer to Appendix B for the animations of these dynamics ondifferent complex networks over time.

4 General frameworkWe formulate our general framework as follows:

argminW (t),Θ(T )

L =

∫ T

0

R(X(t), G,W, t

)dt+ S

(Y (X(T ),Θ)

)subject to

dX(t)

dt= f

(X(t), G,W, t

), X0

(2)

whereR(X(t), G,W, t) is the running loss of the dynamicson graph at time t, and S(Y (X(T ),Θ)) is the terminal se-mantic loss at time T . By integrating dX

dt = f(X,G,W, t)over time t from initial state X0, a.k.a. solving the initialvalue problem (Boyce, DiPrima, and Meade 1992) for thisdifferential equation system, we can get the continuous-timedynamics X(t) = X(0) +

∫ T0f(X(τ), G,W, τ) dτ at arbi-

trary time moment t > 0.Such a formulation can be seen as an optimal con-

trol problem so that the goal becomes to obtain the bestcontrol parameters W (t) for differential equation systemdXdt = f(X,G,W, t) and the best classification parameters

Θ for semantic function Y (X(t),Θ) by solving above op-timization problem. Different from traditional optimal con-trol framework, we model the differential equation systemsdXdt = f(X,G,W, t) by graph neural networks. By inte-

grating dXdt = f(X,G,W, t) over continuous time, namely

X(t) = X(0) +∫ t

0f(X(τ), G,W, τ

)dτ , we get our dif-

ferential deep learning models. In a dynamical system view,our differential deep learning models can be a time-varyingcoefficient dynamical system when W (t) changes over time;or a constant coefficient dynamical system when W is con-stant over time for parameter sharing. It’s worthwhile torecall that the deep learning methods with L hidden neu-ral layers f∗ are X[L] = fL ◦ ... ◦ f2 ◦ f1(X[0]), whichare iterated maps (Strogatz 2018) with an integer numberof discrete layers and thus can not learn continuous-timedynamics X(t) at arbitrary time. In contrast, our modelX(t) = X(0) +

∫ t0f(X(τ), G,W, τ

)dτ can have con-

tinuous layers with a real number t depth corresponding tocontinuous-time dynamics.

Moreover, to further increase the express ability of ourmodel, we can encode the network signal X(t) from the orig-inal space to Xh(t) in hidden space (usually with a different

number of dimensions), and learn the dynamics in such aspace. Then our model becomes:

argminW (t),Θ(T )

L =

∫ T

0

R(X(t), G,W, t

)dt+ S

(Y (X(T ),Θ)

)subject to Xh(t) = fe

(X(t)

)dXh(t)

dt= f

(Xh(t), G,W, t

), Xh(0)

X(t) = fd

(Xh(t)

)(3)

where the first constraint transforms X(t) into hidden spaceXh(t) through encoding function fe. The second constraintis the governing dynamics in the hidden space. The thirdconstraint decodes the hidden signal back to the originalspace with decoding function fd. The design of fe, f , and fdare flexible to be any neural structure (e.g. fd is a softmaxfunction for classification). We denote our model as NeuralDynamics on Complex Networks (NDCN).

We solve the initial value problem (i.e., integrating thedifferential equation systems over time numerically) by nu-merical methods (e.g., 1st-order Euler method, high-ordermethod Dormand-Prince DOPRI5 (Dormand 1996), etc.).The numerical methods can approximate continuous-timedynamics X(t) = X(0) +

∫ t0f(X(τ), G,W, τ

)dτ at ar-

bitrary time t accurately with guaranteed error. In order tolearn the learnable parameters W , we back-propagate thegradients of the loss function w.r.t the control parameters∂L∂W over the numerical integration process backwards in anend-to-end manner, and solve the optimization problem bystochastic gradient descent methods (e.g., Adam (Kingmaand Ba 2015)). We will show some concrete examples of theabove framework in the next three sections.

5 Learning continuous-time networkdynamics

In this section, we investigate if our NDCN model can learnaforementioned network dynamics in the continuous-timesetting for both interpolation prediction and extrapolationprediction. The continuous-time setting means that the ob-servational times t1 to tT of the observed states of sys-tem { ˆX(t1), ˆX(t2), ..., ˆX(tT )} are arbitrary physical timestamps which are irregularly sampled with different observa-tional time intervals. The extrapolation prediction is to predictX(t) at arbitrary physical time moment t when t > tT , whilethe interpolation prediction is to predict X(t) when t < tTand t 6= {t1, ..., tT }.

5.1 A Model InstanceWe solve the objective function in (3) with an emphasis onrunning loss only. Without the loss of generality, we use`1-norm loss as the running loss R. More concretely, weadopt two fully connected neural layers with a nonlinearhidden layer as the encoding function fe, a graph convolutionneural network (GCN) like structure (Kipf and Welling 2017)but with a different graph diffusion operator Φ to modelthe instantaneous network dynamics in the hidden space,and a linear decoding function fd for regression tasks in the

𝑿𝑿 𝒍𝒍

𝑋𝑋 𝑙𝑙 + 1 = 𝑋𝑋(𝑙𝑙) + 𝑓𝑓ℎ(𝑋𝑋(𝑙𝑙))

Diffusion

Weight

ReLU

× 𝜹𝜹 = 𝟏𝟏

+

𝑿𝑿 𝒕𝒕

𝑋𝑋 𝑡𝑡 + 𝛿𝛿 = 𝑋𝑋(𝑡𝑡) + �𝑡𝑡

𝑡𝑡+𝛿𝛿

𝑓𝑓(𝑋𝑋,𝐺𝐺,𝑊𝑊, 𝜏𝜏)𝑑𝑑𝜏𝜏

𝑿𝑿𝒉𝒉 𝒕𝒕

𝑋𝑋ℎ 𝑡𝑡 + 𝛿𝛿 = 𝑋𝑋ℎ(𝑡𝑡) + �𝑡𝑡

𝑡𝑡+𝛿𝛿

𝑓𝑓(𝑋𝑋ℎ,𝐺𝐺,𝑊𝑊, 𝜏𝜏)𝑑𝑑𝜏𝜏

𝑿𝑿 𝒕𝒕

Diffusion

Weight

ReLU

× 𝜹𝜹 → 𝟎𝟎+

+ �𝒕𝒕

𝒕𝒕+𝜹𝜹

Diffusion

Weight

ReLU

Wei

ght

Tanh

Wei

ght

�𝒕𝒕

𝒕𝒕+𝜹𝜹

𝑓𝑓𝑒𝑒

𝑓𝑓𝑑𝑑

a) Residual-GNN b) ODE-GNN c) NDCN

𝒅𝒅𝑿𝑿 𝒕𝒕𝒅𝒅𝒕𝒕

𝒅𝒅𝑿𝑿𝒉𝒉 𝒕𝒕𝒅𝒅𝒕𝒕

GCN Layer

Wei

ght

𝑿𝑿 𝒍𝒍 + 𝟏𝟏 − 𝑿𝑿(𝒍𝒍)

Figure 1: Illustration of an NDCN instance. a) ResidualGraph Neural Networks, b) ODE-GNN model and c) OurNeural Dynamics on Complex Network (NDCN) model. Theinteger l represents the discrete lth layer and the real numbert represents continuous physical time.original signal space. Thus, our model is (see neural structurein Figure 1c) :

argminW∗,b∗

L =

∫ T

0

|X(t)− ˆX(t)| dt

subject to Xh(t) = tanh(X(t)We + be

)W0 + b0

dXh(t)

dt= ReLU

(ΦXh(t)W + b

), Xh(0)

X(t) = Xh(t)Wd + bd

(4)

where ˆX(t) ∈ Rn×d is the supervised dynamic informationavailable at time stamp t (in the semi-supervised case themissing information can be padded by 0). The |·| denotes `1-norm loss (mean element-wise absolute value difference) be-tween X(t) and ˆX(t) at time t ∈ [0, T ]. We adopt diffusionoperator Φ = D−

12 (D −A)D−

12 ∈ Rn×n which is the nor-

malized graph Laplacian where A ∈ Rn×n is the adjacencymatrix of the network and D ∈ Rn×n is the correspondingnode degree matrix. The W ∈ Rde×de and b ∈ Rn×de areshared parameters (namely, the weights and bias of a linearconnection layer) over time t ∈ [0, T ]. The We ∈ Rd×de andW0 ∈ Rde×de are the matrices in linear layers for encoding,while Wd ∈ Rde×d are for decoding. The be, b0, b, bd are thebiases at the corresponding layer. We lean the parametersWe,W0,W,Wd, be, b0, b, bd from empirical data so that wecan learn X in a data-driven manner.

We design the neural differential equation system asdX(t)dt = ReLU(ΦX(t)W + b) to learn any unknown net-

work dynamics. We can regard dX(t)dt as a single neural

layer at time moment t. The X(t) at arbitrary time t isachieved by integrating dX(t)

dt over time, i.e., X(t) = X(0)+∫ t0

ReLU(

ΦX(τ)W + b)dτ , leading to a continuous-time

deep neural network.

5.2 ExperimentsBaselines. To the best of our knowledge, there are no base-lines for learning continuous-time dynamics on complex net-works, and thus we compare the ablation models of NDCNfor this task. By investigating ablation models we show thatour NDCN is a minimum model for this task. We keep theloss function the same and construct the following baselines:

Grid Random Power Law

Small World

Real NDCN Real NDCN Real NDCN Real NDCN Real NDCN

(a) (b) (c) (d) (e) Community

0

Tim

e

Figure 2: Heat diffusion on different networks. Each of thefive vertical panels represents the dynamics on one networkover physical time. For each network dynamics, we illustratethe sampled ground truth dynamics (left) and the dynamicsgenerated by our NDCN (right) from top to bottom followingthe direction of time.

• The model without encoding fe and fd and thus no hid-den space: dX(t)

dt = ReLU(ΦX(t)W + b) , namely ODE-GNN, which learns the dynamics in the original signalspace X(t) as shown in Fig. 1b;

• The model without graph diffusion operator Φ: dXh(t)dt =

ReLU(Xh(t)W +b), i.e., an ODE Neural Network, whichcan be thought as a continuous-time version of forwardresidual neural network (See Fig. 1a and Fig. 1b for thedifference between residual network and ODE network).

• The model without control parameters W : dXh(t)dt =

ReLU(ΦXh(t)) which has no linear connection layer be-tween t and t+ dt (where dt→ 0) and thus indicating adetermined dynamics to spread signals.

Experimental setup. We generate underlying networkswith 400 nodes by network models in Sec.3.2 and the il-lustrations are shown in Fig. 2,3 and 4. We set the initialvalue X(0) the same for all the experiments and thus differ-ent dynamics are only due to their different dynamic rulesand underlying networks (See Appendix B).

We irregularly sample 120 snapshots of the continuous-time dynamics { ˆX(t1), ..., ˆX(t120)|0 ≤ t1 < ... < t120 ≤T} where the time intervals between t1, ..., t120 are different.We randomly choose 80 snapshots from ˆX(t1) to ˆX(t100)

for training, the left 20 snapshots from ˆX(t1) to ˆX(t100) fortesting the interpolation prediction task. We use the 20 snap-shots from ˆX(t101) to ˆX(t120) for testing the extrapolationprediction task.

We use Dormand-Prince method (Dormand 1996) to getthe ground truth dynamics, and use Euler method in theforward process of our NDCN (More configurations in Ap-pendix C). We evaluate the results by `1 loss and normalized`1 loss (normalized by the mean element-wise value of ˆX(t)),and they lead to the same conclusion (We report normalized`1 loss here and see Appendix E for `1 loss). Results are themean and standard deviation of the loss over 20 independentruns for 3 dynamic laws on 5 different networks by eachmethod.


Small World


Community

0

Tim

e

(a) (b) (c) (d) (e)

Figure 3: Biological mutualistic interaction on different net-works.


Small World


Community

0

Tim

e

(a) (b) (c) (d) (e)

Figure 4: Gene regulation dynamics on different networks.

Results We visualize the ground-truth and learned dynam-ics in Fig. 2,3 and 4, and please see the animations of thesenetwork dynamics in Appendix B. We find that one dynamiclaw may behave quite different on different networks: heat dy-namics may gradually die out to be stable but follow differentdynamic patterns in Fig. 2. Gene dynamics are asymptoticallystable on grid in Fig. 4a but unstable on random networksin Fig. 4b or community networks in Fig. 4e. Both generegulation dynamics in Fig. 4c and biological mutualisticdynamics in Fig. 3c show very bursty patterns on power-lawnetworks. However, visually speaking, our NDCN learns allthese different network dynamics very well.

The quantitative results of extrapolation and interpolationprediction are summarized in Table 1 and Table 2 respectively.We observe that our NDCN captures different dynamics onvarious complex networks accurately and outperforms allthe continuous-time baselines by a large margin, indicatingthat our NDCN potentially serves as a minimum model inlearning continuous-time dynamics on complex networks.

6 Learning Regularly-sampled DynamicsWhat’s more, our model also captures dynamics fromregularly-sampled network dynamics (i.e. the structured se-quence learning setting) very well.

Baselines. We compare our model with the temporal-GNNmodels which are usually combinations of RNN models andGNN models (Kazemi et al. 2019; Narayan and Roe 2018;Seo et al. 2018). We use GCN (Kipf and Welling 2017) as agraph structure extractor and use LSTM/GRU/RNN (Lipton,

Table 1: Continuous-time Extrapolation Prediction. Our NDCN predicts different continuous-time network dynamics ac-curately. Each result is the normalized `1 error with standard deviation (in percentage %) from 20 runs for 3 dynamics on 5networks by each method.

Grid Random Power Law Small World Community

HeatDiffusion

No-Encode 29.9± 7.3 27.8± 5.1 24.9± 5.2 24.8± 3.2 30.2± 4.4No-Graph 30.5± 1.7 5.8± 1.3 6.8± 0.5 10.7± 0.6 24.3± 3.0No-Control 73.4± 14.4 28.2± 4.0 25.2± 4.3 30.8± 4.7 37.1± 3.7NDCN 4.1± 1.2 4.3± 1.6 4.9± 0.5 2.5± 0.4 4.8± 1.0

MutualisticInteraction


GeneRegulation


Table 2: Continuous-time Interpolation Prediction. Our NDCN predicts different continuous-time network dynamics accu-rately. Each result is the normalized `1 error with standard deviation (in percentage %) from 20 runs for 3 dynamics on 5networks by each method.


HeatDiffusion




GeneRegulation


Berkowitz, and Elkan 2015) to learn the temporal relation-ships between ordered structured sequences. We keep theloss function the same and construct the following baselines.We denote each recurrent cell as LSTM/GRU/RNN and referto Appendix D for the detailed equations.• LSTM-GNN: the temporal-GNN with LSTM cell X[t +

1] = LSTM(GCN(X[t], G)).• GRU-GNN: the temporal-GNN with GRU cell X[t+ 1] =GRU(GCN(X[t], G)).

• RNN-GNN: the temporal-GNN with RNN cell X[t+1] =RNN(GCN(X[t], G)).

Experimental setup. We regularly sample 100snapshots of the continuous-time network dynamics{ ˆX[t1], ..., ˆX[t100]|0 ≤ t1 < ... < t120 ≤ T} where thetime intervals between t1, ..., t100 are the same. We usefirst 80 snapshots ˆX[t1], ..., ˆX[t80] for training and the left20 snapshots ˆX[t81], ..., ˆX[t100] for testing extrapolationprediction task. The temporal-GNN models are usuallyused for next few step prediction and can not be used forthe interpolation task (say, to predict X[t1.23]) directly.We use 5 and 10 for hidden dimension of GCN and RNNmodels respectively. Other settings are the same as previouscontinuous-time dynamics experiment.

Results We summarize the results of the extrapolation pre-diction of regularly-sampled dynamics in Table 3. The GRU-GNN model works well in mutualistic dynamics on randomnetwork and community network. Our NDCN predicts differ-ent dynamics on these complex networks accurately and out-performs the baselines in almost all the settings. What’s more,our model capture the structure and dynamics in a much moresuccinct way. The learnable parameters of our NDCN , RNN-GNN, GRU-GNN, LSTM-GNN are 901, 24530, 64770, and

84890 respectively.

7 Learning semantic labels at terminal timeWe investigate the third question, i.e., how to learn the se-mantic labels of each node at the terminal time? Variousgraph neural networks (GNN) (Wu et al. 2019b) achievethe state-of-the-art performance in graph semi-supervisedclassification task (Yang, Cohen, and Salakhutdinov 2016;Kipf and Welling 2017). Existing GNNs usually adopt 1 or 2hidden layers (Kipf and Welling 2017; Velickovic et al. 2017)and cannot go deep (Li, Han, and Wu 2018). Our frameworkfollows the perspective of a dynamical system, and goes be-yond an integer number L of hidden layers in GNNs to a realnumber depth t of hidden layers, implying continuous-timedynamics on the graph. By integrating continuous-time dy-namics on the graph over time, we get a more fine-grainedforward process and thus our NDCN model shows very com-petitive even better results compared with state-of-the-artGNN models which may have sophisticated parameters (e.g.attention).

7.1 A Model InstanceFollowing the same framework as in Section 3, we proposea simple model with the terminal semantic loss S(Y (T ))modeled by the cross-entropy loss for classification task:

argminWe,be,Wd,bd

L =

∫ T

0

R(t) dt−n∑

i=1

c∑k=1

Yi,k(T ) log Yi,k(T )

subject to Xh(0) = tanh(X(0)We + be

)dXh(t)

dt= ReLU

(ΦXh(t)

)Y (T ) = softmax(Xh(T )Wd + bd)

(5)

where Y (T ) ∈ Rn×c is the label distributions of nodes attime T ∈ R whose element Yi,k(T ) denotes the probability

Table 3: Regularly-sampled Extrapolation Prediction. Our NDCN predicts different structured sequences accurately. Eachresult is the normalized `1 error with standard deviation (in percentage %) from 20 runs for 3 dynamics on 5 networks by eachmethod.


HeatDiffusion

LSTM-GNN 12.8± 2.1 21.6± 7.7 12.4± 5.1 11.6± 2.2 13.5± 4.2GRU-GNN 11.2± 2.2 9.1± 2.3 8.8± 1.3 9.3± 1.7 7.9± 0.8RNN-GNN 18.8± 5.9 25.0± 5.6 18.9± 6.5 21.8± 3.8 16.1± 0.0NDCN 4.3± 0.7 4.7± 1.7 5.4± 0.4 2.7± 0.4 5.3± 0.7



GeneRegulation


of the node i = 1, . . . , n with label k = 1, . . . , c at time T .The Y ∈ Rn×c is the supervised information (again missinginformation can be padded by 0) observed at time t = T .We use differential equation system dX(t)

dt = ReLU(ΦX(t))to spread the graph signals over continuous time [0, T ], i.e.,Xh(T ) = Xh(0) +

∫ T0

ReLU(

ΦXh(t))dt.

Compared with the model in Eq.4, we only have su-pervised information from one snapshot at time t = T .Thus, we model the running loss

∫ T0R(t) dt as the `2-

norm regularizer of the learnable parameters∫ T

0R(t) dt =

λ(|We|22 + |be|22 + |Wd|22 + |bd|22) to avoid over-fitting. Weadopt the diffusion operator Φ = D−

12 (αI+(1−α)A)D−

12

where A is the adjacency matrix, D is the degree matrixand D = αI + (1 − α)D keeps Φ normalized. The pa-rameter α ∈ [0, 1] tunes nodes’ adherence to their previ-ous information or their neighbors’ collective opinion. Weuse it as a hyper-parameter here for simplicity and we canmake it as a learnable parameter later. The differential equa-tion system dX

dt = ΦX follows the dynamics of averaging

the neighborhood opinion as d−−−→xi(t)dt = α

(1−α)di+α

−−→xi(t) +∑n

j Ai,j1−α√

(1−α)di+α√

(1−α)dj+α

−−−→xj(t) for node i. When

α = 0, Φ averages the neighbors as normalized randomwalk, when α = 1, Φ captures exponential dynamics with-out network effects, and when α = 0.5, Φ averages bothneighbors and itself as in (Kipf and Welling 2017).

7.2 ExperimentsWe validate our model in the graph semi-supervised classifi-cation setting. For the consistency of comparison with priorworks, we follow the same experimental setup as (Kipf andWelling 2017; Velickovic et al. 2017; Thekumparampil etal. 2018). Refer to Appendix F for the detailed informationabout datasets, baselines, and their configurations.

Results We summarize the results in Table 4. We find ourNDCN outperforms many state-of-the-art GNN models. Re-sults for the baselines are taken from (Kipf and Welling 2017;Velickovic et al. 2017; Thekumparampil et al. 2018; Wu etal. 2019a). We report the mean and standard deviation of ourresults for 100 runs. We get our reported results in Table 4when terminal time T = 1.2 , α = 0 for the Cora dataset,T = 1.0, α = 0.8 for the Citeseer dataset, and T = 1.1,α = 0.4 for the Pubmed dataset.

Table 4: Test mean accuracy with standard deviation in per-centage (%) over 100 runs. Our NDCN model gives verycompetitive results compared with many GNN models.

Model Cora Citeseer Pubmed

GCN 81.5 70.3 79.0

AGNN 83.1± 0.1 71.7± 0.1 79.9± 0.1

GAT 83.0± 0.7 72.5± 0.7 79.0± 0.3

NDCN 83.3± 0.6 73.1± 0.6 79.8± 0.4(a) (b) (c)

Figure 5: Our NDCN model captures continuous-time dynam-ics. Mean classification accuracy of 100 runs over terminaltime when given a specific α. Insets are the accuracy over thetwo-dimensional space of terminal time and α

By capturing the continuous-time network dynamics todiffuse network signals, our NDCN gives better classifica-tion accuracy at terminal time T ∈ R+. Figure 5 plots themean accuracy with error bars over terminal time T in theabovementioned α settings (we further plot the accuracy overterminal time T and α in the insets and Appendix G). We findfor all the three datasets their accuracy curves follow rise andfall patterns around the best terminal time. Indeed, when theterminal time T is too small or too large, the accuracy degen-erates because the features of nodes are in under-diffusionor over-diffusion states, implying the necessity in capturingcontinuous-time dynamics. In contrast, previous GNNs canonly have an discrete number of layers which can not capturethe continuous-time network dynamics accurately.

8 ConclusionWe propose a differential deep learning model to learncontinuous-time dynamics on complex networks. We modeldifferential equations systems by graph neural networksand integrate the neural differential equations systems overtime. By capturing the continuous-time network dynamics,our NDCN gives the meanings of physical time and thecontinuous-time network dynamics to the depth and hiddenoutputs respectively, learns real-world dynamics on complexnetwork accurately in both (irregularly-sampled) continuous-time setting and (regularly-sampled) structured sequence set-ting, and outperforms many GNN models in the graph semi-supervised classification task (a one-snapshot case). Codesand datasets are open-sourced (See Appendix A).

ReferencesAllee, W. C.; Park, O.; Emerson, A. E.; Park, T.; Schmidt, K. P.;et al. 1949. Principles of animal ecology. Technical report, SaundersCompany Philadelphia, Pennsylvania, USA.Alon, U. 2006. An introduction to systems biology: design principlesof biological circuits. Chapman and Hall/CRC.Barabasi, A.-L., and Albert, R. 1999. Emergence of scaling inrandom networks. science 286(5439):509–512.Barzel, B.; Liu, Y.-Y.; and Barabasi, A.-L. 2015. Constructing mini-mal models for complex system dynamics. Nature communications.Bashan, A.; Gibson, T. E.; Friedman, J.; Carey, V. J.; Weiss, S. T.;Hohmann, E. L.; and Liu, Y.-Y. 2016. Universality of humanmicrobial dynamics. Nature 534(7606):259.Benning, M.; Celledoni, E.; Ehrhardt, M. J.; Owren, B.; andSchonlieb, C.-B. 2019. Deep learning as optimal control problems:models and numerical methods. arXiv preprint arXiv:1904.05657.Boyce, W. E.; DiPrima, R. C.; and Meade, D. B. 1992. Elemen-tary differential equations and boundary value problems, volume 9.Wiley New York.Chen, T. Q.; Rubanova, Y.; Bettencourt, J.; and Duvenaud, D. K.2018. Neural ordinary differential equations. In Advances in NeuralInformation Processing Systems, 6571–6583.Dormand, J. R. 1996. Numerical methods for differential equations:a computational approach, volume 3. CRC Press.Erdos, P., and Renyi, A. 1959. On random graphs i. Publ. Math.Debrecen 6:290–297.Fortunato, S. 2010. Community detection in graphs. Physics reports486(3-5):75–174.Gao, J.; Barzel, B.; and Barabasi, A.-L. 2016. Universal resiliencepatterns in complex networks. Nature 530(7590):307.Gerstner, W.; Kistler, W. M.; Naud, R.; and Paninski, L. 2014.Neuronal dynamics: From single neurons to networks and modelsof cognition. Cambridge University Press.Han, J.; Li, Q.; et al. 2018. A mean-field optimal control formulationof deep learning. arXiv preprint arXiv:1807.01083.He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learningfor image recognition. In Proceedings of the IEEE conference oncomputer vision and pattern recognition, 770–778.Kazemi, S. M.; Goel, R.; Jain, K.; Kobyzev, I.; Sethi, A.; Forsyth, P.;and Poupart, P. 2019. Relational representation learning for dynamic(knowledge) graphs: A survey. arXiv preprint arXiv:1905.11485.Kingma, D. P., and Ba, J. 2015. Adam: A method for stochasticoptimization. In ICLR 2015.Kipf, T. N., and Welling, M. 2017. Semi-supervised classificationwith graph convolutional networks. In ICLR 2017.Kutz, J. N.; Rudy, S. H.; Alla, A.; and Brunton, S. L. 2017. Data-driven discovery of governing physical laws and their parametricdependencies in engineering, physics and biology. In 2017 IEEE7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 1–5. IEEE.LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. nature521(7553):436.Li, Q.; Han, Z.; and Wu, X.-M. 2018. Deeper insights into graphconvolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.Lipton, Z. C.; Berkowitz, J.; and Elkan, C. 2015. A critical reviewof recurrent neural networks for sequence learning. arXiv preprintarXiv:1506.00019.

Lu, Y.; Zhong, A.; Li, Q.; and Dong, B. 2017. Beyond finitelayer neural networks: Bridging deep architectures and numericaldifferential equations. arXiv preprint arXiv:1710.10121.Luikov, A. v. 2012. Analytical heat diffusion theory. Elsevier.Mangan, N. M.; Brunton, S. L.; Proctor, J. L.; and Kutz, J. N. 2016.Inferring biological networks by sparse identification of nonlineardynamics. IEEE Transactions on Molecular, Biological and Multi-Scale Communications 2(1):52–63.Narayan, A., and Roe, P. H. 2018. Learning graph dynamics usingdeep neural networks. IFAC-PapersOnLine 51(2):433–438.Newman, M.; Barabasi, A.-L.; and Watts, D. J. 2011. The structureand dynamics of networks, volume 12. Princeton University Press.Newman, M. 2010. Networks: an introduction. Oxford U. press.Qin, T.; Wu, K.; and Xiu, D. 2018. Data driven governing equa-tions approximation using deep neural networks. arXiv preprintarXiv:1811.05537.Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2018. Multistepneural networks for data-driven discovery of nonlinear dynamicalsystems. arXiv preprint arXiv:1801.01236.Raissi, M. 2018. Deep hidden physics models: Deep learning ofnonlinear partial differential equations. The Journal of MachineLearning Research 19(1):932–955.Rudy, S. H.; Brunton, S. L.; Proctor, J. L.; and Kutz, J. N. 2017.Data-driven discovery of partial differential equations. ScienceAdvances 3(4):e1602614.Ruthotto, L., and Haber, E. 2018. Deep neural networks motivatedby partial differential equations. arXiv preprint arXiv:1804.04272.Seo, Y.; Defferrard, M.; Vandergheynst, P.; and Bresson, X. 2018.Structured sequence modeling with graph convolutional recurrentnetworks. In International Conference on Neural Information Pro-cessing, 362–373. Springer.Slotine, J.-J. E.; Li, W.; et al. 1991. Applied nonlinear control,volume 199. Prentice hall Englewood Cliffs, NJ.Strogatz, S. H. 2018. Nonlinear Dynamics and Chaos with StudentSolutions Manual: With Applications to Physics, Biology, Chemistry,and Engineering. CRC Press.Thekumparampil, K. K.; Wang, C.; Oh, S.; and Li, L.-J. 2018.Attention-based graph neural network for semi-supervised learning.arXiv preprint arXiv:1803.03735.Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.;and Bengio, Y. 2017. Graph attention networks. arXiv preprintarXiv:1710.10903.Watts, D. J., and Strogatz, S. H. 1998. Collective dynamics ofsmall-worldnetworks. nature 393(6684):440.Wu, F.; Zhang, T.; Jr., A. H. S.; Fifty, C.; Yu, T.; and Weinberger,K. Q. 2019a. Simplifying graph convolutional networks. CoRR.Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; and Yu, P. S. 2019b.A comprehensive survey on graph neural networks. arXiv preprintarXiv:1901.00596.Yang, Z.; Cohen, W. W.; and Salakhutdinov, R. 2016. Revisitingsemi-supervised learning with graph embeddings. In ICML 2016,40–48.Yu, B.; Yin, H.; and Zhu, Z. 2017. Spatio-temporal graph convolu-tional networks: A deep learning framework for traffic forecasting.arXiv preprint arXiv:1709.04875.Zang, C.; Cui, P.; Faloutsos, C.; and Zhu, W. 2018. On power lawgrowth of social networks. IEEE Transactions on Knowledge andData Engineering 30(9):1727–1740.

A ReproducibilityTo ensure the reproducibility, we open-sourced our datasetsand Pytorch implementation empowered by GPU and sparsematrix at:

https://drive.google.com/open?id=19x7uas9G5w0gU8bHDoohJmDrVOt9wl8W

B Animations of the real-world dynamics ondifferent networks

Please view the animations of the three real-world dynamicson five different networks learned by different models at:

https://drive.google.com/open?id=1KBl-6Oh7BRxcQNQrPeHuKPPI6lndDa5Y

We will find our NDCN captures the real-world dynamics ondifferent networks very accurately while the baselines cannot. The detailed experimental configurations are shown asfollows:

B.1 Underlying NetworksWe generate various networks by as follows, and we visualizetheir adjacency matrix after re-ordering their nodes by thecommunity detection method by Newman (Newman 2010).• Grid network:

0 50 100 150 200 250 300 350 400

0

50

100

150

200

250

300

350

400

Figure 6: Adjacency matrix of grid network taking on a cir-culant matrix.

• Random network:

i m p o r t ne tworkx as nxn = 400G = nx . e r d o s r e n y i g r a p h( n , 0 . 1 , s eed = seed )

• Power-law network:

n = 400G = nx . b a r a b a s i a l b e r t g r a p h( n , 5 , s eed = seed )

0 50 100 150 200 250 300 350 400

0

50

100

150

200

250

300

350

400

Figure 7: Adjacency matrix of random network.

0 50 100 150 200 250 300 350 400

0

50

100

150

200

250

300

350

400

Figure 8: Adjacency matrix of power-law network.

• Small-world network:

n = 400G = nx . n e w m a n w a t t s s t r o g a t z g r a p h( 4 0 0 , 5 , 0 . 5 , s eed = seed )

• Community network:

n1 = i n t ( n / 3 )n2 = i n t ( n / 3 )n3 = i n t ( n / 4 )n4 = n − n1 − n2 −n3G = nx . r a n d o m p a r t i t i o n g r a p h( [ n1 , n2 , n3 , n4 ] , . 2 5 , . 0 1 , s eed = seed )

B.2 Initial Values of Network DynamicsWe set the initial valueX(0) the same for all the experimentalsettings and thus different dynamics are only due to their

0 50 100 150 200 250 300 350 400

0

50

100

150

200

250

300

350

400

Figure 9: Adjacency matrix of small-world network.

0 50 100 150 200 250 300 350 400

0

50

100

150

200

250

300

350

400

Figure 10: Adjacency matrix of community network.

different dynamic rules and underlying networks modelledby X = f(X,G,W, t) as shown in Fig. 2,3 and 4. Please seeabove animations to check out different network dynamics.

n = 400N = i n t ( np . c e i l ( np . s q r t ( n ) ) )x0 = t o r c h . z e r o s (N, N)x0 [ i n t ( 0 . 0 5∗N ) : i n t ( 0 . 2 5∗N) ,

i n t ( 0 . 0 5∗N ) : i n t ( 0 . 2 5∗N) ] = 25# x0 [ 1 : 5 , 1 : 5 ] = 25f o r N = 20 or n= 400 c a s ex0 [ i n t ( 0 . 4 5∗N ) : i n t ( 0 . 7 5∗N) ,

i n t ( 0 . 4 5∗N ) : i n t ( 0 . 7 5∗N) ] = 20# x0 [ 9 : 1 5 , 9 : 1 5 ] = 20 f o r N = 20 or n= 400 c a s ex0 [ i n t ( 0 . 0 5∗N ) : i n t ( 0 . 2 5∗N) ,

i n t ( 0 . 3 5∗N ) : i n t ( 0 . 6 5∗N) ] = 17# x0 [ 1 : 5 , 7 : 1 3 ] = 17 f o r N = 20 or n= 400 c a s e

B.3 Network DynamicsWe adopt the following three real-world dynamics from differentdisciplines. Please see above animations to check out the visual-ization of different network dynamics. The differential equationsystems are shown as follows:• The heat diffusion dynamics governed by Newton’s law of cool-

ing (Luikov 2012),

d−−→xi(t)

dt= −ki,j

∑n

j=1Ai,j(

−→xi −−→xj) (6)

states that the rate of heat change of node i is proportional to thedifference in the temperatures between i and its neighbors withheat capacity matrix A. We use k = 1 here.

• The mutualistic interaction dynamics among species in ecology,governed by equation

d−−→xi(t)

dt= bi +−→xi(1−

−→xiki

)(−→xici− 1) +

n∑j=1

Ai,j−→xi−→xj

di + ei−→xi + hj

−→xj.

(7)The mutualistic differential equation systems (Gao, Barzel, and

Barabasi 2016) capture the abundance ~xi(t) of species i, consist-ing of incoming migration term bi, logistic growth with popu-lation capacity ki (Zang et al. 2018) and Allee effect (Allee etal. 1949) with cold-start threshold ci, and mutualistic interactionterm with interaction network A. We use b = 0.1, k = 5.0,c = 1.0, d = 5.0, e = 0.9, h = 0.1 here.

• The gene regulatory dynamics governed by Michaelis-Mentenequation

d−−→xi(t)

dt= −bi ~xif +

∑n

j=1Ai,j

−→xjh−→xjh + 1

(8)

where the first term models degradation when f = 1 or dimeriza-tion when f = 2, and the second term captures genetic activationtuned by the Hill coefficient h (Gao, Barzel, and Barabasi 2016).We adopt b = 1.0, f = 1.0, h = 2.0 here.

B.4 Terminal Time:We use T = 5 for mutualistic dynamics and gene regulatory dynam-ics over different networks, and T = 5, 0.1, 0.75, 2, 0.2 for heatdynamics on the grid, random graph, power-law network, small-world network, and community network respectively due to theirdifferent time scale of network dynamics. Please see above anima-tions to check out different network dynamics.

B.5 Visualizations of network dynamicsPlease see above animations to check out the visualization of differ-ent network dynamics. We generate networks by aforementionednetwork models with n = 400 nodes. The nodes are re-orderedaccording to community detection method by Newman (Newman2010). We visualize their adjacency matrices in Fig. 11,12 and 13.We layout these networks in a grid and thus nodes’ states X(t)are visualized as functions on the grid. Specifically, the nodes arere-ordered according to community detection method by Newman(Newman 2010) and each node has a unique label from 1 to n. Welayout these nodes on a 2-dimensional

√n×√n grid and each grid

point (r, c) ∈ N2 represents the ith node where i = r√n+ c+ 1.

Thus, nodes’ states X(t) ∈ Rn×d when d = 1 can be visualized asa scalar field function X : N2 → R over the grid.


Small World


(a) (b) (c) (d) (e) Community

0

Tim

e

Figure 11: Heat diffusion on different networks. Each of the five vertical panels represents the dynamics on one network overphysical time. For each network dynamics, we illustrate the sampled ground truth dynamics (left) and the dynamics generated byour NDCN (right) from top to bottom following the direction of time.

C Model configurations of Learning NetworkDynamics in both continuous-time and

regularly-sampled settingsWe train our NDCN model by Adam (Kingma and Ba 2015). Wechoose 20 as the hidden dimension of Xh ∈ Rn×20. We train ourmodel for a maximum of 2000 epochs using Adam (Kingma andBa 2015) with learning rate 0.01. We summarize our `2 regular-ization parameter as in Table 5 and Table 6 for Section 5 learningcontinuous-time network dynamics. We summarize our `2 regular-ization parameter as in Table 7 for Section 6 learning regularly-sampled dynamics.

D Temporal-GNN modelsWe use following temporal-GNN models for structured sequencelearning:• LSTM-GNN: the temporal-GNN with LSTM cell: X[t+ 1] =LSTM(GCN(X[t], G)):

xt = ReLU(Φ ∗ (We ∗X[t] + be))

it = σ(Wiixt + bii +Whiht−1 + bhi)

ft = σ(Wifxt + bif +Whfht−1 + bhf )

gt = tanh(Wigxt + big +Whght−1 + bhg)

ot = σ(Wioxt + bio +Whoht−1 + bho)

ct = ft ∗ ct−1 + it ∗ gtht = ot ∗ tanh(ct)

ˆX[t+ 1] = Wd ∗ ht + bd

(9)

• GRU-GNN: the temporal-GNN with GRU cell: X[t + 1] =

GRU(GCN(X[t], G)):xt = ReLU(Φ ∗ (We ∗X[t] + be))

rt = σ(Wirxt + bir +Whrht−1 + bhr)

zt = σ(Wizxt + biz +Whzht−1 + bhz)

nt = tanh(Winxt + bin + r ∗ (Whnht−1 + bhn))

ht = (1− zt) ∗ nt + zt ∗ ht−1

ˆX[t+ 1] = Wd ∗ ht + bd(10)

• RNN-GNN: the temporal-GNN with RNN cell: X[t + 1] =RNN(GCN(X[t], G)):

xt = ReLU(Φ ∗ (We ∗X[t] + be))

ht = tanh(wihxt + bih + whhht−1 + bhh)

ˆX[t+ 1] = Wd ∗ ht + bd

(11)

We adopt the diffusion operator Φ = D−12 (αI + (1− α)A)D−

12

where A is the adjacency matrix, D is the degree matrix andD = αI + (1 − α)D keeps Φ normalized. The differentialequation system dX

dt= ΦX follows the dynamics of averag-

ing the neighborhood opinion as d−−−→xi(t)dt

= α(1−α)di+α

−−→xi(t) +∑n

j Ai,j1−α√

(1−α)di+α√

(1−α)dj+α

−−−→xj(t) for node i. When α = 0,

Φ averages the neighbors as normalized random walk, when α = 1,Φ captures exponential dynamics without network effects. Here weadopt α = 0.5, namely Φ averages both neighbors and itself asGCN in (Kipf and Welling 2017).

E Results in absolute error.We show corresponding `1 loss error in Table 8,Table 9 and Ta-ble 10 with respect to the normalized `1 loss error in Section 5


Small World


Community

0

Tim

e(a) (b) (c) (d) (e)

Figure 12: Biological mutualistic interaction on different networks.

Table 5: `2 regularization parameter configurations in continuous-time extrapolation predictionGrid Random Power Law Small World Community

HeatDiffusion

No-Encode 1e-3 1e-6 1e-3 1e-3 1e-5No-Graph 1e-3 1e-6 1e-3 1e-3 1e-5No-Control 1e-3 1e-6 1e-3 1e-3 1e-5NDCN 1e-3 1e-6 1e-3 1e-3 1e-5



GeneRegulation

No-Embed 1e-4 1e-4 1e-4 1e-4 1e-4No-Graph 1e-4 1e-4 1e-4 1e-4 1e-4No-Control 1e-4 1e-4 1e-4 1e-4 1e-4NDCN 1e-4 1e-4 1e-4 1e-4 1e-4

learning continuous-time network dynamics and Section 6 learningregularly-sampled dynamics. The same conclusions can be made asin Table 1,Table 2 and Table 3.

F Learning Semantic labelsWe summarize datasets, baselines, and experimental setups in Sec-tion 7 learning semantic labels at the terminal time.

F.1 Datasets and Baselines.We use three standard benchmark datasets (i.e., citation networkCora, Citeseer and Pubmed), and follow the same fixed split schemefor train, validation, and test as in (Yang, Cohen, and Salakhutdinov2016; Kipf and Welling 2017; Thekumparampil et al. 2018). Wesummarize the datasets in Appendix F.1 Table 11. We compareour NDCN model with graph convolution network (GCN) (Kipfand Welling 2017), attention-based graph neural network (AGNN)(Thekumparampil et al. 2018), and graph attention networks (GAT)(Velickovic et al. 2017) with sophisticated attention parameters.

F.2 Experimental setup.For the consistency of comparison with prior work, we follow thesame experimental setup as (Kipf and Welling 2017; Velickovic etal. 2017; Thekumparampil et al. 2018). We train our model basedon the training datasets and get the accuracy of classification resultsfrom the test datasets with 1, 000 labels as summarized in Table 11.Following hyper-parameter settings apply to all the datasets. Weset 16 evenly spaced time ticks in [0, T ] and solve the initial valueproblem of integrating the differential equation systems numericallyby DOPRI5 (Dormand 1996). We train our model for a maximumof 100 epochs using Adam (Kingma and Ba 2015) with learningrate 0.01 and `2-norm regularization 0.024. We grid search the bestterminal time T ∈ [0.5, 1.5] and the α ∈ [0, 1]. We use 256 hiddendimension. We report the mean and standard deviation of resultsfor 100 runs in Table 4. It’s worthwhile to emphasize that in ourmodel there is no running control parameters (i.e. linear connectionlayers in GNNs), no dropout (e.g., dropout rate 0.5 in GCN and 0.6in GAT), no early stop, and no concept of layer/network depth (e.g.,


Small World


Community

0

Tim

e(a) (b) (c) (d) (e)

Figure 13: Gene regulation dynamics on different networks.

Table 6: `2 regularization parameter configurations in continuous-time interpolation predictionGrid Random Power Law Small World Community

HeatDiffusion




GeneRegulation

No-Embed 1e-4 1e-4 1e-4 1e-4 1e-4No-Graph 1e-4 1e-4 1e-4 1e-4 1e-4No-Control 1e-4 1e-4 1e-4 1e-4 1e-4NDCN 1e-4 1e-4 1e-4 1e-4 1e-4

2 layers in GCN and GAT).

G Accuracy over terminal time and αBy capturing the continuous-time network dynamics, our NDCNgives better classification accuracy at terminal time T ∈ R+. In-deed, when the terminal time is too small or too large, the accuracydegenerates because the features of nodes are in under-diffusionor over-diffusion states. We plot the mean accuracy of 100 runs ofour NDCN model over different terminal time T and α as shown inthe following heatmap plots. we find for all the three datasets theiraccuracy curves follow rise and fall pattern around the best terminaltime.

Table 7: `2 regularization parameter configurations in regularly-sampled extrapolation predictionGrid Random Power Law Small World Community

HeatDiffusion

LSTM-GNN 1e-3 1e-3 1e-3 1e-3 1e-3GRU-GNN 1e-3 1e-3 1e-3 1e-3 1e-3RNN-GNN 1e-3 1e-3 1e-3 1e-3 1e-3NDCN 1e-3 1e-6 1e-3 1e-3 1e-5


LSTM-GNN 1e-3 1e-3 1e-3 1e-3 1e-3GRU-GNN 1e-3 1e-3 1e-3 1e-3 1e-3RNN-GNN 1e-3 1e-3 1e-3 1e-3 1e-3NDCN 1e-2 1e-3 1e-4 1e-4 1e-4

GeneRegulation

LSTM-GNN 1e-3 1e-3 1e-3 1e-3 1e-3GRU-GNN 1e-3 1e-3 1e-3 1e-3 1e-3RNN-Control 1e-3 1e-3 1e-3 1e-3 1e-3NDCN 1e-4 1e-4 1e-4 1e-3 1e-3

Table 8: Continuous-time Extrapolation Prediction. Our NDCN predicts different continuous-time network accurately. Eachresult is the `1 error with standard deviation from 20 runs for 3 dynamics on 5 networks for each method.


HeatDiffusion




GeneRegulation


0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5Terminal Time

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Alph

a

14.9 15.5 15.1 28.2 53.8 79.5 82.3 83.3 82.2 81.1 80.0

15.7 15.2 15.4 28.8 54.1 79.9 82.6 83.0 82.1 80.7 80.0

14.8 15.5 15.8 29.9 54.5 80.0 82.5 82.9 81.9 80.5 79.7

14.0 15.2 15.5 31.2 55.1 80.3 82.3 82.6 81.7 80.3 79.9

16.0 15.1 16.1 32.1 54.7 80.6 82.6 82.3 81.4 80.3 79.9

14.5 15.4 16.3 34.4 55.0 80.6 82.7 82.2 81.6 80.6 80.0

13.4 13.8 15.7 36.6 54.9 80.5 82.9 82.0 81.7 80.7 80.0

14.9 14.2 16.3 37.3 54.8 80.3 83.0 81.8 81.4 80.6 79.8

14.1 14.0 15.9 39.5 55.4 79.9 82.5 82.0 81.4 80.5 79.9

14.2 14.9 24.0 41.3 68.0 79.9 82.1 82.1 80.9 79.8 79.2

14.1 14.2 30.8 47.2 59.8 58.5 54.4 50.7 50.1 48.1 47.1 15

30

45

60

75

Figure 14: Mean classification accuracy of 100 runs of ourNDCN model over terminal time and α for the Cora datasetin heatmap plot.

Alpha

0.00.2

0.40.6

0.81.0

Termina

l Time

0.50.7

0.91.1

1.31.5

Accu

racy

0.130.210.290.370.440.520.600.680.760.83

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Figure 15: Mean classification accuracy of 100 runs of ourNDCN model over terminal time and α for the Cora datasetin 3D surface plot.

Table 9: Continuous-time Interpolation Prediction. Our NDCN predicts different continuous-time network accurately. Each resultis the `1 error with standard deviation from 20 runs for 3 dynamics on 5 networks for each method.


HeatDiffusion




GeneRegulation


Table 10: Regularly-sampled Extrapolation Prediction. Our NDCN predicts different structured sequences accurately. Each resultis the `1 error with standard deviation from 20 runs for 3 dynamics on 5 networks for each method.


HeatDiffusion




GeneRegulation


Table 11: Statistics for three real-world citation networkdatasets. N, E, D, C represent number of nodes, edges, fea-tures, classes respectively.

Dataset N E D C Train/Valid/Test

Cora 2, 708 5, 429 1, 433 7 140/500/1, 000Citeseer 3, 327 4, 732 3, 703 6 120/500/1, 000Pubmed 19, 717 44, 338 500 3 60/500/1, 000 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5

Terminal Time

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Alph

a

13.9 13.5 17.1 19.1 39.5 67.2 71.9 71.4 69.8 68.4 66.5

16.0 14.4 16.9 18.7 40.9 67.8 72.2 71.5 70.6 68.9 66.5

16.5 13.1 17.3 18.8 42.2 68.6 72.2 71.5 70.7 68.9 66.4

16.4 15.6 17.6 18.7 43.7 69.5 72.3 71.6 70.9 68.9 66.3

16.5 15.4 18.0 19.0 43.6 69.9 72.2 71.6 71.0 69.2 65.7

16.8 16.7 17.7 19.8 43.5 70.4 72.2 71.6 71.1 69.6 65.5

17.1 16.6 18.5 20.5 43.3 71.3 72.2 71.5 71.2 69.4 65.5

15.4 16.2 16.6 19.8 47.6 72.7 72.4 71.4 70.7 68.5 65.9

17.0 16.9 17.2 18.8 58.3 73.1 72.3 71.1 70.0 68.2 66.0

16.6 16.5 16.6 17.4 63.4 72.0 71.5 69.9 68.3 67.0 64.9

16.4 16.5 16.6 33.3 61.4 60.1 57.1 55.0 53.7 51.4 49.4 15

30

45

60

Figure 16: Mean classification accuracy of 100 runs of ourNDCN model over terminal time and α for the Citeseerdataset.

Alpha

0.00.2

0.40.6

0.81.0

Termina

l Time

0.50.7

0.91.1

1.31.5

Accu

racy

0.130.200.260.330.400.460.530.600.660.73

0.2

0.3

0.4

0.5

0.6

0.7

Figure 17: Mean classification accuracy of 100 runs of ourNDCN model over terminal time and α for the Citeseerdataset in 3D surface plot.

0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5Terminal Time

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Alph

a

53.2 54.1 57.9 73.1 77.6 78.7 78.6 79.2 79.2 78.6 78.0

54.0 55.1 63.1 74.7 78.0 79.1 79.2 79.4 79.5 79.0 78.1

54.5 55.9 68.2 75.9 78.2 79.4 79.4 79.5 79.3 79.1 78.2

55.0 56.7 72.3 76.6 78.5 79.4 79.7 79.6 79.1 79.0 78.2

55.8 60.7 74.4 77.1 78.7 79.6 79.8 79.6 79.2 78.8 78.3

56.7 69.2 75.4 77.3 78.7 79.3 79.8 79.7 79.3 78.9 78.2

60.3 72.7 75.8 77.3 78.7 79.3 79.6 79.6 79.1 78.6 77.9

69.0 74.0 76.0 77.4 78.7 79.3 79.6 79.6 78.6 77.9 77.3

72.7 74.6 76.0 77.4 78.7 78.8 79.2 78.8 78.4 77.5 76.8

73.7 74.6 75.7 77.1 78.0 78.5 78.2 77.9 76.9 76.9 76.3

73.3 73.7 73.8 74.0 73.9 73.3 71.4 69.2 68.3 67.0 65.6 55

60

65

70

75

Figure 18: Mean classification accuracy of 100 runs of ourNDCN model over terminal time and α for the Pubmeddataset.

Alpha

0.00.2

0.40.6

0.81.0

Termina

l Time

0.50.7

0.91.1

1.31.5

Accu

racy

0.530.560.590.620.650.680.710.740.770.80

0.55

0.60

0.65

0.70

0.75

Figure 19: Mean classification accuracy of 100 runs of ourNDCN model over terminal time and α for the Pubmeddataset in 3D surface plot.

neural dynamics on complex networks...we ﬁrst introduce a differential equation system which mod-...

Documents