
Exploiting Graph Regularized Multi-dimensional Hawkes Processesfor Modeling Events with Spatio-temporal Characteristics

Yanchi Liu1, Tan Yan2, Haifeng Chen2

1 Rutgers University, Newark, NJ, USA2 NEC Labs America, Princeton, NJ, USA

[email protected], [email protected], [email protected]

Abstract

Multi-dimensional Hawkes processes (MHP) have been widely used for modeling temporal events. However, when MHP is used for modeling events with spatio-temporal characteristics, the spatial information is often ignored despite its importance. In this paper, we introduce a framework to exploit MHP for modeling spatio-temporal events by considering both temporal and spatial information. Specifically, we design a graph regularization method to effectively integrate the prior spatial structure into MHP for learning the influence matrix between different locations. The prior spatial structure is first represented as a connection graph. Then, a multi-view method is utilized to align the prior connection graph and the influence matrix while preserving the sparsity and low-rank properties of the kernel matrix. Moreover, we develop an optimization scheme using the alternating direction method of multipliers to solve the resulting optimization problem. Finally, the experimental results show that we are able to learn the interaction patterns between different geographical areas more effectively when the prior connection graph is introduced for regularization.

1 Introduction

Large-scale spatio-temporal data have been accumulated in various applications, such as criminal records from municipal systems and bike rental transactions from bike sharing systems. These fine-grained data can naturally be regarded as sequences of events over different locations, where each event may trigger or influence a series of subsequent events under a certain intrinsic spatial structure due to geographical proximity. For example, a pickpocket may attempt to replicate a previous success in nearby locations in the following days. A transaction by a traveler at bike station C may trigger transactions at other stations along this traveler's frequent routes. Such influence patterns reflect the nature and semantics of human behaviors and should be properly modeled for the success of human-centric applications [Wang et al., 2017; Liu et al., 2017; Zhang et al., 2017; Liu et al., 2014].

Recently, multi-dimensional Hawkes processes (MHP) have drawn great attention for analyzing influence patterns between sequential events because of their mutually exciting property. An MHP can be modeled by learning an influence matrix that captures the mutual triggering weights across different dimensions, i.e., event sources, together with a time decay function. However, when learning the influence matrix, existing studies usually do not consider spatial information and mainly focus on modeling temporal dependencies between events: they either fit the native unconstrained MHP [Embrechts et al., 2011] to the observed data, or jointly fit MHP with other processes [Li et al., 2014]. These approaches generally assume no structural knowledge across event dimensions, and the influence matrix is estimated such that the events' arrival times are best fitted. Regarding prior knowledge, pioneering work by Zhou et al. [Zhou et al., 2013] incorporates social structure as constraints to learn influence patterns between users in social networks. It imposes sparsity and low-rank properties, which are common assumptions about social structures, to learn a more reasonable and interpretable influence pattern.

Despite the remarkable success in modeling temporal events, existing MHP studies cannot be directly applied to analyzing spatio-temporal data, since they do not take spatial constraints, a key form of prior knowledge in spatio-temporal modeling, into account. According to the well-known Tobler's first law of geography [Tobler, 1970], everything is related to everything else, but nearby things are more related than distant things. For example, a theft at a street block is much more likely to relate to another theft in nearby blocks than to one that happened 20 miles away. A burst of bike rentals on 2nd Street in Manhattan is more likely to influence the rental load on 5th Street than in Queens. Such spatial laws and constraints serve as valuable knowledge to better understand the intrinsic structure of spatio-temporal events and deserve careful consideration in modeling.

In this paper, we design the Graph regularized Multi-dimensional Hawkes Process (GMHP) to incorporate spatial structure knowledge when learning the influence patterns of spatio-temporal events. Our idea is to represent the knowledge of spatial structures as a graph, and to learn an influence pattern between event locations by both maximizing the likelihood of the Hawkes process and satisfying the constraints from the graph. Specifically, both the graph and the influence pattern can be

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

2475


formed as matrices, where each entry represents, respectively, the prior pairwise spatial knowledge and the influence weight we want to learn between event locations. To satisfy the spatial constraints, a naive formulation would directly compare the two matrices, i.e., minimize the distance between them. However, the influence pattern and the spatial constraints are measured from different domains with different value ranges and underlying physical meanings. For example, the influence weight is modeled from temporal sequences by the Hawkes process and usually has to be a nonnegative value in a certain range, while the link weight in the graph can be any arbitrary value, such as {0, 1} for representing a KNN relationship, floating-point numbers for location distance, and more for complex structures. This makes the two matrices most likely lie in different spaces, and thus not directly comparable.

To address this issue, we adopt the idea of multi-view learning, a powerful framework for fusing data and features from diverse domains. We regard the spatial graph and the influence matrix as two views, and align them by exploiting their subspace embeddings. The spatial graph constraint is then imposed in learning the influence pattern by obtaining a shared subspace for both matrices. More specifically, considering the natural sparsity and low-rank characteristics of the influence matrix, we adopt Nonnegative Matrix Factorization (NMF) methods to learn the shared subspace and two corresponding projection matrices for the two matrices. Furthermore, we develop an optimization scheme under the Alternating Direction Method of Multipliers (ADMM) [Boyd et al., 2011] framework to solve the objective function based on iterative updates of the parameters. In our experiments on both synthetic and real-world data, GMHP outperforms state-of-the-art methods in terms of various metrics.

2 Related Work

2.1 Spatio-temporal Data Modeling

Extensive research has been done to model the relationships between event sequences generated from different locations. To predict traffic flows, [Kamarianakis et al., 2012] uses regime-switching space-time models and a selection operator to predict real-time road traffic. [Lippi et al., 2013] compares a few typical methods for forecasting traffic flow and proposes a seasonal support vector machine function to improve accuracy and computing efficiency. For POI recommendation, geographical information is usually integrated into recommendation models such as collaborative filtering (CF) with a geographical regularization [Liu et al., 2015]. Ye et al. [Ye et al., 2011] consider social influence under the framework of a user-based CF model and model geographical influence with a Bayesian CF model. [Liu et al., 2016] proposes a bi-weighted low-rank graph construction model, which integrates users' interests and sequential preferences with temporal interval assessment. The above work mainly focuses on specific applications, whereas we focus on a more general problem: modeling the generation of and relationships among spatio-temporal events.

2.2 Applications of Hawkes Process

Hawkes processes have been widely used to model temporal and recurrent events. [Marsan and Lengline, 2008] applies the Hawkes process to earthquake interaction with cascade triggering. Yang et al. [Yang and Zha, 2013] propose a generalized model to track memes and the corresponding diffusion network by combining a mixture of Hawkes processes with a language model. Li et al. [Li et al., 2014] combine a topic model with the Hawkes process to simultaneously identify and label search tasks. Li et al. [Li and Zha, 2016] model the influence between householders' energy usage behaviors directly through a probabilistic model, which combines topic models with Hawkes processes. Xu et al. [Xu et al., 2017] focus on the important problem of predicting the so-called "patient flow" from longitudinal electronic health records. Prior knowledge of social networks, such as sparsity and low-rank structure, is considered in [Zhou et al., 2013] to model peer influence between users. The aforementioned work either focuses on learning temporal dependencies, or integrates simple prior knowledge into model design. None of it considers complex constraints, especially geographic graph structures, in modeling Hawkes processes.

3 Graph Regularized Multi-dimensional Hawkes Process

In this section, we present our method by first introducing the background of the Hawkes process and its self-exciting characteristics. Then, we present our model, Graph regularized Multi-dimensional Hawkes Process (GMHP), which incorporates spatial knowledge by adding graph constraints into the modeling of the Hawkes process. After that, we detail the optimization algorithm for GMHP based on the Alternating Direction Method of Multipliers (ADMM) and illustrate the updating process for each part.

3.1 Background of Hawkes Process

A point process is a stochastic process that models events in a time sequence with N events {E_1, E_2, ..., E_N}, where event E_i occurs at time t_i. We use N(t) to denote the number of events happening in the time interval (−∞, t] and H_t to denote the set of events happening before t. Future events can be characterized by a conditional intensity function:

λ(t | H_t) = lim_{Δt → 0} E[N(t + Δt) − N(t) | H_t] / Δt.

The conditional intensity function depicts the expected rate of events at time t conditioned on the past events.

A Hawkes process is a point process with a self-exciting property: previous events trigger the occurrence of future events. More formally, the univariate Hawkes process is defined as

λ(t) = μ + ∫_{−∞}^{t} g(t − s) dN(s),

where μ is the base intensity and g(·) is a kernel function describing the triggering or influence effect of past events on the current event.
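As a concrete sanity check, the intensity under the exponential kernel g(t) = w e^{−wt} (the kernel used later in this paper) can be evaluated directly from past event times. A minimal sketch; the function name, timestamps, and parameter values are illustrative, not from the paper:

```python
import numpy as np

def hawkes_intensity(t, event_times, mu, w):
    """Univariate Hawkes intensity lambda(t) = mu + sum_{t_i < t} g(t - t_i),
    with exponential kernel g(s) = w * exp(-w * s)."""
    past = np.array([ti for ti in event_times if ti < t])
    return mu + np.sum(w * np.exp(-w * (t - past)))

# Base rate 0.2, decay w = 1.0, three past events
lam = hawkes_intensity(5.0, [1.0, 3.0, 4.5], mu=0.2, w=1.0)
```

Each past event contributes an exponentially decaying bump, so recent events dominate the current intensity.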


The univariate Hawkes process can be extended to the multi-dimensional Hawkes process (MHP) to handle multiple types of events happening sequentially. In this case, events of different dimensions trigger each other in addition to triggering themselves. Specifically, given D-dimensional event sequences, the conditional intensity function of dimension i is defined as

λ_i(t) = μ_i + Σ_{j=1}^{D} a_ij ∫_{−∞}^{t} g(t − s) dN_j(s),

where μ_i is the base intensity of dimension i, a_ij ∈ A measures the influence of dimension j on dimension i, and A ∈ R_+^{D×D} is the influence matrix across different event dimensions.

3.2 MHP with Graph Regularization

In this section, we formulate the problem of modeling spatio-temporal events using MHP with graph regularization. Specifically, we first derive the likelihood function of MHP, which serves as a loss function measuring fitness. Then we use a connection matrix to represent the spatial structural knowledge across different event dimensions, and further discuss its sparse and low-rank properties. Such a connection matrix is used as a constraint to guide the learning of the influence matrix A. We maximize the similarity between the two matrices by finding the best subspace alignment between them. After that, considering all the aforementioned factors, we follow the multi-view subspace learning framework to formulate the objective function.

We consider in total N spatio-temporal events observed in D different locations during a time period [0, T]. This D-dimensional event data can be represented as {(t_i, d_i)}_{i=1}^{N}, where each pair (t_i, d_i) represents an event occurring in the d_i-th dimension at time t_i, and a geometric position (x_d, y_d) is associated with the d-th dimension.

For the d-th dimension, the intensity at time t is

λ_d(t) = μ_d + Σ_{d'=1}^{D} Σ_{t_i^{d'} < t} a_{dd'} · g(t − t_i^{d'}).

And the log-likelihood is

L(A, μ) = Σ_{d=1}^{D} { −μ_d T − Σ_{d'=1}^{D} Σ_{t_i^{d'} < T} a_{dd'} · G(T − t_i^{d'})
          + Σ_{t_i^{d} < T} log[ μ_d + Σ_{d'=1}^{D} Σ_{t_j^{d'} < t_i^{d}} a_{dd'} · g(t_i^{d} − t_j^{d'}) ] },

where

g(t) = w e^{−wt},  and  G(t) = ∫_0^t g(s) ds = 1 − e^{−wt}.
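Under the exponential kernel, this log-likelihood can be evaluated in a few lines. The sketch below uses our own function and variable names (not from the paper) and assumes the event stream is sorted by time:

```python
import numpy as np

def mhp_loglik(times, dims, A, mu, w, T):
    """Log-likelihood L(A, mu) of a multi-dimensional Hawkes process with
    g(t) = w*exp(-w*t) and G(t) = 1 - exp(-w*t).
    times: sorted event times in [0, T]; dims: dimension index of each event."""
    times, dims = np.asarray(times, float), np.asarray(dims, int)
    ll = -np.sum(mu) * T
    # Compensator: for each event j, subtract (sum_d a_{d, d_j}) * G(T - t_j)
    ll -= np.sum(A[:, dims].sum(axis=0) * (1.0 - np.exp(-w * (T - times))))
    # Log-intensity at each observed event time
    for i in range(len(times)):
        excite = np.sum(A[dims[i], dims[:i]] * w * np.exp(-w * (times[i] - times[:i])))
        ll += np.log(mu[dims[i]] + excite)
    return ll
```

The compensator term groups the double sum over (d, d') by triggering event, which is why a column sum of A appears per event.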

Graph Constraints

We consider the spatial information of different event dimensions/sources that pervasively exists in spatio-temporal applications, such as the location of each event source, the relationship/similarity of a pair of sources, etc. Each event dimension/source can be abstracted as a node, and such spatial information can naturally be represented as a graph. We introduce a connection matrix E = {e_{dd'}} ∈ R^{D×D} to represent this intrinsic geographical structure of spatio-temporal events, where different similarity measures can be used for different scenarios. For example, 0-1 weighting is one of the simplest and most widely used weighting methods, where e_{dd'} = 1 if and only if nodes d and d' are connected by an edge, and 0 otherwise. The edges between nodes can be effectively modeled through a k-nearest-neighbor graph. Or, if events involve multiple nodes, we can model the edges by community; for example, bike rental stations along a traveler's frequent routes can have edges with each other. Moreover, such a graph can also represent structural relationships in complex systems [Cai et al., 2011], where nodes are connected if their functions are related. In addition to 0-1 graphs, floating-point weights can be assigned to the edges to represent prior knowledge about the relationship between nodes, e.g., distance, similarity, etc. Since E in our paper only measures relationships, our proposed method can be used with different weighting schemes without modification.

Sparsity and Low-rank

In spatio-temporal applications, an event generally influences nearby events but hardly influences events far away, as summarized by the well-known Tobler's first law of geography [Tobler, 1970]. Reflected in both the spatial connection matrix E and the influence matrix A, nodes are expected to connect to only a small portion of all nodes, which indicates the sparsity of both matrices. For the same reason, events in different locations mutually excite each other, which forms communities. The community structure in the influence network implies the low-rank property of both E and A, which was also discovered in social networks by [Zhou et al., 2013]. In light of these findings, both A and E should satisfy these two properties when modeling events with spatio-temporal characteristics.

Objective Function Formulation

We integrate the prior structural knowledge into modeling spatio-temporal events by imposing the connection graph E as a constraint in learning the influence matrix A. An intuitive way to achieve this is to directly compare E and A and minimize their distance, i.e., min ‖A − E‖. However, in practice, A and E are often measured from different domains with different value ranges and underlying physical meanings. The influence weight is modeled from temporal sequences by the Hawkes process and usually has to be a nonnegative value in a certain range, while the link weight in E can be an arbitrary value in different scenarios, as mentioned in the previous section. This makes the two matrices most likely lie in different spaces, and thus not directly comparable. Moreover, the sparse and low-rank characteristics of A and E also make computing ‖A − E‖ not meaningful.

We adopt the multi-view subspace learning framework [Xu et al., 2013] to resolve this issue. We treat A and E as different information sources and seek the best alignment between them. Specifically, we first project A and E into the same subspace U ∈ R^{D×K}, K ≤ D, with

A ≈ U V_1,  E ≈ U V_2,


and compare V_1 ∈ R^{K×D} and V_2 ∈ R^{K×D} in the subspace. Here all the matrices A, E, U, V_1, V_2 are nonnegative, and the columns of the basis U are normalized (sum up to 1). In this way, we are able to align the multiple information sources, i.e., A and E, and exploit a discriminative low-dimensional embedding via Nonnegative Matrix Factorization (NMF). NMF is suitable for this subspace learning task as it provides a non-global basis set that intuitively captures localized parts of objects, e.g., the community structure.

To this end, we propose our Graph regularized Multi-dimensional Hawkes Process (GMHP) algorithm with the following objective function:

min −L(A, μ) + λ_1(‖A − U V_1‖² + ‖E − U V_2‖²) + λ_2 ‖V_1 − V_2‖² + λ_3(‖V_1‖_1 + ‖V_2‖_1)

s.t. A ≥ 0, μ ≥ 0, U ≥ 0, V_1 ≥ 0, V_2 ≥ 0.  (1)

In this objective function, the first term maximizes the log-likelihood of the Hawkes process. The remaining terms embed the influence matrix A and the prior connection graph E into a normalized low-rank subspace U with NMF, and measure the alignment of their views V_1 and V_2 with the l2-norm. Here u_{i·}, the i-th row of U, can be treated as the influence acceptance rate of node i, and v_{·j}, the j-th column of a view, is the influence generated from node j. Moreover, the l1-norms of the views are used to enforce the sparsity of the matrices A and E. The parameters λ_1, λ_2, and λ_3 control the strength of the regularization terms.
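For reference, the regularization part of objective (1) can be evaluated directly from the factor matrices. The helper below uses our own illustrative naming, with the negative log-likelihood term passed in precomputed:

```python
import numpy as np

def gmhp_objective(neg_loglik, A, E, U, V1, V2, lam1, lam2, lam3):
    """Objective (1): -L(A, mu) plus alignment and sparsity regularizers."""
    fro2 = lambda M: float(np.sum(M * M))  # squared Frobenius norm
    return (neg_loglik
            + lam1 * (fro2(A - U @ V1) + fro2(E - U @ V2))
            + lam2 * fro2(V1 - V2)
            + lam3 * (np.abs(V1).sum() + np.abs(V2).sum()))
```

When both factorizations are exact and the two views coincide, only the negative log-likelihood and the l1 sparsity terms remain.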

3.3 The Optimization Algorithm

In this section, we present an optimization algorithm to solve Problem (1). We adopt the Alternating Direction Method of Multipliers (ADMM) framework [Boyd et al., 2011] to develop our solution for the following reasons: 1) the objective function contains l1-norm regularizations, which are hard to handle with many other optimization methods but are handled well by ADMM; 2) ADMM is good at solving large-scale distributed optimization problems, which allows our model to scale to big datasets. Next, we first describe the ADMM framework and its typical formulation. Then, we reformulate Problem (1) as an optimization problem with linear equality constraints involving separable classes of variables, so that it can be solved in the ADMM setting. After that, we design an algorithm to iteratively solve the problem.

Adaptation for ADMM

A typical ADMM problem formulation is as follows:

min_{x_1, x_2} { f(x_1) + g(x_2) : A_1 x_1 = A_2 x_2 },  (2)

where f, g are objective functions, and A_1, A_2 are linear coefficient matrices. To solve such a problem, ADMM iteratively updates x_1 and x_2 in an alternating manner that steers (x_1, x_2) progressively closer to the optimality condition.
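To make formulation (2) concrete, here is a toy instance unrelated to the paper's data, with f(x_1) = ½‖x_1 − b‖², g(x_2) = λ‖x_2‖_1, and constraint x_1 = x_2; the x_2-step is the usual soft-thresholding proximal operator:

```python
import numpy as np

def admm_l1(b, lam, rho=1.0, iters=200):
    """ADMM for min 0.5*||x1 - b||^2 + lam*||x2||_1  s.t.  x1 = x2."""
    x1 = x2 = u = np.zeros_like(b)
    for _ in range(iters):
        x1 = (b + rho * (x2 - u)) / (1.0 + rho)                   # f-prox
        v = x1 + u
        x2 = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # g-prox
        u = u + x1 - x2                                           # dual update
    return x2

x = admm_l1(np.array([3.0, 0.2, -1.5]), lam=1.0)
```

The iterates converge to the soft-thresholded solution, illustrating how ADMM splits a smooth term and a nonsmooth l1 term, which is exactly the difficulty in Problem (1).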

We now adapt Problem (1) to the form of (2) so that ADMM can be applied. We first rewrite the optimization problem in an equivalent form by introducing an auxiliary variable Z:

min_{A ≥ 0, μ ≥ 0} −L(A, μ) + λ_1(‖Z − U V_1‖² + ‖E − U V_2‖²) + λ_2 ‖V_1 − V_2‖² + λ_3(‖V_1‖_1 + ‖V_2‖_1)  s.t. A = Z.

In ADMM, we optimize the augmented Lagrangian of the above problem, which can be expressed as

L_ρ = −L(A, μ) + λ_1(‖Z − U V_1‖² + ‖E − U V_2‖²) + λ_2 ‖V_1 − V_2‖² + λ_3(‖V_1‖_1 + ‖V_2‖_1) + ρ trace(W^T (A − Z)) + (ρ/2) ‖A − Z‖²,

where ρ is the penalty parameter, trace(·) denotes the matrix trace, and W is the dual variable associated with the constraint A = Z. After this reformulation, we solve the augmented Lagrangian problem with an iterative algorithm, which updates A and μ, Z, W, U, V_1, and V_2 in turn.

The algorithm involves the following key iterative steps:

{A, μ}^{k+1} = argmin_{A, μ} L_ρ(A, μ, Z^k, U^k, V_1^k, V_2^k, W^k)

Z^{k+1} = argmin_{Z} L_ρ(A^{k+1}, μ^{k+1}, Z, U^k, V_1^k, V_2^k, W^k)

W^{k+1} = W^k + (A^{k+1} − Z^{k+1})

{U, V_1, V_2}^{k+1} = argmin_{U, V_1, V_2} L_ρ(A^{k+1}, μ^{k+1}, Z^{k+1}, U, V_1, V_2, W^{k+1})

The following sections describe the detailed procedure to update each variable.

Solving A and μ

The optimization problem for A and μ is

A^{k+1}, μ^{k+1} = argmin_{A ≥ 0, μ ≥ 0} f(A, μ),

where

f(A, μ) = −L(A, μ) + (ρ/2) ‖A − Z^k + W^k‖².

We update A and μ by a majorization-minimization algorithm. Given the parameters of the k-th iteration, we minimize a surrogate function that is a tight upper bound of f(A, μ):

Q(A, μ) = −Σ_{i=1}^{N} ( p_{ii} log(μ_{d_i} / p_{ii}) + Σ_{j=1}^{i−1} p_{ij} log( a_{d_i d_j} g(t_i − t_j) / p_{ij} ) )
          + T Σ_{d} μ_d + Σ_{d=1}^{D} Σ_{j=1}^{N} a_{d d_j} G(T − t_j) + (ρ/2) ‖A − Z + W‖²,

where

p_{ii} = μ_{d_i} / ( μ_{d_i} + Σ_{j=1}^{i−1} a_{d_i d_j} g(t_i − t_j) ),

p_{ij} = a_{d_i d_j} g(t_i − t_j) / ( μ_{d_i} + Σ_{j=1}^{i−1} a_{d_i d_j} g(t_i − t_j) ).

By optimizing Q we can solve for A and μ with closed-form solutions:

μ_d^{m+1} = ( Σ_{i ≤ N: d_i = d} p_{ii} ) / T,

a_{dd'}^{m+1} = ( −B + sqrt(B² + 8ρC) ) / (4ρ),

where

B = Σ_{t_j < T: d_j = d'} G(T − t_j) + ρ(−z_{dd'} + w_{dd'}),

C = Σ_{i=1, d_i = d}^{N} Σ_{j < i, d_j = d'} p_{ij}.


Solving Z

When solving Z, the relevant terms of L_ρ are

λ_1 ‖Z − U^k V_1^k‖² + (ρ/2) ‖A^{k+1} − Z + W^k‖²,

which yields the closed-form solution

Z^{k+1} = ( 2λ_1 U^k V_1^k + ρ(A^{k+1} + W^k) ) / (2λ_1 + ρ).
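The Z-update is thus a weighted average of the subspace reconstruction U^k V_1^k and the ADMM target A^{k+1} + W^k. A direct transcription, with our own function naming:

```python
import numpy as np

def update_Z(A_next, W, U, V1, lam1, rho):
    """Closed-form minimizer of lam1*||Z - U V1||^2 + (rho/2)*||A - Z + W||^2."""
    return (2.0 * lam1 * (U @ V1) + rho * (A_next + W)) / (2.0 * lam1 + rho)
```

Setting the gradient 2λ_1(Z − U V_1) − ρ(A − Z + W) to zero recovers exactly this expression.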

Solving U, V_1, V_2

When solving U, V_1, and V_2, the relevant terms of L_ρ are

L = λ_1(‖Z − U V_1‖² + ‖E − U V_2‖²) + λ_2 ‖V_1 − V_2‖² + λ_3(‖V_1‖_1 + ‖V_2‖_1).

The cost function L above is not jointly convex in U, V_1, and V_2. We derive a multiplicative iterative algorithm to solve this problem.

Note that L is minimized subject to U ≥ 0, V_1 ≥ 0, and V_2 ≥ 0. Let Υ ≥ 0, Φ ≥ 0, and Ψ ≥ 0 be the corresponding Lagrange multipliers, and consider the Lagrangian

L' = λ_1(‖Z − U V_1‖² + ‖E − U V_2‖²) + λ_2 ‖V_1 − V_2‖² + λ_3(‖V_1‖_1 + ‖V_2‖_1) + Tr(Υ U^T) + Tr(Φ V_1^T) + Tr(Ψ V_2^T).

Then, taking the partial derivatives with respect to U, V_1, and V_2, we have

∂L'/∂U = −2λ_1 Z V_1^T + 2λ_1 U V_1 V_1^T − 2λ_1 E V_2^T + 2λ_1 U V_2 V_2^T + Υ,

∂L'/∂V_1 = −2λ_1 U^T Z + 2λ_1 U^T U V_1 + 2λ_2 V_1 − 2λ_2 V_2 + λ_3 + Φ,

∂L'/∂V_2 = −2λ_1 U^T E + 2λ_1 U^T U V_2 + 2λ_2 V_2 − 2λ_2 V_1 + λ_3 + Ψ.

According to the KKT conditions Υ .* U = 0, Φ .* V_1 = 0, Ψ .* V_2 = 0, where .* denotes the Hadamard (elementwise) product, we obtain the following updating rules:

U = U .* (Z V_1^T + E V_2^T) ./ (U V_1 V_1^T + U V_2 V_2^T),

V_1 = V_1 .* (2λ_1 U^T Z + 2λ_2 V_2) ./ (2λ_1 U^T U V_1 + 2λ_2 V_1 + λ_3),

V_2 = V_2 .* (2λ_1 U^T E + 2λ_2 V_1) ./ (2λ_1 U^T U V_2 + 2λ_2 V_2 + λ_3).
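One round of these multiplicative updates can be transcribed as below. This is a sketch under the sign convention that keeps every factor nonnegative: the small EPS in each denominator guards against division by zero (an implementation detail, not part of the derivation), and the column normalization of U follows Algorithm 1:

```python
import numpy as np

EPS = 1e-10  # numerical guard, not part of the derivation

def update_UV(Z, E, U, V1, V2, lam1, lam2, lam3):
    """One pass of the multiplicative KKT updates; * and / are elementwise."""
    U = U * (Z @ V1.T + E @ V2.T) / (U @ V1 @ V1.T + U @ V2 @ V2.T + EPS)
    U = U / (U.sum(axis=0, keepdims=True) + EPS)  # normalize columns of U
    V1 = V1 * (2*lam1*(U.T @ Z) + 2*lam2*V2) / (2*lam1*(U.T @ U @ V1) + 2*lam2*V1 + lam3 + EPS)
    V2 = V2 * (2*lam1*(U.T @ E) + 2*lam2*V1) / (2*lam1*(U.T @ U @ V2) + 2*lam2*V2 + lam3 + EPS)
    return U, V1, V2
```

Because the factors multiply nonnegative ratios, nonnegativity of U, V_1, and V_2 is preserved automatically across iterations.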

Algorithm 1 summarizes this optimization algorithm.

4 Experimental Results

In this section, we evaluate the performance of GMHP on both synthetic and real-world data.

4.1 Data Description

The synthetic dataset is used to show that our proposed algorithm can reconstruct the underlying parameters from observed spatio-temporal events. To this end, we construct a D-dimensional Hawkes process with μ generated from a uniform distribution on [0, 0.001] and A generated by integrating local clusters. We scale A such that its spectral radius is 0.8 to ensure the point process is well-defined. In this experiment, we set D = 1000, and 50,000 samples are drawn from the specified Hawkes process. We generate the influence matrix A in the same manner as [Zhou et al., 2013]. Specifically, A is generated as A = W H^T to cover the following two types of influence in the data.

Algorithm 1: GMHP Learning Algorithm
Input: multiple event sequences, E, and parameters
Output: A, μ
1  Initialize μ, A, U, V_1, V_2 randomly
2  Normalize each column of U
3  Z = A, W = 0
4  repeat
5      k = k + 1
6      repeat
7          Update A, μ
8      until convergence
9      Update Z, W
10     repeat
11         Update U, V_1, V_2
12         Normalize each column of U
13     until convergence
14 until convergence
15 return A, μ

• Assortative mixing: W and H are both 1000 × 9 matrices, with the entries in rows 100(i − 1) + 1 to 100(i + 1) of column i, i = 1, ..., 9, sampled randomly from [0, 0.1], and all other entries set to zero. Assortative mixing data demonstrates influence coming from members of the same group.

• Disassortative mixing: W is generated in the same way as in the assortative mixing case, while H has non-zero entries in rows 100(i − 1) + 1 to 100(i + 1) of column 10 − i, i = 1, ..., 9, randomly sampled from [0, 0.1]. Disassortative mixing data demonstrates influence from other groups.
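The block construction of W and H described above can be sketched as follows. This is our own approximation (0-indexed slices, a reduced D for illustration), with the `disassortative` flag reversing the column order of H as described:

```python
import numpy as np

def _block_matrix(D, k, block, scale, rng, flip=False):
    """Column i gets random entries in an overlapping row block; zeros elsewhere."""
    M = np.zeros((D, k))
    for i in range(1, k + 1):
        col = k - i if flip else i - 1        # flip => column 10-i (0-indexed)
        rows = slice(block * (i - 1), min(block * (i + 1), D))
        M[rows, col] = rng.uniform(0.0, scale, size=M[rows, col].shape)
    return M

def make_influence(D, k=9, scale=0.1, radius=0.8, disassortative=False, seed=0):
    """Build A = W @ H.T, then rescale A so its spectral radius equals
    `radius` (0.8 in the paper), keeping the Hawkes process well-defined."""
    rng = np.random.default_rng(seed)
    block = D // (k + 1)
    W = _block_matrix(D, k, block, scale, rng)
    H = _block_matrix(D, k, block, scale, rng, flip=disassortative)
    A = W @ H.T
    A *= radius / np.max(np.abs(np.linalg.eigvals(A)))
    return A
```

The spectral-radius rescaling is what guarantees stability: an MHP with spectral radius below 1 produces almost surely finite cascades.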

E is generated using the same method, but with the non-zero entries of W and H set to 1.

The crime dataset is extracted from the Chicago Police Department's CLEAR system1, which records reported incidents of crime that occurred in the City of Chicago from 2001 to 2017. This dataset contains 6.39 million crime records, with each record including an id, a timestamp, geo-coordinates, etc. We treat each beat (the smallest police geographic area) as an event source and model the crime flow for each beat. In total we have 279 beats in Chicago.

The Citibike dataset is generated by the NYC Bike Sharing System2. This dataset contains station ids, bicycle pick-up/drop-off stations, and bicycle pick-up/drop-off timestamps. In total we have 617 stations with 9.4 million transactions in 2017. We treat each station as an event source and model the bike rental flow for each station. To discover bike rental behavior, we construct the prior spatial connection graph with community detection methods, using transaction numbers as weights between stations.

4.2 Baselines

We compare the proposed method GMHP with the following baselines.

1 https://data.cityofchicago.org
2 https://www.citibikenyc.com


Figure 1: Assortative data. (a) NegLoglik (b) RankCorr (c) RelErr

Figure 2: Disassortative data. (a) NegLoglik (b) RankCorr (c) RelErr

• Full. The regular MHP method with no structure assumed in the influence matrix.

• Sparse. A sparsity constraint is introduced to regularize the structure of the influence matrix A.

• LowRank. A low-rank constraint is introduced to regularize the structure of the influence matrix A.

• ADM4. Both low-rank and sparsity constraints are introduced to regularize the structure of the influence matrix A [Zhou et al., 2013].

Moreover, to show the robustness of using the spatial connection matrix E, we further explore the performance of GMHP with random and incomplete connection matrices.

• GMHP-R. A random connection matrix is introduced to learn the structure of the influence matrix A. We generate the random connection matrix E with the same number of non-zero items as in A, but randomly distributed and set to 1.

• GMHP-I. An incomplete connection matrix is introduced to learn the structure of the influence matrix A. We generate the incomplete connection matrix by randomly removing half of the non-zero items in E.
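The two perturbed connection matrices can be sketched as below; the helper names are hypothetical, since the paper gives only the textual description.

```python
import numpy as np

def random_connection(A, seed=0):
    """GMHP-R: binary matrix with as many non-zeros as A, placed uniformly
    at random over all positions."""
    rng = np.random.default_rng(seed)
    E = np.zeros(A.size)
    E[rng.choice(A.size, size=np.count_nonzero(A), replace=False)] = 1.0
    return E.reshape(A.shape)

def incomplete_connection(E, seed=0):
    """GMHP-I: copy of E with half of its non-zero entries removed at random."""
    rng = np.random.default_rng(seed)
    Ei = E.copy()
    nz = np.flatnonzero(Ei)
    Ei.flat[rng.choice(nz, size=nz.size // 2, replace=False)] = 0.0
    return Ei
```

Both perturbations preserve the binary nature of E, so they can be plugged into the same graph-regularized objective unchanged.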

4.3 Evaluation Metrics
We use the following metrics to evaluate the performance of different methods.

• NegLoglik. The negative log-likelihood of the testing data under the trained model.

• RankCorr. The averaged Kendall's rank correlation coefficient between each row of the real A and that of the estimated A.

• RelErr. The averaged relative error between the estimated parameters â_ij and the true parameters a_ij, i.e., |â_ij − a_ij| / |a_ij| for a_ij ≠ 0 and |â_ij − a_ij| for a_ij = 0.
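Under common conventions, these three metrics can be sketched as follows. Two assumptions to flag: the NegLoglik below uses an exponential kernel β e^{−βΔt} (a standard MHP choice; the paper does not restate its kernel here), and the Kendall's τ is the plain tie-uncorrected variant.

```python
import numpy as np

def neg_loglik(events, mu, A, beta, T):
    """NegLoglik of a time-sorted sequence of (t, u) events under an MHP
    with intensity lam_u(t) = mu_u + sum_{t_i<t} A[u, u_i]*beta*exp(-beta*(t-t_i)).
    mu: base rates (U,); A: influence matrix (U, U); T: observation horizon."""
    loglik = 0.0
    for k, (t, u) in enumerate(events):
        lam = mu[u] + sum(A[u, v] * beta * np.exp(-beta * (t - s))
                          for s, v in events[:k])
        loglik += np.log(lam)
    # compensator: integral of every dimension's intensity over [0, T]
    comp = float(np.sum(mu)) * T + sum(
        float(np.sum(A[:, v])) * (1.0 - np.exp(-beta * (T - s)))
        for s, v in events)
    return comp - loglik

def kendall_tau(x, y):
    """Plain (tie-uncorrected) Kendall's tau between two vectors."""
    n = len(x)
    s = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
            for i in range(n) for j in range(i + 1, n))
    return 2.0 * s / (n * (n - 1))

def rank_corr(A_true, A_est):
    """RankCorr: Kendall's tau averaged over corresponding rows."""
    return float(np.mean([kendall_tau(t, e) for t, e in zip(A_true, A_est)]))

def rel_err(A_true, A_est):
    """RelErr: |a_hat - a|/|a| where a != 0, |a_hat - a| where a == 0, averaged."""
    diff = np.abs(A_est - A_true)
    scale = np.where(A_true != 0, np.abs(A_true), 1.0)
    return float((diff / scale).mean())
```

Lower NegLoglik and RelErr are better; higher RankCorr (closer to 1) is better.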

4.4 Results on Synthetic Data
We first study the influence of different parameter settings, and then present the performance of the models.

Parameter Study
Table 1 shows the performance of GMHP with respect to the values of the three parameters λ1, λ2, and λ3. From the table we can see that as the parameter values grow, the NegLoglik of GMHP first decreases and then increases, and the best performance is achieved at λ1 = 0.8, λ2 = 100, and λ3 = 0.8 for the synthetic data set. The performance is also fairly stable across settings, so the proposed GMHP method generally performs well.

λ1          0.0    0.2    0.4    0.6    0.8    1.0
NegLoglik   5.91   5.23   5.21   5.15   4.97   5.13
log10 λ2    -1     0      1      2      3      4
NegLoglik   14.1   6.23   5.12   4.54   5.11   6.31
λ3          0.0    0.2    0.4    0.6    0.8    1.0
NegLoglik   5.06   4.95   4.93   4.91   4.90   4.91

Table 1: NegLoglik (×10^3) w.r.t. regularization parameters

Performances
Figure 1 plots the results on the assortative mixing data measured by NegLoglik, RankCorr, and RelErr with respect to the number of training samples. It can be observed from the results that


when the number of training samples increases, the RankCorr increases and both NegLoglik and RelErr decrease, indicating that all methods estimate more accurately with more training samples. Moreover, GMHP outperforms the baseline methods on all three metrics. We can see that the prior graph structure knowledge can guide the learning of the parameters and hence improve the estimation significantly. We can also see from the results that the improvements of GMHP are even larger when the sample sizes are small. This is because GMHP can leverage the prior graph structure knowledge, while the other methods only integrate assumptions about the influence patterns between events, with no geographical or community information considered. This ability to learn from fewer events makes GMHP effective at preventing over-fitting.

Similarly, in Figure 2 GMHP outperforms the other baseline methods on the disassortative mixing data. These two experiments demonstrate that GMHP can effectively integrate the prior graph structure into learning.

Robustness
Figure 3(a) plots the results on the assortative mixing data measured by NegLoglik with respect to the number of training samples. It can be observed that as the number of training samples increases, the NegLoglik decreases, indicating that all methods estimate more accurately with more training samples. At the same time, GMHP-R and GMHP-I still obtain stable performances even with random and incomplete connection matrices. Similarly, in Figure 3(b) GMHP-R and GMHP-I obtain stable performances on the disassortative mixing data. These two experiments demonstrate the robustness of GMHP.

Figure 3: Robustness. (a) Assortative data (b) Disassortative data

4.5 Results on Real Data
We apply GMHP to model two real-world datasets, crime data and bike rental data, and compare it with Full, LowRank, Sparse, and ADM4.

Crime Data
In this dataset, each police beat is considered as an event source, where each event indicates a crime record. To construct the connection matrix E for the crime data, we utilize the community information contained in the dataset. The beats are assigned to 77 communities according to the activities of local residents, and we construct E by setting the items between beats in the same community to 1, and all others to 0. That is, beats in the same community are more related than beats outside the community.
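This same-community construction of E amounts to a simple outer comparison of community labels. A minimal sketch (the function name and the `labels` input, one community id per beat, are our illustration):

```python
import numpy as np

def connection_from_communities(labels):
    """Binary connection matrix: E[i, j] = 1 iff units i and j share a
    community label, else 0 (self-connections come out as 1)."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)
```

With 279 beats and 77 communities, `labels` would be a length-279 vector of community ids, and E a 279 × 279 block-diagonal matrix under any ordering that groups beats by community.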

In total we have 196 months' data, and we learn the models from the first n months and test them on the data from

Figure 4: Performances on real data. (a) Crime data (b) Bike data

the (n + 1)-th month. Similar to the experiments on synthetic data, we obtain the NegLoglik with respect to different training periods, as shown in Figure 4(a). From the results we can see that GMHP obtains better performance than the other baseline methods.

Bike Data
In this dataset, each bike station is considered as an event dimension/source, where each event indicates a rental transaction. To construct the connection matrix E for the bike data, we count the start/end station pairs in bike rental transactions. We construct a network with the bike stations as nodes and the numbers of transactions as the weights of the edges between node pairs. Then a modularity-based community detection method is used to detect communities of stations. Stations in the same community are more related than stations outside the community.

In total we have 180 days' data, and we learn the models from the first n days and test them on the data from the (n + 1)-th day. Similar to the experiments on the synthetic and crime data, we obtain the NegLoglik with respect to different training periods, as shown in Figure 4(b). From the results we can see that GMHP obtains better performance than the other baseline methods.

In real applications, there may not be many events collected; for example, the number of crime events happening in a city every day is limited. This is also why our proposed GMHP can be especially useful in real applications where only a small number of events have been collected.

5 Conclusion
In this paper, we developed a framework for modeling spatio-temporal events with a graph regularized multi-dimensional Hawkes process (GMHP). In the proposed GMHP framework, a graph regularization method was designed to integrate the prior spatial structure into MHP for learning the influence matrix between different locations. Specifically, the prior spatial structure is first represented as a connection graph. Then, a multi-view method is utilized to align the prior connection graph and the influence matrix while preserving the sparsity and low-rank properties of the kernel matrix. Moreover, we developed an optimization scheme using the Alternating Direction Method of Multipliers (ADMM) to solve the resulting optimization problem. Finally, the experimental results have shown that our method can learn the interaction patterns between different geographical areas with the prior connection graph introduced for regularization, and that GMHP can effectively model spatio-temporal events.


References

[Boyd et al., 2011] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011.

[Cai et al., 2011] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2011.

[Embrechts et al., 2011] Paul Embrechts, Thomas Liniger, and Lu Lin. Multivariate Hawkes processes: an application to financial data. Journal of Applied Probability, 48(A):367–378, 2011.

[Kamarianakis et al., 2012] Yiannis Kamarianakis, Wei Shen, and Laura Wynter. Real-time road traffic forecasting using regime-switching space-time models and adaptive lasso. Applied Stochastic Models in Business and Industry, 28(4):297–315, 2012.

[Li and Zha, 2016] Liangda Li and Hongyuan Zha. Household structure analysis via Hawkes processes for enhancing energy disaggregation. In IJCAI, pages 2553–2559, 2016.

[Li et al., 2014] Liangda Li, Hongbo Deng, Anlei Dong, Yi Chang, and Hongyuan Zha. Identifying and labeling search tasks via query-based Hawkes processes. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 731–740. ACM, 2014.

[Lippi et al., 2013] Marco Lippi, Matteo Bertini, and Paolo Frasconi. Short-term traffic flow forecasting: An experimental comparison of time-series analysis and supervised learning. IEEE Transactions on Intelligent Transportation Systems, 14(2):871–882, 2013.

[Liu et al., 2014] Qi Liu, Enhong Chen, Hui Xiong, Yong Ge, Zhongmou Li, and Xiang Wu. A cocktail approach for travel package recommendation. IEEE Transactions on Knowledge and Data Engineering, 26(2):278–293, 2014.

[Liu et al., 2015] Bin Liu, Hui Xiong, Spiros Papadimitriou, Yanjie Fu, and Zijun Yao. A general geographical probabilistic factor model for point of interest recommendation. IEEE Transactions on Knowledge and Data Engineering, 27(5):1167–1179, 2015.

[Liu et al., 2016] Yanchi Liu, Chuanren Liu, Bin Liu, Meng Qu, and Hui Xiong. Unified point-of-interest recommendation with temporal interval assessment. In Proceedings of KDD, pages 1015–1024. ACM, 2016.

[Liu et al., 2017] Yanchi Liu, Chuanren Liu, Xinjiang Lu, Mingfei Teng, Hengshu Zhu, and Hui Xiong. Point-of-interest demand modeling with human mobility patterns. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 947–955. ACM, 2017.

[Marsan and Lengline, 2008] David Marsan and Olivier Lengline. Extending earthquakes' reach through cascading. Science, 319(5866):1076–1079, 2008.

[Tobler, 1970] Waldo R Tobler. A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(sup1):234–240, 1970.

[Wang et al., 2017] Pengfei Wang, Yanjie Fu, Guannan Liu, Wenqing Hu, and Charu Aggarwal. Human mobility synchronization and trip purpose detection with mixture of Hawkes processes. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 495–503. ACM, 2017.

[Xu et al., 2013] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.

[Xu et al., 2017] Hongteng Xu, Weichang Wu, Shamim Nemati, and Hongyuan Zha. Patient flow prediction via discriminative learning of mutually-correcting processes. IEEE Transactions on Knowledge and Data Engineering, 29(1):157–171, 2017.

[Yang and Zha, 2013] Shuang-Hong Yang and Hongyuan Zha. Mixture of mutually exciting processes for viral diffusion. In International Conference on Machine Learning, pages 1–9, 2013.

[Ye et al., 2011] Mao Ye, Peifeng Yin, Wang-Chien Lee, and Dik-Lun Lee. Exploiting geographical influence for collaborative point-of-interest recommendation. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 325–334. ACM, 2011.

[Zhang et al., 2017] Junbo Zhang, Yu Zheng, and Dekang Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI, pages 1655–1661, 2017.

[Zhou et al., 2013] Ke Zhou, Hongyuan Zha, and Le Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Artificial Intelligence and Statistics, pages 641–649, 2013.
