
University of Copenhagen

Faculty of Science

Multivariate sparse dynamic process modeling and inference

Niels Richard Hansen, Department of Mathematical Sciences

June 8, 2011

Slide 1/29

University of Copenhagen, Department of Mathematical Sciences

Spike tracks from turtle motor neurons

Turtle motor neurons are used as models for neuronal activity.

The (dead) turtle is stimulated by scratching, and recordings from one or more electrodes register the spikes.

Slide 2/29 — Niels Richard Hansen — Multivariate sparse dynamic process modeling and inference — June 8, 2011


Data

[Figure: raster plot of spike trains from five recorded units over time 5–35 s.]

Data from stimulation period

[Figure: raster plot of the same five units restricted to the stimulation period, time 10–18 s.]

Point process modeling via intensities

We consider a filtered probability space (Ω, F, (F_t), P) and a parametrized family (λ_t(θ))_{t≥0} of positive, predictable processes for θ ∈ Θ.

The minus-log-likelihood is

    l_t(\theta) = \int_0^t \lambda_s(\theta)\,ds - \int_0^t \log \lambda_s(\theta)\,N(ds)

We will study penalized maximum-likelihood estimation of the parameter θ.
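As a minimal numerical sketch (not from the slides), take the simplest case of a constant intensity λ_s(θ) = θ: the minus-log-likelihood reduces to l_t(θ) = θt − N_t log θ, which is minimized at θ̂ = N_t / t. The spike times below are hypothetical.

```python
import numpy as np

def minus_log_likelihood(theta, spikes, t):
    """l_t(theta) = ∫_0^t theta ds - ∫_0^t log(theta) N(ds) for a
    constant intensity: theta * t - N_t * log(theta)."""
    return theta * t - len(spikes) * np.log(theta)

# Hypothetical spike times on [0, 10].
spikes = [0.5, 1.7, 2.2, 4.9, 6.1, 8.3]
t = 10.0
thetas = np.linspace(0.1, 2.0, 1901)
ells = [minus_log_likelihood(th, spikes, t) for th in thetas]
theta_hat = thetas[int(np.argmin(ells))]
print(round(theta_hat, 3))  # the grid minimizer sits at N_t / t = 0.6
```

The grid search only illustrates the shape of l_t; for this toy case the minimizer is available in closed form.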


What are the goals?

λ_s(θ) = ?

• Use ensembles of spike patterns for decoding of external stimuli signals¹.
• Learn the connectivity graph for an ensemble of neurons from data².
• Learn from data how signals propagate at the spike level among connected neurons (the functional forms).
• Learn if the network and/or the functional forms are adaptive to stimuli.

¹ Pillow et al. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 454, 2008.
² Masud et al. Statistical technique for analysing functional connectivity of multiple spike trains. J. Neuroscience Meth., 196, 2011.


A self-exciting model

With (N_t)_{t≥0} the counting process for the spike times and τ_1, ..., τ_{N_t} the jumps, consider the model

    \lambda_t(g) = \phi\Big(\sum_{j : \tau_j < t} g(t - \tau_j)\Big) = \phi\Big(\int_0^{t-} g(t - s)\,N(ds)\Big)

For a multivariate counting process (N^i_t)_{t≥0, i=1,\dots,K},

    \lambda^k_t(g) = \phi\Big(\sum_{i=1}^K \sum_{j : \tau^i_j < t} g^{ik}(t - \tau^i_j)\Big),

which is the non-linear Hawkes process³. With φ(x) = x + d we get the linear Hawkes process.

³ Brémaud, P. and Massoulié, L. Stability of nonlinear Hawkes processes. Ann. Probab. 24(3), 1996.
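A small sketch of the linear Hawkes intensity, assuming an exponential kernel g(s) = α e^{−βs} (a common but here purely illustrative choice) together with φ(x) = x + d; the spike history and parameter values are hypothetical.

```python
import math

def hawkes_intensity(t, spikes, alpha=0.8, beta=2.0, d=0.5):
    """Linear Hawkes intensity λ_t = φ(Σ_{τ_j < t} g(t − τ_j)) with
    φ(x) = x + d and the (assumed) exponential kernel g(s) = α e^{−β s}."""
    excitation = sum(alpha * math.exp(-beta * (t - tau))
                     for tau in spikes if tau < t)
    return excitation + d

# Hypothetical spike history.
spikes = [1.0, 2.5, 3.0]
print(hawkes_intensity(0.5, spikes))  # no spikes yet: baseline d = 0.5
print(hawkes_intensity(3.1, spikes) > hawkes_intensity(6.0, spikes))  # True: excitation decays
```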


Intensities

[Figure: estimated linear filters for v13.1 and v5.1 and the resulting log-intensity over time 12.4–13.2 s.]

Example of estimated linear filters and log intensity with φ = exp.

Local independence

We say that N^k_t is locally independent of N^i_t if the F_t-intensity is σ((N^{-i}_s)_{s∈[0,t)})-adapted. We write i ↛ k, and this defines the oriented local independence graph G = ({1, ..., K}, E).

For the non-linear Hawkes process, (i, k) ∈ E if and only if g^{ik} ≠ 0.

The concept was introduced by Tore Schweder⁴ for finite state space Markov processes and extended to general point processes by Vanessa Didelez⁵.

The local independence graph has a causal connotation related to Granger causality, and it finds applications in the literature on causal inference.

⁴ Schweder, T. Composable Markov Processes. J. Appl. Prob. 7(2), 1970.
⁵ Didelez, V. Graphical models for marked point processes based on local independence. J. R. Statist. Soc. B 70(1), 2008.
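The edge criterion (i, k) ∈ E ⇔ g^{ik} ≠ 0 can be turned into a small graph-construction sketch. The filters below are hypothetical sampled values, and testing against a tolerance is an assumption one would tune when the filters are estimated rather than exact.

```python
def local_independence_graph(filters, tol=1e-8):
    """Edge set E of the oriented local independence graph: (i, k) is an
    edge iff the filter g^{ik} is not identically zero (here: some sampled
    value exceeds tol in absolute value)."""
    return {(i, k) for (i, k), g_vals in filters.items()
            if max(abs(v) for v in g_vals) > tol}

# Hypothetical sampled filters g^{ik} for K = 3 units.
filters = {
    (1, 2): [0.0, 0.4, 0.1],    # unit 1 excites unit 2
    (2, 1): [0.0, 0.0, 0.0],    # zero filter: no edge 2 -> 1
    (3, 2): [-0.2, -0.1, 0.0],  # an inhibitory filter still gives an edge
}
print(sorted(local_independence_graph(filters)))  # [(1, 2), (3, 2)]
```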


Joint likelihood

If λ^i(θ_i) is parametrized by θ_i ∈ Θ_i, the joint minus-log-likelihood is

    \sum_{i=1}^K \underbrace{\int_0^t \lambda^i_s(\theta_i)\,ds - \int_0^t \log \lambda^i_s(\theta_i)\,N^i(ds)}_{l^i_t(\theta_i)},

hence if we have variation independence of θ_1, ..., θ_K we need to minimize each l^i_t(θ_i) separately.

We can think of the i'th term as a likelihood for a (conditional) model of the i'th counting process, and we consider only such models.

Which class of models to consider?


Generalized linear point process models

(X_t)_{t≥0} is a predictable, cadlag process with values in V*, the dual of the vector space V, and

    \Theta(D) = \{\beta \in V \mid X_{s-}\beta \in D \text{ for all } s \in [0, t]\ P\text{-a.s.}\}.

φ : D → [0, ∞) and (Y_t)_{t≥0} is a predictable, cadlag process with values in [0, ∞).

Definition
A generalized linear point process model on [0, t] is the statistical model with parameter space Θ(D) such that for β ∈ Θ(D) the point process on [0, t] has intensity

    \lambda_s = Y_s \phi(X_{s-}\beta)

for s ∈ [0, t].


Stochastic integrals as linear functionals

If g : [0, ∞) → R is a measurable, locally bounded function and (Z_t)_{t≥0} a semi-martingale, we can define the linear filter

    X_t g = \int_0^t g(t - s)\,dZ_s

The function g ↦ X_t g is an ω-wise continuous linear functional on the Sobolev space W^{m,2}([0, t]) for m ≥ 1.

Proof by integration by parts:

    \int_0^t h(s)\,dZ_s = h(t) Z_t - h(0) Z_0 - \int_0^t Z_{s-} h'(s)\,ds
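Since a stochastic integral against a counting process is just a sum over its jump times, the integration-by-parts identity above can be checked numerically; the jump times and the test function h = sin are hypothetical.

```python
import math

# Numerical check of the identity
#   ∫_0^t h(s) dZ_s = h(t) Z_t − h(0) Z_0 − ∫_0^t Z_{s−} h'(s) ds
# when Z is a counting process: the left-hand side is just Σ_j h(σ_j).
h, h_prime = math.sin, math.cos
jumps = [0.3, 1.1, 2.4]  # hypothetical jump times σ_j of Z on [0, t]
t = 3.0

lhs = sum(h(sigma) for sigma in jumps)

def Z_minus(s):
    """Left limit Z_{s−}: number of jumps strictly before s."""
    return sum(1 for sigma in jumps if sigma < s)

# Midpoint-rule approximation of ∫_0^t Z_{s−} h'(s) ds.
n = 100_000
ds = t / n
riemann = sum(Z_minus((k + 0.5) * ds) * h_prime((k + 0.5) * ds) * ds
              for k in range(n))
rhs = h(t) * len(jumps) - h(0) * 0 - riemann
print(abs(lhs - rhs) < 1e-3)  # True
```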


Penalized maximum likelihood estimation

As a function of g ∈ W^{m,2}([0, t]) the minus-log-likelihood function reads

    l_t(g) = \int_0^t Y_s \phi\Big(\int_0^{s-} g(s - u)\,dZ_u\Big)\,ds - \int_0^t \log\Big(Y_s \phi\Big(\int_0^{s-} g(s - u)\,dZ_u\Big)\Big)\,N(ds)

We are optimizing the penalized minus-log-likelihood

    l_t(g) + \lambda \int_0^t (D^m g(s))^2\,ds

over W^{m,2}([0, t]).
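The roughness penalty λ ∫ (D^m g(s))² ds can be approximated from sampled values of g. A sketch for m = 2 via central second differences, checked against the closed form for the (hypothetical) choice g(s) = s³ on [0, 1], where D²g = 6s and ∫_0^1 (6s)² ds = 12.

```python
def roughness_penalty(g_vals, ds):
    """Approximate ∫ (D^2 g)(s)^2 ds (the m = 2 penalty) from equally
    spaced samples of g via central second differences."""
    second = [(g_vals[k - 1] - 2 * g_vals[k] + g_vals[k + 1]) / ds ** 2
              for k in range(1, len(g_vals) - 1)]
    return sum(d * d for d in second) * ds

t, n = 1.0, 2000
ds = t / n
g_vals = [(k * ds) ** 3 for k in range(n + 1)]
# For g(s) = s^3 on [0, 1] the exact value is 12.
print(round(roughness_penalty(g_vals, ds), 2))  # close to 12
```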


Estimates

[Figure: 28 × 28 sparsity pattern of the estimated connectivity matrix, with rows and columns labeled by units (v1.2, v2.1, ..., v16.1), together with estimated filters, e.g. v5.1 → v13.1 and v8.2, v4.2 → v13.1, plotted against time lag 0–0.5 s.]

A theorem

Let τ_1, ..., τ_{N_t} denote the jump times for N.

Theorem
If φ(x) = x + d with domain (−d, ∞), then a minimizer of the penalized minus-log-likelihood function over Θ((−d, ∞)) belongs to the finite-dimensional subspace of W^{m,2}([0, t]) spanned by the functions φ_1, ..., φ_m, the functions

    h_i(r) = \int_0^{\tau_i-} R_1(\tau_i - u, r)\,dZ_u

for i = 1, ..., N_t, together with the function

    f(r) = \int_0^t Y_s \int_0^s R_1(s - u, r)\,dZ_u\,ds = \int_0^t \int_u^t Y_s R_1(s - u, r)\,ds\,dZ_u.

Here

    R_1(s, r) = \int_0^{s \wedge r} \frac{(s - u)^{m-1}(r - u)^{m-1}}{((m-1)!)^2}\,du, \qquad \phi_k(t) = t^{k-1}/(k-1)!

Counting process integrals

If (Z_s)_{0≤s≤t} is a counting process with jumps σ_1, ..., σ_{Z_t}, the h_i basis functions are order 2m splines with knots in

    \{\tau_i - \sigma_j \mid i = 1, \dots, N_t,\ j : \sigma_j < \tau_i\}.
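The knot set above is cheap to compute directly; the jump times below are hypothetical.

```python
def spline_knots(taus, sigmas):
    """Knots {τ_i − σ_j : σ_j < τ_i} of the order-2m spline basis
    functions h_i, for jump times τ_i of N and σ_j of Z."""
    return sorted({tau - sigma for tau in taus
                   for sigma in sigmas if sigma < tau})

# Hypothetical jump times.
taus = [1.0, 2.0]    # jumps of N
sigmas = [0.5, 1.5]  # jumps of Z
print(spline_knots(taus, sigmas))  # [0.5, 1.5]
```

Note that distinct pairs can produce coinciding knots (here 1.0 − 0.5 and 2.0 − 1.5), so the knot set can be smaller than N_t × Z_t.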

Another theorem

Theorem
Suppose φ is continuously differentiable, and let

    \eta_i(r) = \int_0^{\tau_i-} R(\tau_i - u, r)\,dZ_u

and

    f_g(r) = \int_0^t \int_u^t Y_s \phi'\Big(\int_0^{s-} g(s - v)\,dZ_v\Big) R_1(s - u, r)\,ds\,dZ_u.

Then the gradient of l_t at g ∈ Θ(D) is

    \nabla l_t(g) = f_g - \sum_{i=1}^{N_t} \frac{\phi'\big(\int_0^{\tau_i-} g(\tau_i - u)\,dZ_u\big)}{\phi\big(\int_0^{\tau_i-} g(\tau_i - u)\,dZ_u\big)}\, \eta_i.

The problem with explosion

Given a predictable (candidate) intensity process (λ_t)_{t≥0}, does it define a point process? Yes, but the likelihood process

    L_t = \exp\Big(t - \int_0^t \lambda_s\,ds + \int_0^t \log \lambda_s\,N(ds)\Big)

may not be a martingale.

E_P(L_t) = 1 if and only if the intensity defines a point process that does not explode in [0, t].

We only have a dominated statistical model if we restrict our attention to combinations of φ and processes (X_t)_{t≥0} such that the likelihood process is a martingale.

For non-exploding data, L_t(θ) is, however, always sensible for relative comparisons of models.


Generalization of local independence

There are abstract generalizations of local independence to semi-martingales; of particular interest are the solutions to the SDE

    dX_t = G(X_t)\,dt + D\,dB_t

where (B_t)_{t≥0} is p-dimensional Brownian motion, D is diagonal and G : R^p → R^p.

We say that X^k_t is locally independent of X^i_t if G_k does not depend upon the i'th coordinate.
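A simple Euler–Maruyama sketch of the SDE dX_t = G(X_t) dt + D dB_t. The drift below is a hypothetical linear G(x) = Bx with lower-triangular B, so G_1 does not depend on the second coordinate, i.e. 2 ↛ 1 in the sense above; with D = 0 the scheme can be checked against the deterministic solution.

```python
import random

def euler_maruyama(G, D_diag, x0, T, n, seed=0):
    """Euler-Maruyama scheme for dX_t = G(X_t) dt + D dB_t with a
    diagonal diffusion matrix D (given by its diagonal)."""
    rng = random.Random(seed)
    dt = T / n
    x = list(x0)
    for _ in range(n):
        noise = [d * rng.gauss(0.0, 1.0) * dt ** 0.5 for d in D_diag]
        drift = G(x)
        x = [xi + gi * dt + ni for xi, gi, ni in zip(x, drift, noise)]
    return x

# Hypothetical 2-d linear drift G(x) = Bx with lower-triangular B:
B = [[-1.0, 0.0], [0.5, -1.0]]
G = lambda x: [sum(B[i][j] * x[j] for j in range(2)) for i in range(2)]
x_T = euler_maruyama(G, D_diag=[0.0, 0.0], x0=[1.0, 0.0], T=1.0, n=1000)
# With D = 0 the first coordinate solves dx = -x dt, so x_T[0] ≈ e^{-1}.
print(abs(x_T[0] - 2.718281828459045 ** -1) < 1e-2)  # True
```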


A Gaussian process

With G(x) = B(x − A) the solution to the SDE

dX_t = B(X_t − A)dt + D dB_t

becomes a Gaussian process with normal discrete time transitions

X_t ∼ N(ξ(x_0, t), Σ(t))

where

ξ(x, t) = A + e^{tB}(x − A)    (1)

Σ(t) = ∫_0^t e^{sB} D^2 e^{sB^T} ds    (2)

The minus-log-likelihood for discrete observations is

l_x(A, B, D) = Σ_{i=2}^n [ (x_{t_i} − ξ(x_{t_{i−1}}, Δ_i))^T Σ(Δ_i)^{−1} (x_{t_i} − ξ(x_{t_{i−1}}, Δ_i)) + log det Σ(Δ_i) ].

The pseudo minus-log-likelihood for discrete observations is

l_x(A, B) = Σ_{i=2}^n (x_{t_i} − ξ(x_{t_{i−1}}, Δ_i))^T (x_{t_i} − ξ(x_{t_{i−1}}, Δ_i)) = Σ_{i=2}^n ||x_{t_i} − A − e^{Δ_i B}(x_{t_{i−1}} − A)||^2.

The ℓ1-penalized pseudo minus-log-likelihood for discrete observations is

l_x(A, B) = Σ_{i=2}^n ||x_{t_i} − A − e^{Δ_i B}(x_{t_{i−1}} − A)||^2 + λ Σ_{i,j} |B_{ij}|.
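The transition mean ξ and the penalized pseudo-loss above can be sketched in a few lines. This is a minimal pure-Python illustration, not the talk's implementation; the helper names (`mat_exp`, `xi`, `penalized_pseudo_loss`) and the tiny p = 2 example are my own, and the matrix exponential is a truncated power series that is only adequate for small ||tB||.

```python
def mat_exp(B, terms=25):
    """Truncated power series exp(B) = sum_k B^k / k! (fine for small ||B||)."""
    n = len(B)
    E = [[float(i == j) for j in range(n)] for i in range(n)]  # starts at I
    P = [row[:] for row in E]                                  # running power B^k
    fact = 1.0
    for k in range(1, terms):
        P = [[sum(P[i][m] * B[m][j] for m in range(n)) for j in range(n)]
             for i in range(n)]
        fact *= k
        for i in range(n):
            for j in range(n):
                E[i][j] += P[i][j] / fact
    return E

def xi(A, B, x, t):
    """Conditional mean xi(x, t) = A + e^{tB}(x - A)."""
    E = mat_exp([[t * b for b in row] for row in B])
    d = [xk - ak for xk, ak in zip(x, A)]
    return [ak + sum(E[i][j] * d[j] for j in range(len(d)))
            for i, ak in enumerate(A)]

def penalized_pseudo_loss(A, B, xs, dts, lam):
    """l_x(A, B) = sum_i ||x_i - xi(x_{i-1}, dt_i)||^2 + lam * sum |B_ij|."""
    loss = 0.0
    for i in range(1, len(xs)):
        m = xi(A, B, xs[i - 1], dts[i - 1])
        loss += sum((xs[i][k] - m[k]) ** 2 for k in range(len(m)))
    return loss + lam * sum(abs(b) for row in B for b in row)

# Toy check: with A = 0 and B = -I the conditional mean decays towards A.
print(xi([0.0, 0.0], [[-1.0, 0.0], [0.0, -1.0]], [1.0, 1.0], 1.0))
```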

The non-linear penalized least squares problem

Minimization of

l_x(B) = Σ_{i=2}^n ||x_{t_i} − e^{Δ_i B} x_{t_{i−1}}||^2 + λ Σ_{i,j} |B_{ij}|

is a non-linear least squares problem.

A coordinate-wise Gauss-Newton method iteratively optimizes one coordinate at a time in the linearized least squares problem

l_x(B_0 + B) ≃ Σ_{i=2}^n ||r_i(B_0) − D_B e^{Δ_i B_0}(B) x_{t_{i−1}}||^2 + λ Σ_{i,j} |B_{ij}|

with residuals r_i(B) = x_{t_i} − e^{Δ_i B} x_{t_{i−1}}. The computations of e^{Δ_i B} and the derivative D_B e^{Δ_i B} dominate, and the problem does not "decouple" into regression problems for each coordinate.
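To make the coordinate-wise idea concrete, here is the scalar subproblem such a method solves on the linearized residuals: an ℓ1-penalized one-dimensional least squares fit, which has a closed-form soft-thresholding solution. This is an illustrative sketch with hypothetical helper names, not the talk's actual algorithm; in the talk's setting b would play the role of one entry B_{ij}, r the residuals r_i(B_0), and c the corresponding column of the linearized design.

```python
def soft_threshold(z, t):
    """Shrink z towards 0 by t: the prox operator of t * |.|."""
    return (z - t) if z > t else (z + t) if z < -t else 0.0

def coordinate_update(r, c, lam):
    """Minimize sum_i (r_i - c_i * b)^2 + lam * |b| over the scalar b.

    Setting the subgradient to zero gives
    b = soft_threshold(sum_i c_i r_i, lam / 2) / sum_i c_i^2.
    """
    cc = sum(ci * ci for ci in c)
    cr = sum(ci * ri for ci, ri in zip(c, r))
    return soft_threshold(cr, lam / 2.0) / cc

# Small penalty: shrunken least squares coefficient; large penalty: exact zero.
print(coordinate_update([1.0, 2.0], [1.0, 1.0], 1.0))
print(coordinate_update([1.0, 2.0], [1.0, 1.0], 10.0))
```

This closed form is what makes the ℓ1 penalty produce exact zeros coordinate by coordinate, even though, as noted above, the full problem does not decouple.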

Toy example: p = 20, A = 0, D = I

[Figure: the true 20 × 20 matrix B shown as a heatmap (values in [−1, 1]), its eigenvalues (real and imaginary parts), and the 400 entries of B plotted by index.]

Toy example, Lasso estimation with N = 100

[Figure: a sequence of heatmaps of the estimated 20 × 20 matrix (values in [−1, 1]) along the lasso regularization path.]

Toy example, Lasso estimation with N = 100

[Figure: a sequence of plots of the 400 estimated parameters by index (values in [−1, 1]) along the lasso regularization path.]

Is the discrete time AR-process not enough?

For equidistant observations the process is a vector AR(1)-process (with correlated noise)

X_i − X_{i−1} = (e^{ΔB} − I) X_{i−1} + ε_i

Is sparseness of B not related to sparseness of

e^{ΔB} − I = ΔB + O(Δ^2)?
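The expansion only holds to first order: the matrix exponential of a sparse B is generally dense, so e^{ΔB} − I acquires non-zero entries outside the sparsity pattern of B once Δ is not small. A minimal pure-Python check on a hypothetical 3 × 3 chain (my own toy example, not data from the talk), using a truncated power series for the exponential:

```python
def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(B, terms=20):
    """Truncated power series exp(B) = sum_k B^k / k!."""
    n = len(B)
    E = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    P = [row[:] for row in E]
    fact = 1.0
    for k in range(1, terms):
        P = mat_mul(P, B)
        fact *= k
        for i in range(n):
            for j in range(n):
                E[i][j] += P[i][j] / fact
    return E

# A sparse chain 1 -> 2 -> 3 with a stabilizing diagonal.
delta = 1.0
B = [[-1.0, 0.0, 0.0],
     [1.0, -1.0, 0.0],
     [0.0, 1.0, -1.0]]
E = mat_exp([[delta * b for b in row] for row in B])

# B[2][0] = 0, yet (e^{dB} - I)[2][0] = e^{-1}/2 != 0: the zero pattern
# of B is not preserved by the exponential.
print(E[2][0])
```

This is the motivation for penalizing B itself rather than the AR(1) coefficient matrix e^{ΔB} − I.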

Is the discrete time AR-process not enough?

[Figure: a sequence of plots of the 400 estimated parameters by index (values in [−1, 1]) for Δ = 0.5, 1, 1.5, …, 5.]

What about the (asymptotic) variance?

When the spectrum of B has no eigenvalues with positive real part,

Σ(t) → Γ = ∫_0^∞ e^{sB} D^2 e^{sB^T} ds

for t → ∞.

The variance Γ of the invariant distribution solves

(B ⊗ I + I ⊗ B) vec(Γ) = −vec(D^2).

If BB^T = B^T B and BD = DB then BΓ = ΓB and

Γ = −(B + B^T)^{−1} D^2.
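The commuting-case formula is easy to verify numerically. A small sketch with a hypothetical symmetric 2 × 2 drift and D = I (my own toy example, not from the talk), checking that Γ = −(B + B^T)^{−1} D^2 solves the Lyapunov equation BΓ + ΓB^T = −D^2:

```python
def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

B = [[-1.0, 0.5], [0.5, -1.0]]   # symmetric, eigenvalues -0.5 and -1.5 < 0
BpBT = [[2 * B[i][j] for j in range(2)] for i in range(2)]   # B + B^T = 2B
Gamma = [[-x for x in row] for row in inv2(BpBT)]            # -(B + B^T)^{-1}, D = I

# Lyapunov residual B Gamma + Gamma B^T; here B and Gamma are symmetric,
# so Gamma B^T = (B Gamma)^T.
BG = mul2(B, Gamma)
lyap = [[BG[i][j] + BG[j][i] for j in range(2)] for i in range(2)]
print(lyap)  # numerically -I, i.e. -D^2
```

For a general stable B one would instead solve the vectorized system (B ⊗ I + I ⊗ B) vec(Γ) = −vec(D^2) directly.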

Toy example asymptotic variance

[Figure: heatmap of the 20 × 20 asymptotic variance Γ for the toy example, values in [−0.03, 0.03].]

Acknowledgements

• Rune Berg and Susanne Ditlevsen, University of Copenhagen, for collaboration on the neuron data analysis.

• Lisbeth Carstensen, Albin Sandelin and Ole Winther, Bioinformatics Centre, University of Copenhagen, for collaboration on the use of Hawkes processes for modeling regulatory sites on genomes.

• Alexander Sokol, University of Copenhagen, who currently works on estimation of sparse SDEs.

• Patricia Reynaud-Bouret, Nice Sophia-Antipolis University, and Vincent Rivoirard, Université Paris-Sud (Orsay), with whom I work on oracle inequalities for the multivariate Hawkes process.
