Answering Neuroscience Questions from Connectomics Data using Statistical Tools

Joshua T. Vogelstein

Dept of Statistical Science & Mathematics, Duke University

Institute for Data Intensive Engineering and Sciences, Johns Hopkins University

Endeavor Scientist Fellowship, Child Mind Institute

I’ve tried to avoid text being down here so everybody can see everything

Take Home Messages

• Graphs are mathematical objects too!

• Standard (“Euclidean”) statistical tools are inappropriate

• Nonetheless, we can write down statistical distributions over graphs

• We can formally state many neurobiological questions via statistical graph theory (SGT)

• We can map graphs to Euclidean space; we want those mappings to have desired statistical properties such as consistency, robustness, etc.

• Sometimes SGT may be useful

Outline

• Motivation

• Some theory stuff

• (an application)

• Celebrations!

A Concrete Motivating Example

• We estimate graphs from two populations of brains (e.g., different psychiatric conditions, sex, personalities, etc.)

• We want to know: are the two populations different?

• This is like a two-sample t-test for graph-valued observations

What I Do & Don’t Care About (for the purposes of this talk)

• Don’t: How to estimate graphs

• Don’t: Where the graphs came from, e.g., MRI, EM, Calcium, Ephys, etc.

• Do: I assume somebody gave me graphs estimated from neural data, somehow, using some experimental technique, with neurons and synapses wrong/missing, from some species, at some scale, and I don’t care how (for the purposes of this talk)

Formal Statement of Problem

• G1,...,Gn ~ F0, Gn+1,...,Gn+m ~ F1

• H0: F0 = F1

• HA: F0 != F1

• NB: all graphs have the same vertex set (for here, for now)

Graphs are Mathematical Objects Too

• G=(V,E)

• V is a set of vertices (nodes) (perhaps a vertex is a neuron)

• E is a set of edges (arcs/links) (perhaps an edge is a synapse)

• Graphs are simple, meaning: edges are binary and undirected, with no self-loops (for here, for now)

• I am not analyzing functions of graphs (e.g., degree distribution) in this talk; that is an interesting and complementary topic

Why Not Just Use Lasso?

• A is an adjacency matrix, where A(u,v)=1 iff u~v

• Let A & A’ be adjacency matrices of two graphs

• We could vectorize and then use standard techniques, but we might lose some structure from the data

• For example, the rows & cols of A correspond to the same vertices; if we vectorize, standard analysis techniques do not use that information

Recall: All of Statistics

• The statistical properties of a hypothesis test (e.g., its power) depend on a statistical model

• For example, a t-test is optimal under certain assumptions about the data

• But when data are corrupted, robust methods, such as the rank-sum test, have higher power

Conjecture: SGT might be useful to cast and address connectomics questions

Distributions over graphs

• G ~ P, P is some distribution over graphs

• P is discrete, so P(G) is the likelihood of graph G

• Two extreme examples: (i) ER(n,p), (ii) Categorical(theta)

• Number of possible graphs with n vertices?

(draw it; booyah Pillow!)
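To make the counting question concrete: a simple graph on n labeled vertices has C(n,2) vertex pairs, each either connected or not, so there are 2^C(n,2) possible graphs. The sketch below computes that count and the probability ER(n,p) assigns to any one graph with a given number of edges. The function names are illustrative, not from the talk.

```python
from math import comb

def num_simple_graphs(n: int) -> int:
    """Number of simple graphs on n labeled vertices: 2^(n choose 2)."""
    return 2 ** comb(n, 2)

def er_probability(n: int, p: float, num_edges: int) -> float:
    """P(G) under ER(n, p) for any single graph G with `num_edges` edges."""
    num_pairs = comb(n, 2)
    return p ** num_edges * (1 - p) ** (num_pairs - num_edges)

for n in (3, 5, 10):
    print(n, num_simple_graphs(n))
```

ER(n,p) describes all of these graphs with one parameter, while a fully general Categorical(theta) needs one probability per graph, which is why the two sit at opposite extremes.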

Latent Position Random Graphs

• P[A(u,v)=1] = f(u,v) in (0,1)

• Posit the existence of a latent vector for each vertex

• The probability of a connection between u & v is independent of everything else, conditioned on the two latent vectors

• Intuition from: (i) social network analysis, (ii) neuroscience

• We can also include observed attributes for each vertex

Random Dot Product Graphs

• Let Xu in R^d for each u

• f(u,v) = <Xu,Xv>

• X=(X1,...,Xn) can be estimated consistently up to a rotation via eig

• X can be estimated quickly via eig

• For sparse graphs, X can be estimated even with n=10^6 or more

• The stochastic block model is a special case
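A minimal sketch of estimating X "via eig", assuming a dense symmetric adjacency matrix and assuming we keep the d largest eigenvalues (other conventions keep the largest in magnitude). The function name and the choice of `numpy.linalg.eigh` are illustrative, not the talk's code.

```python
import numpy as np

def adjacency_spectral_embedding(A: np.ndarray, d: int) -> np.ndarray:
    """Estimate n x d latent positions from a symmetric n x n matrix A."""
    eigvals, eigvecs = np.linalg.eigh(A)               # symmetric eigendecomposition
    top = np.argsort(eigvals)[::-1][:d]                # indices of the d largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))  # guard against tiny negative values
    return eigvecs[:, top] * scale                     # X_hat = U_d * sqrt(S_d)

# Two-block example: latent positions, then the matrix of edge probabilities.
X = np.array([[0.8, 0.2]] * 3 + [[0.2, 0.8]] * 3)
P = X @ X.T
X_hat = adjacency_spectral_embedding(P, d=2)
print(np.allclose(X_hat @ X_hat.T, P))  # X is recovered only up to a rotation
```

For large sparse graphs, one would presumably swap in a sparse eigensolver such as `scipy.sparse.linalg.eigsh` rather than a dense decomposition.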

Parametric rainbow

[Figure: models placed along an axis of number of parameters, running from "Parametric" to "Massively parametric". Models shown: Independent Bernoulli, Histogram, Ising, RBM, 3rd order MaxEnt, cascaded logistic. Caption: Massively parametric = (practically) nonparametric.]

• Slide from Il Memming Park

• Model of spikes vs graphs

Our Generative Model

• for each graph i, y_i ~ Bernoulli(p) # class

• for each graph i, for each vertex u

• X_u^i | y_i ~ Dirichlet(theta_{y_i}) # latent positions

• for each graph i, for each pair of vertices (u,v),

• A^i(u,v) ~ Bernoulli(<X_u^i, X_v^i>) # edges

• (you can specify a prior on p and theta’s if you want)
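The generative model above can be simulated in a few lines. This is a sketch under stated assumptions: each edge is drawn as a Bernoulli variable with probability equal to the dot product of the two latent positions, graphs are simple (symmetric, no self-loops), and the function name, seed, and theta values are hypothetical.

```python
import numpy as np

def sample_graphs(num_graphs, num_vertices, p, thetas, seed=0):
    """Sample class labels and simple graphs from the two-class model."""
    rng = np.random.default_rng(seed)
    labels = rng.binomial(1, p, size=num_graphs)          # y_i ~ Bernoulli(p)
    graphs = np.zeros((num_graphs, num_vertices, num_vertices), dtype=int)
    for i, y in enumerate(labels):
        X = rng.dirichlet(thetas[y], size=num_vertices)   # rows X_u ~ Dirichlet(theta_y)
        P = X @ X.T                                       # edge probabilities <X_u, X_v>
        upper = np.triu(rng.random(P.shape) < P, k=1)     # draw each pair u < v once
        graphs[i] = upper + upper.T                       # symmetrize; diagonal stays 0
    return labels, graphs

# Hypothetical parameters, chosen only for illustration.
labels, graphs = sample_graphs(
    num_graphs=100, num_vertices=50, p=0.5,
    thetas=[[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
)
print(labels[:10], graphs.shape)
```

Because Dirichlet draws are nonnegative and sum to one, every dot product lies in [0, 1] and is a valid edge probability.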

A Simulated Example

[Figure: for each class, three panels. Top: the latent position densities F_X | 0 and F_X | 1 (dimension 1 vs. dimension 2). Middle: sampled latent positions X_i, i ∈ N_0 and X_j, j ∈ N_1. Bottom: the resulting adjacency matrices A_i, i ∈ N_0 and A_j, j ∈ N_1 (vertex # vs. vertex #).]

Schematic of Our Approach

• Estimate the latent position matrix for each graph

• Compute all pairwise distances between those estimates

• Embed those distances into a low-dimensional space (via MDS)

• Use standard statistical tests on the embedded graphs

• Gretton & others have developed elegant theory for this style of approach

• The art is in choosing the kernel
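The last two steps of the schematic can be sketched as follows, assuming an n x n matrix D of pairwise distances between graphs is already available. Classical MDS is used for the embedding. The permutation test on the distance between class means is a stand-in for "standard statistical tests", chosen here for brevity, and is not necessarily the test used in the talk. All names are illustrative.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed n points into R^k from an n x n matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                    # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:k]            # k largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

def permutation_test(Z, labels, num_perm=1000, seed=0):
    """P-value for H0: F0 = F1, with distance between class means as statistic."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    def stat(lab):
        return np.linalg.norm(Z[lab == 0].mean(axis=0) - Z[lab == 1].mean(axis=0))

    observed = stat(labels)
    exceed = sum(stat(rng.permutation(labels)) >= observed for _ in range(num_perm))
    return (1 + exceed) / (1 + num_perm)           # add-one correction keeps p > 0
```

Given D and a 0/1 class label per graph, `permutation_test(classical_mds(D), labels)` returns a p-value for the two-sample hypothesis.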

Our Distance Between Graphs

• d(G,G’) = min_W || Xhat - W*Xhat’ ||, where W ranges over rotations (latent positions are only estimated up to a rotation)

• Can be solved via SVD: efficient, scalable, exact, awesome
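A sketch of that SVD solution, under a few assumptions: latent position estimates are stored as n x d matrices with one row per vertex (so the alignment multiplies on the right), the norm is Frobenius, and W ranges over all orthogonal matrices, reflections included. This is the standard orthogonal Procrustes solution; the function name is illustrative.

```python
import numpy as np

def procrustes_distance(X, Y):
    """min over orthogonal W of ||X - Y W||_F for n x d matrices X and Y."""
    U, _, Vt = np.linalg.svd(Y.T @ X)    # SVD of the d x d cross-product matrix
    W = U @ Vt                           # optimal orthogonal alignment of Y onto X
    return np.linalg.norm(X - Y @ W)     # Frobenius norm of what is left over

# Rotating a configuration should leave it at distance (numerically) zero.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
angle = 0.7
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
print(procrustes_distance(X, X @ R))
```

Only a d x d matrix is decomposed, so the cost of the SVD does not grow with the number of vertices beyond forming the cross-product.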

Estimating the Illustration

[Figure: six panels. True Latent Positions (latent dim. 1 vs. latent dim. 2); Estimated Latent Positions (est. latent dim. 1 vs. est. latent dim. 2); Distance Matrix (sample # vs. sample #); Class 0 Density Estimate, Class 1 Density Estimate, and Embedded Graphs (each coordinate 1 vs. coordinate 2).]

Power of Our Approach

[Figure: two panels. Power vs. sample size (0 to 200) and power vs. # vertices (0 to 100).]

NB: I am not claiming this is the best possible method; rather, I’m saying that we have a consistent statistical test

Application to Sex

[Figure: six panels. Adjacency matrices A_i, i ∈ N_0 and A_j, j ∈ N_1 (vertex # vs. vertex #); Distance Matrix (sample # vs. sample #); Class 0 Density Estimate, Class 1 Density Estimate, and Embedded Graphs (each coordinate 1 vs. coordinate 2).]

Acknowledgements

• Carey Priebe

• R. Jacob Vogelstein

• Daniel Sussman

• Vince Lyzinski

• Youngser Park

• Minh Tang

• Yummy

• DARPA (XDATA)

• Child Mind Institute

• CRCNS

• You (please interrupt!)

Final Slide!

• Graphs are awesome, and we can treat them as mathematical objects and develop statistical tools specifically for graph-valued data

• We’ve only just begun... we don’t yet have code to conduct most of the analyses that we want

• But we have obtained sufficient theory and empirical intuition to develop such tools as appropriate

• Call me: 443.858.9911, [email protected], http://jovo.me

• Questions?
