Answering Neuroscience Questions from Connectomics Data using Statistical Tools
Joshua T. Vogelstein
Dept of Statistical Science & Mathematics, Duke University
Institute for Data Intensive Engineering and Sciences, Johns Hopkins University
Endeavor Scientist Fellowship, Child Mind Institute
I’ve tried to avoid text being down here so everybody can see everything
Take Home Messages
• Graphs are mathematical objects too!
• Standard (“Euclidean”) statistical tools are inappropriate
• Nonetheless, we can write down statistical distributions over graphs
• We can formally state many neurobiological questions via statistical graph theory (SGT)
• We can map graphs to Euclidean space; we want those mappings to have desirable statistical properties such as consistency, robustness, etc.
• Sometimes SGT may be useful
Outline
• Motivation
• Some theory stuff
• (an application)
• Celebrations!
A Concrete Motivating Example
• We estimate graphs from two populations of brains (e.g., different psychiatric conditions, sex, personalities, etc.)
• We want to know: are the two populations different?
• This is like a two-sample t-test for graph-valued observations
What I Do & Don’t Care About (for the purposes of this talk)
• Don’t: How to estimate graphs
• Don’t: Where the graphs came from, e.g., MRI, EM, Calcium, Ephys, etc.
• Do: I assume somebody gave me graphs estimated from neural data, somehow, using some experimental technique, with neurons and synapses wrong/missing, from some species, at some scale, and I don’t care how (for the purposes of this talk)
Formal Statement of Problem
• G_1,...,G_n ~ F_0, G_{n+1},...,G_{n+m} ~ F_1
• H_0: F_0 = F_1
• H_A: F_0 != F_1
• NB: all graphs have the same vertex set (for here, for now)
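One generic way to test H_0 against H_A for graph-valued data is a permutation test built on pairwise graph distances. The sketch below is only an illustration and not the method developed later in the talk: the Frobenius distance between adjacency matrices, the between-minus-within statistic, and the helper names `permutation_test` and `sample_er` are all assumptions made for this example.

```python
import numpy as np

def permutation_test(D, labels, n_perm=199, seed=0):
    """Two-sample permutation test computed from a pairwise distance matrix D."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    off_diag = ~np.eye(len(labels), dtype=bool)

    def statistic(lab):
        # mean between-group distance minus mean within-group distance
        same = lab[:, None] == lab[None, :]
        return D[~same].mean() - D[same & off_diag].mean()

    observed = statistic(labels)
    # count how often a random relabeling looks at least as extreme
    count = sum(statistic(rng.permutation(labels)) >= observed
                for _ in range(n_perm))
    return observed, (1 + count) / (1 + n_perm)

def sample_er(n, p, rng):
    """Sample a simple (binary, undirected, loop-free) Erdos-Renyi graph."""
    upper = np.triu(rng.random((n, n)) < p, k=1)
    return (upper | upper.T).astype(int)

rng = np.random.default_rng(1)
# hypothetical populations: sparse graphs (F_0) versus dense graphs (F_1)
graphs = ([sample_er(20, 0.1, rng) for _ in range(10)]
          + [sample_er(20, 0.9, rng) for _ in range(10)])
labels = [0] * 10 + [1] * 10
# Frobenius distance between adjacency matrices (an assumed choice)
D = np.array([[np.linalg.norm(a - b) for b in graphs] for a in graphs])
stat, p_value = permutation_test(D, labels)
```

Because all graphs share a vertex set, adjacency matrices can be compared entry by entry here; that would not hold without a known vertex correspondence.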
Graphs Are Mathematical Objects Too
• G=(V,E)
• V is a set of vertices (nodes) (perhaps a vertex is a neuron)
• E is a set of edges (arcs/links) (perhaps an edge is a synapse)
• Graphs are simple, meaning: edges are binary and undirected, with no self-loops (for here, for now)
• I am not analyzing functions of graphs (e.g., degree distribution) in this talk; that is an interesting and complementary topic
Why Not Just Use Lasso?
• A is an adjacency matrix, where A(u,v)=1 iff u~v
• Let A & A’ be adjacency matrices of two graphs
• We could vectorize and then use standard techniques, but we might lose some structure from the data
• For example, row u and column u of A correspond to the same vertex; if we vectorize, standard analysis techniques do not use that information
Recall: All of Statistics
• The statistical properties of a hypothesis test (e.g., its power) depend on a statistical model
• For example, a t-test is optimal under certain assumptions about the data
• But when data are corrupted, robust methods, such as the rank-sum test, have higher power
Conjecture: SGT might be useful to cast and address connectomics questions
Distributions over graphs
• G ~ P, P is some distribution over graphs
• P is discrete, so P(G) is the likelihood of graph G
• Two extreme examples: (i) ER(n,p), (ii) Categorical(theta)
• Number of possible graphs with n vertices?
(draw it; booyah Pillow!)
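As a worked answer to the counting question: for simple graphs on n labeled vertices, each of the C(n,2) vertex pairs is either an edge or not, so there are 2^C(n,2) possible graphs. A Categorical(theta) distribution over all of them therefore needs 2^C(n,2) - 1 free parameters, while ER(n,p) needs one. The small sketch below computes the count and the ER probability of a particular graph; the function names are illustrative only.

```python
from math import comb

def num_simple_graphs(n):
    """Number of simple graphs on n labeled vertices: 2^(n choose 2)."""
    return 2 ** comb(n, 2)

def er_probability(n, p, num_edges):
    """Probability of one specific graph with num_edges edges under ER(n, p)."""
    pairs = comb(n, 2)
    return p ** num_edges * (1 - p) ** (pairs - num_edges)

# the count grows very quickly with n
counts = {n: num_simple_graphs(n) for n in (3, 4, 10)}
```

Even at n = 10 the count is 2^45, which is why fully general (categorical) models are impractical and structured models are needed.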
Latent Position Random Graphs
• P[A(u,v) = 1] = f(u,v) in (0,1)
• Posit the existence of a latent vector for each vertex
• The probability of a connection between u & v is independent of everything else, conditioned on the two latent vectors
• Intuition from: (i) social network analysis, (ii) neuroscience
• We can also include observed attributes for each vertex
Random Dot Product Graphs
• Let Xu in R^d for each u
• f(u,v) = <Xu,Xv>
• X=(X1,...,Xn) can be estimated consistently up to a rotation via eig
• X can be estimated quickly via eig
• For sparse graphs, X can be estimated even with n=10^6 or more
• The stochastic block model is a special case
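A minimal sketch of the eigendecomposition-based estimate mentioned above, assuming latent positions are stored as the rows of an n x d matrix and that the top d eigenvalues are the relevant ones. The function name `adjacency_spectral_embedding` is chosen for this example. To keep the check deterministic, it is applied to the noise-free probability matrix P = X X^T rather than to a sampled adjacency matrix.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Estimate latent positions from a symmetric matrix via its top-d eigenpairs."""
    eigvals, eigvecs = np.linalg.eigh(A)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]         # indices of the d largest
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale              # rows are estimated positions

rng = np.random.default_rng(0)
X = rng.dirichlet([1.0, 1.0, 1.0], size=50)     # hypothetical latent positions
P = X @ X.T                                     # edge probabilities <Xu, Xv>
X_hat = adjacency_spectral_embedding(P, d=3)
# X_hat matches X only up to an orthogonal transformation,
# but the inner products are recovered: X_hat @ X_hat.T is close to P
```

For large sparse graphs, a dense `eigh` would be replaced by a sparse eigensolver such as `scipy.sparse.linalg.eigsh`, which computes only the leading eigenpairs.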
Parametric rainbow
[Figure: models arranged by number of parameters, from parametric to massively parametric; labels include independent Bernoulli, histogram, Ising, RBM, 3rd order MaxEnt, and cascaded logistic. Massively parametric = (practically) nonparametric.]
• Slide from Il Memming Park
• Model of spikes vs graphs
Our Generative Model
• for each graph i, y_i ~ Bernoulli(p) # class
• for each graph i, for each vertex u
• X_u^i | y_i ~ Dirichlet(theta_{y_i}) # latent positions
• for each graph i, for each pair of vertices (u,v),
• A(u,v)^i ~ Bernoulli(<X_u^i,X_v^i>) # edges
• (you can specify a prior on p and theta’s if you want)
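The generative model can be sketched as a sampler. This reading assumes each edge is a Bernoulli draw with success probability <X_u, X_v>, consistent with the random dot product graph definition; the parameter values and the function name `sample_graph` are made up for illustration.

```python
import numpy as np

def sample_graph(n, p, thetas, rng):
    """Draw one (class, latent positions, adjacency matrix) triple."""
    y = int(rng.random() < p)                   # class label ~ Bernoulli(p)
    X = rng.dirichlet(thetas[y], size=n)        # one latent position per vertex
    P = X @ X.T                                 # dot products lie in [0, 1]
    upper = np.triu(rng.random((n, n)) < P, k=1)  # independent edges, u < v
    A = (upper | upper.T).astype(int)           # symmetric, no self-loops
    return y, X, A

rng = np.random.default_rng(0)
thetas = [np.array([5.0, 1.0, 1.0]),            # hypothetical class-0 parameters
          np.array([1.0, 1.0, 5.0])]            # hypothetical class-1 parameters
samples = [sample_graph(n=50, p=0.5, thetas=thetas, rng=rng) for _ in range(20)]
```

Dirichlet draws lie on the probability simplex, so every pairwise dot product is a valid edge probability without any clipping.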
A Simulated Example
[Figure: panels Fx | 0 and Fx | 1 (dimension 1 vs. dimension 2, axes from 0 to 1); Xi and Xj (latent positions for samples in N0 and N1, axes from 0 to 1); Ai and Aj (adjacency matrices for samples in N0 and N1, vertex # vs. vertex #, 0 to 50).]
Schematic of Our Approach
• Estimate the latent position matrix for each graph
• Compute all pairwise distances between those estimates
• Embed those distances into a low-dimensional space (via MDS)
• Use standard statistical tests on the embedded graphs
• Gretton & others have developed elegant theory for this style of approach
• The art is in choosing the kernel
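The embedding step can be illustrated with classical (Torgerson) MDS, which is one common MDS variant; the slides do not say which variant is used, so treat this as an assumption. The sketch takes any symmetric distance matrix D and returns k-dimensional coordinates, and it is checked on points whose distances are exactly Euclidean.

```python
import numpy as np

def classical_mds(D, k):
    """Embed a distance matrix into k dimensions via classical MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                 # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:k]         # k largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

rng = np.random.default_rng(0)
points = rng.normal(size=(12, 2))               # stand-in for 12 "graphs"
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
Z = classical_mds(D, k=2)
D_hat = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
# D_hat reproduces D because these distances are exactly 2-D Euclidean
```

With graph distances, D is generally not exactly Euclidean, so the embedding is an approximation and the choice of k matters. Once each graph is a point in R^k, ordinary two-sample tests apply.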
Our Distance Between Graphs
• d(G,G’) = min_W || Xhat - W*Xhat’ ||, where the minimum is over rotations W
• Can be solved via SVD: efficient, scalable, exact, awesome
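The SVD solution is the classical orthogonal Procrustes problem. The sketch below assumes latent positions are stored as rows of n x d matrices, so the alignment W multiplies on the right (the slide's W*Xhat’ corresponds to storing positions as columns), and it assumes the Frobenius norm; `procrustes_distance` is a name chosen for this example.

```python
import numpy as np

def procrustes_distance(X, Y):
    """min over orthogonal W of ||X - Y W||_F, solved in closed form via SVD."""
    U, _, Vt = np.linalg.svd(Y.T @ X)           # d x d problem, cheap for small d
    W = U @ Vt                                  # optimal orthogonal alignment
    return np.linalg.norm(X - Y @ W)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # a random orthogonal matrix
Y = X @ Q                                       # same configuration, rotated
Z = rng.normal(size=(30, 3))                    # an unrelated configuration

d_rotated = procrustes_distance(X, Y)           # numerically zero
d_other = procrustes_distance(X, Z)             # clearly positive
```

The alignment is needed because latent positions are identifiable only up to rotation, so comparing raw estimates from two graphs would be meaningless. The SVD is of a d x d matrix, so the cost is dominated by forming Y^T X.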
Estimating the Illustration
[Figure: panels True Latent Positions (latent dim. 1 vs. latent dim. 2); Estimated Latent Positions (est. latent dim. 1 vs. est. latent dim. 2); Distance Matrix (sample # vs. sample #, 1 to 100); Class 0 Density Estimate, Class 1 Density Estimate, and Embedded Graphs (coordinate 1 vs. coordinate 2).]
Power of Our Approach
[Figure: two panels, power vs. sample size (0 to 200) and power vs. # vertices (0 to 100).]
NB: I am not claiming this is the best possible method; rather, I’m saying that we have a consistent statistical test
Application to Sex
[Figure: panels Ai and Aj (adjacency matrices for samples in N0 and N1, vertex # vs. vertex #, 0 to 60); Distance Matrix (sample # vs. sample #); Class 0 Density Estimate, Class 1 Density Estimate, and Embedded Graphs (coordinate 1 vs. coordinate 2).]
Acknowledgements
• Carey Priebe
• R. Jacob Vogelstein
• Daniel Sussman
• Vince Lyzinski
• Youngser Park
• Minh Tang
• Yummy
• DARPA (XDATA)
• Child Mind Institute
• CRCNS
• You (please interrupt!)
Final Slide!
• Graphs are awesome, and we can treat them as mathematical objects and develop statistical tools specifically for graph-valued data
• We’ve only just begun... we don’t yet have code to conduct most analyses that we want
• But we have obtained sufficient theory and empirical intuition to develop such tools as appropriate
• Call me: 443.858.9911, [email protected], http://jovo.me
• Questions?