SSC Presidential Invited Address (transcript)
-
On the Computational and Statistical Interface and Big Data
Michael I. Jordan, University of California, Berkeley
May 26, 2014
With: Venkat Chandrasekaran, John Duchi, Martin Wainwright and Yuchen Zhang
-
What Is the Big Data Phenomenon?
Science in confirmatory mode (e.g., particle physics)
Science in exploratory mode (e.g., astronomy, genomics)
Measurement of human activity, particularly online activity, is generating massive datasets that can be used (e.g.) for personalization and for creating markets
Sensor networks are becoming pervasive
-
What Are the Conceptual/Mathematical Issues?
The need to control statistical risk under constraints on algorithmic runtime: how do risk and runtime trade off as a function of the amount of data?
Statistical inference with distributed and streaming data: how is inferential quality impacted by communication constraints?
The tradeoff between statistical risk and privacy
Many other issues that require a blend of statistical thinking (e.g., a focus on sampling, confidence intervals, evaluation, diagnostics, causal inference) and computational thinking (e.g., scalability, abstraction)
-
Data as a Resource
Computer science studies the management of resources, such as time and space and energy
Data has not been viewed as a resource, but as a workload
The fundamental issue is that data now needs to be viewed as a resource: the data resource combines with other resources to yield timely, cost-effective, high-quality decisions and inferences
Just as with time or space, it should be the case (to first order) that the more of the data resource the better
(not true in our current state of knowledge)
-
Big Data, Big Problems
Model complexity often grows much faster than the number of data points
Statistical control is thus essential, but such control involves algorithms, and they may (and often do) scale poorly with the number of data points
Even worse, the more data, the less likely a sophisticated algorithm will run in an acceptable time frame, and then we have to back off to cheaper algorithms that may be more error-prone
Or we can subsample, but this requires knowing the statistical value of each data point, which we generally don't know a priori
-
Our Approach
Take (classical) statistical decision theory as a mathematical point of departure
Treat computation, communication, privacy, etc. as constraints on statistical risk
This induces tradeoffs among these quantities and the number of data points
Under the hood: geometry, information theory and optimization
-
Outline
Background on minimax decision theory
Privacy constraints
Communication constraints
Computational constraints (via optimization)
-
Background
In the 1930s, Wald laid the foundations of statistical decision theory.
Given a family of probability distributions $\mathcal{P}$, a parameter $\theta(P)$ for each $P \in \mathcal{P}$, an estimator $\hat\theta$, and a loss $\ell(\theta, \hat\theta)$, define the risk:
$$R_P(\hat\theta) := \mathbb{E}_P\big[\ell(\theta(P), \hat\theta)\big]$$
Minimax principle [Wald '39, '43]: choose the estimator minimizing the worst-case risk:
$$\sup_{P \in \mathcal{P}} \mathbb{E}_P\big[\ell(\theta(P), \hat\theta)\big]$$
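The minimax principle can be seen concretely in a tiny simulation. The sketch below, a minimal illustration and not from the talk, estimates the risk of the sample mean for Bernoulli data under squared-error loss: the risk $p(1-p)/n$ is largest at $p = 0.5$, which is the worst case the minimax principle guards against.

```python
import random

def risk_sample_mean(p, n, trials=20000, seed=0):
    """Monte Carlo estimate of the risk E_P[(theta_hat - theta(P))^2]
    of the sample mean for X_i ~ Bernoulli(p), squared-error loss."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = sum(rng.random() < p for _ in range(n))
        total += (s / n - p) ** 2
    return total / trials

n = 50
# the risk p(1 - p)/n peaks at p = 0.5, so the supremum over the
# Bernoulli family (the quantity minimized by a minimax estimator) is there
risks = {p: risk_sample_mean(p, n) for p in (0.1, 0.3, 0.5)}
```

The worst-case entry `risks[0.5]` matches the analytic value $0.25/n$ up to Monte Carlo error.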
-
Part I: Privacy and Minimax Risk
with John Duchi and Martin Wainwright, University of California, Berkeley
-
Privacy and Risk
Individuals are not generally willing to allow their personal data to be used without control on how it will be used and how much privacy loss they will incur
We will quantify privacy loss via differential privacy
We then treat differential privacy as a constraint on inference via statistical decision theory
This yields (personal) tradeoffs between privacy loss and inferential gain
-
A model of privacy
Local privacy: providers do not trust the collector [Warner '65; Evfimievski et al. '03]
[Diagram: individuals hold private data $X_i \stackrel{\text{iid}}{\sim} P$, $i \in \{1, \ldots, n\}$; each $X_i$ passes through a private channel $Q(\cdot \mid X_i)$ to yield $Z_i$; the estimator is $Z_1^n \mapsto \hat\theta(Z_1^n)$]
-
Definitions of privacy
Definition [Dwork, McSherry, Nissim, Smith '06]: the channel $Q$ is $\alpha$-differentially private if
$$\sup_{S,\, x \in \mathcal{X},\, x' \in \mathcal{X}} \frac{Q(Z \in S \mid x)}{Q(Z \in S \mid x')} \le \exp(\alpha)$$
i.e., the log-likelihood ratio is bounded: $|\log Q(z \mid x) - \log Q(z \mid x')| \le \alpha$
Testing interpretation via Neyman-Pearson [Wasserman & Zhou '10]: given $Z$, one cannot reliably recover $x$
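The definition is easiest to see in the one-bit case. The sketch below, a minimal illustration using Warner's randomized response (the talk's canonical local-privacy primitive), keeps the true bit with probability $e^\alpha/(1+e^\alpha)$; the likelihood ratio then equals $e^\alpha$ exactly, and an affine de-biasing recovers the population mean. The parameters are illustrative.

```python
import math, random

def randomized_response(x, alpha, rng):
    """Warner's randomized response: report the true bit x in {0, 1}
    with probability e^a / (1 + e^a), otherwise flip it."""
    keep = math.exp(alpha) / (1 + math.exp(alpha))
    return x if rng.random() < keep else 1 - x

alpha = 0.5
keep = math.exp(alpha) / (1 + math.exp(alpha))
# the likelihood ratio Q(z | x) / Q(z | x') is keep / (1 - keep) = e^alpha,
# so this channel is exactly alpha-differentially private
assert abs(keep / (1 - keep) - math.exp(alpha)) < 1e-12

# de-biasing: E[Z] = (1 - keep) + theta * (2*keep - 1), so invert affinely
rng = random.Random(0)
theta, n = 0.3, 200000
zbar = sum(randomized_response(rng.random() < theta, alpha, rng)
           for _ in range(n)) / n
theta_hat = (zbar - (1 - keep)) / (2 * keep - 1)
```

Note the variance inflation from de-biasing: the factor $1/(2\cdot\text{keep}-1)$ blows up as $\alpha \to 0$, which is exactly the privacy/risk tradeoff quantified in the propositions below.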
-
Private Minimax Risk
Central object of study: the minimax risk. With a family of distributions $\mathcal{P}$, a parameter $\theta(P)$, and a loss $\ell$ measuring error:
$$\mathfrak{M}_n(\theta(\mathcal{P}), \ell) := \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[\ell(\hat\theta(X_1^n), \theta(P))\big]$$
The $\alpha$-private minimax risk adds an infimum over the family $\mathcal{Q}_\alpha$ of $\alpha$-private channels (i.e., the best $\alpha$-private channel), giving the minimax risk under the privacy constraint:
$$\mathfrak{M}_n(\theta(\mathcal{P}), \ell, \alpha) := \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\big[\ell(\hat\theta(Z_1^n), \theta(P))\big]$$
-
Vignette: private mean (location) estimation
Example: estimate reasons for hospital visits. Patients admitted to hospital for substance abuse; estimate the prevalence of different substances.
A patient's record is a binary vector, e.g. (1 Alcohol, 1 Cocaine, 0 Heroin, 0 Cannabis, 0 LSD, 0 Amphetamines); the population proportions are $\theta_1 = .45$, $\theta_2 = .32$, $\theta_3 = .16$, $\theta_4 = .20$, $\theta_5 = .00$, $\theta_6 = .02$
-
Vignette: mean estimation
Consider estimation of the mean $\theta(P) := \mathbb{E}_P[X] \in \mathbb{R}^d$, with errors measured in the $\ell_\infty$-norm, i.e. $\mathbb{E}[\|\hat\theta - \theta\|_\infty]$, for $\mathcal{P}_d :=$ distributions $P$ supported on $[-1, 1]^d$
Proposition (minimax rate, achieved by the sample mean):
$$\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty) \asymp \min\Big\{1, \sqrt{\tfrac{\log d}{n}}\Big\}$$
Proposition (private minimax rate, for $\alpha = O(1)$):
$$\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty, \alpha) \asymp \min\Big\{1, \sqrt{\tfrac{d \log d}{n\alpha^2}}\Big\}$$
Note: the effective sample size is $n \mapsto n\alpha^2 / d$
-
Optimal mechanism?
Non-private observation: $X = (1, 0, 1, 0, 0)^\top$
Idea 1: add independent noise (e.g., the standard Laplace mechanism): $Z = X + W = (1 + W_1,\, 0 + W_2,\, 1 + W_3,\, 0 + W_4,\, 0 + W_5)^\top$
Problem: the noise magnitude is much too large (this is unavoidable: independent noise addition is provably sub-optimal here)
-
Optimal mechanism
Non-private observation: $X = (1, 0, 1, 0, 0)^\top$
Draw $v$ uniformly in $\{0, 1\}^d$, yielding two views: $v = (0, 1, 1, 0, 0)^\top$ (View 1) and $1 - v = (1, 0, 0, 1, 1)^\top$ (View 2)
With probability $\frac{e^\alpha}{1 + e^\alpha}$, choose whichever of $v$ and $1 - v$ is closer to $X$ (here the closer view overlaps $X$ in 3 coordinates, the farther in 2); otherwise, choose the farther
At the end: compute the sample average of the released vectors and de-bias
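The paired-view sampling step above can be written down and checked directly. The sketch below is a minimal implementation of the mechanism as stated on the slide (binary data; the general mechanism of Duchi, Jordan, and Wainwright is more elaborate): for a small dimension $d$ it computes the channel probabilities $Q(z \mid x)$ in closed form and verifies that the worst-case likelihood ratio is exactly $e^\alpha$, i.e., the channel is $\alpha$-differentially private.

```python
import itertools, math

def channel_prob(z, x, alpha):
    """Q(z | x) for the paired-view mechanism: draw v uniformly in {0,1}^d;
    with prob e^a/(1+e^a) release whichever of v, 1 - v overlaps x in more
    coordinates, otherwise release the other one (ties broken evenly)."""
    d = len(x)
    pi = math.exp(alpha) / (1 + math.exp(alpha))
    overlap = sum(zi == xi for zi, xi in zip(z, x))
    # the complement 1 - z overlaps x in exactly d - overlap coordinates
    if overlap > d - overlap:
        p_choose = pi          # z is the closer view
    elif overlap < d - overlap:
        p_choose = 1 - pi      # z is the farther view
    else:
        p_choose = 0.5         # tie
    # P(v in {z, 1 - z}) = 2 * 2^{-d}; then z is released with p_choose
    return 2 * 0.5 ** d * p_choose

d, alpha = 4, 1.0
pts = list(itertools.product((0, 1), repeat=d))
# Q(. | x) is a probability distribution for every x
for x in pts:
    assert abs(sum(channel_prob(z, x, alpha) for z in pts) - 1) < 1e-12
# worst-case likelihood ratio over all outputs and input pairs
worst = max(channel_prob(z, x, alpha) / channel_prob(z, xp, alpha)
            for z in pts for x in pts for xp in pts)
```

The maximum ratio is attained when $z$ is the closer view for $x$ and the farther view for $x'$, giving $\pi/(1-\pi) = e^\alpha$ exactly.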
-
Empirical evidence
Estimate the proportion of emergency room visits involving different substances
Data source: Drug Abuse Warning Network
[Figure: estimation error as a function of sample size n]
-
Sample size reductions
Given an $\alpha$-private channel $Q$, a pair of distributions $\{P_1, P_2\}$ induces the marginals
$$M_j(S) := \int Q(S \mid x_1, \ldots, x_n)\, dP_j^n(x_1, \ldots, x_n)$$
Question: how much contraction does privacy induce?
Theorem (data processing): for any $\alpha$-private channel $Q$ and an i.i.d. sample of size $n$,
$$D_{\mathrm{kl}}(M_1 \,\|\, M_2) + D_{\mathrm{kl}}(M_2 \,\|\, M_1) \le 4n(e^\alpha - 1)^2\, \|P_1 - P_2\|_{\mathrm{TV}}^2$$
Note: for $\alpha \lesssim 1$, this yields the effective sample size reduction $n \mapsto n\alpha^2$
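The data-processing bound can be sanity-checked numerically in the simplest case. The sketch below, an illustration rather than a proof, takes two Bernoulli sources, pushes them through randomized response (an $\alpha$-private channel), computes the symmetrized KL divergence of the induced output marginals in closed form, and confirms it sits below the theorem's bound with $n = 1$. The specific values of $\alpha$ and the two means are arbitrary.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

alpha, t1, t2 = 0.8, 0.3, 0.7          # privacy level, two Bernoulli means
keep = math.exp(alpha) / (1 + math.exp(alpha))

# marginals of Z = randomized_response(X) when X ~ Bernoulli(t):
# E[Z] = (1 - keep) + t * (2*keep - 1)
m1 = (1 - keep) + t1 * (2 * keep - 1)
m2 = (1 - keep) + t2 * (2 * keep - 1)

sym_kl = kl_bernoulli(m1, m2) + kl_bernoulli(m2, m1)
tv = abs(t1 - t2)                       # total variation between P1 and P2
bound = 4 * (math.exp(alpha) - 1) ** 2 * tv ** 2   # theorem with n = 1
```

The contraction is what drives the lower bounds: the private marginals are far less distinguishable than the raw distributions, so testing (and hence estimation) needs more samples.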
-
Final remarks: privacy
Rough technique: reduction of estimation to testing, then apply information-theoretic testing lower bounds [Le Cam, Hasminskii, Ibragimov, Assouad, Birge, Barron, Yu, ...]
Key: allows identification of new optimal mechanisms
Additional examples: fixed-design regression, convex risk minimization, multinomial estimation, nonparametric density estimation
Almost always: effective sample size reduction $n \mapsto n\alpha^2$; in $d$-dimensional problems, $n \mapsto n\alpha^2/d$
-
Part II: Communication and Minimax Risk
with John Duchi, Martin Wainwright and Yuchen Zhang
University of California, Berkeley
-
Communication constraints
Large data necessitates distributed storage; independent data collection (hospitals); privacy?
Setting: each of $m$ agents holds a sample $X^i = (X^i_1, X^i_2, \ldots, X^i_n)$ of size $n$ and sends a message $Z_i$ to a fusion center, which computes the estimator $\hat\theta$
Question: what are the tradeoffs between communication and statistical utility?
[Yao '79; Abelson '80; Tsitsiklis & Luo '87; Han & Amari '98; Tatikonda & Mitter '04; ...]
-
Minimax Communication
Central object of study: the minimax risk with $B$-bounded communication. With a family of distributions $\mathcal{P}$, a parameter $\theta(P)$, and loss $\|\cdot\|_2^2$, each message $Z_i = \phi_i(X^i)$ is constrained to be at most $B$ bits, and we optimize over the best such protocol:
$$\mathfrak{M}_n(\theta(\mathcal{P}), B) := \inf_{\phi \in \Phi_B} \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[\|\hat\theta(Z_1^m) - \theta(P)\|_2^2\big]$$
-
Vignette: mean estimation
Consider estimation in the normal location family: $X^i_j \stackrel{\text{iid}}{\sim} N(\theta, \sigma^2 I_{d \times d})$ with $\theta \in [-1, 1]^d$
Minimax rate (theorem): when each of the $m$ agents has a sample of size $n$,
$$\mathbb{E}\big[\|\hat\theta(X^1, \ldots, X^m) - \theta\|_2^2\big] \gtrsim \frac{\sigma^2 d}{nm}$$
Minimax rate with $B$-bounded communication (theorem): when each agent has a sample of size $n$ and sends $B$ bits,
$$\frac{d}{B \wedge d} \cdot \frac{1}{\log m} \cdot \frac{\sigma^2 d}{nm} \;\lesssim\; \mathfrak{M}_n(\mathcal{N}_d, B) \;\lesssim\; \frac{d \log m}{B \wedge d} \cdot \frac{\sigma^2 d}{nm}$$
Consequence: each agent must send on the order of $d$ bits for optimal estimation
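A protocol of the kind the theorem covers is easy to sketch: each agent quantizes its local sample mean to a fixed number of bits per coordinate and ships only that. The code below is an illustrative protocol, not the optimal scheme from the talk, and all parameter values are arbitrary; it uses dithered (randomized) rounding so the quantizer is unbiased, and the fusion center simply averages the messages.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, sigma = 8, 200, 50, 1.0
bits_per_coord = 6                      # each agent's budget: B = d * 6 bits
theta = rng.uniform(-1, 1, size=d)

lo, hi = -2.0, 2.0                      # clipping range for local means
levels = 2 ** bits_per_coord

def quantize(x):
    """Dithered uniform quantizer on [lo, hi]: unbiased for x in range."""
    t = (np.clip(x, lo, hi) - lo) / (hi - lo) * (levels - 1)
    f = np.floor(t)
    q = f + (rng.random(x.shape) < (t - f))   # round up with prob frac(t)
    return lo + q / (levels - 1) * (hi - lo)

# each agent sends only its quantized local sample mean to the fusion center
msgs = [quantize((theta + sigma * rng.standard_normal((n, d))).mean(axis=0))
        for _ in range(m)]
est = np.mean(msgs, axis=0)             # fusion center: average the messages
err = float(np.mean((est - theta) ** 2))
```

With enough bits per coordinate the quantization noise is negligible next to the sampling noise $\sigma^2 d / (nm)$, consistent with the theorem's message that order-$d$ bits per agent suffice.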
-
Part III: Computation/Statistics Tradeoffs via Convex Relaxation
with Venkat Chandrasekaran, Caltech
-
Time-Data Tradeoffs
Consider an inference problem with fixed risk; inference procedures are viewed as points in a plot of runtime versus number of samples $n$
Vertical lines (fixed $n$): classical estimation theory, well understood
Horizontal lines (fixed runtime): complexity-theoretic lower bounds, poorly understood; depends on the computational model
Trading off the upper bounds: more data means a smaller runtime upper bound, so we need weaker algorithms for larger datasets
-
An Estimation Problem
Signal $x^\star$ from a known (bounded) set $S$
Noise $z \sim N(0, I)$
Observation model: $y = x^\star + \sigma z$
Observe $n$ i.i.d. samples $y_1, \ldots, y_n$
-
Convex Programming Estimator
The sample mean $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ is a sufficient statistic
Natural estimator: project onto the signal set, $\hat{x}_n = \arg\min_{x \in \mathrm{conv}(S)} \|\bar{y} - x\|_2^2$
Convex relaxation: $\hat{x}_n = \arg\min_{x \in C} \|\bar{y} - x\|_2^2$, where $C$ is a convex set such that $C \supseteq S$
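The relaxation step can be sketched numerically. The example below is illustrative and not from the talk: the signal set $S$ is taken to be $k$-sparse binary vectors and the relaxation $C$ is the $\ell_1$-ball containing it; the estimator is the Euclidean projection of the sample mean onto $C$ (the standard sort-and-threshold projection), which beats the raw sample mean.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection onto the l1-ball {x : ||x||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # sorted magnitudes
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
    tau = (css[rho] - radius) / (rho + 1)        # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(1)
p, n, sigma, k = 200, 25, 1.0, 5
x_star = np.zeros(p)
x_star[:k] = 1.0                                  # sparse signal from S
y = x_star + sigma * rng.standard_normal((n, p))  # n noisy observations
ybar = y.mean(axis=0)                             # sufficient statistic
xhat = project_l1_ball(ybar, k)                   # C: l1-ball, ||x*||_1 = k

raw_err = float(np.sum((ybar - x_star) ** 2))     # unconstrained estimator
rel_err = float(np.sum((xhat - x_star) ** 2))     # relaxed estimator
```

The projection exploits the small Gaussian width of the feasible cone at a sparse point, which is why its error is far below the $\sigma^2 p / n$ error of the raw mean.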
-
Statistical Performance of Estimator
Consider the cone $T_C(x^\star)$ of feasible directions into $C$ at $x^\star$
Theorem: the risk of the estimator satisfies $\mathbb{E}\big[\|\hat{x}_n - x^\star\|_2^2\big] \lesssim \frac{\sigma^2\, g(T_C(x^\star))^2}{n}$, where $g(\cdot)$ denotes Gaussian complexity
Intuition: only the error in the feasible cone matters
Can be refined for better bias-variance tradeoffs
-
Hierarchy of Convex Relaxations
Corollary: to obtain a risk of at most 1, it suffices to take $n \gtrsim \sigma^2\, g(T_C(x^\star))^2$
Key point: if we have access to larger $n$, we can use a larger $C$, i.e., obtain a weaker estimation algorithm
-
Hierarchy of Convex Relaxations
If $S$ is algebraic, then one can obtain a family of outer convex approximations: polyhedral, semidefinite, and hyperbolic relaxations (Sherali-Adams, Parrilo, Lasserre, Garding, Renegar)
Sets are ordered by computational complexity; a central role is played by lift-and-project
-
Example 1
Signal set consists of cut matrices
E.g., collaborative filtering, clustering
-
Example 2
Signal set consists of all perfect matchings in the complete graph
E.g., network inference
-
Example 3
Signal set consists of all adjacency matrices of graphs with only a clique on square-root many nodes
E.g., sparse PCA, gene expression patterns; Kolar et al. (2010)
-
Example 4
Banding estimators for covariance matrices; Bickel-Levina (2007) and many others assume a known variable ordering
Stylized problem: let $M$ be a known tridiagonal matrix
Signal set: the matrices obtained by permuting the rows and columns of $M$ (unknown ordering)
-
Remarks
In several examples, not too many extra samples are required for really simple algorithms
Approximation ratios vs. Gaussian complexities: the approximation ratio might be bad, but it doesn't matter as much for statistical inference
Understand Gaussian complexities of LP/SDP hierarchies, in contrast to theoretical CS
-
Conclusions
Many conceptual and mathematical challenges arise in taking seriously the problem of Big Data
Facing these challenges will require a rapprochement between computer science and statistics, bringing them together at the level of their foundations, thus reshaping both disciplines
This won't happen overnight