TRANSCRIPT
Extractor-Based Approach to Memory-Sample Tradeoffs
Joint Works with Pravesh Kothari (CMU), Ran Raz (Princeton) and Avishay Tal (UC Berkeley)
Sumegha Garg
(Harvard)
• Increasing scale of learning problems
• Many learning algorithms try to learn a concept/hypothesis by modeling it as a neural network. The algorithm keeps a neural network in memory and updates its weights as new samples arrive. The memory used is the size of the network (low if the network is small).
Learning under Memory Constraints?
Brief Description of the Learning Model
A learner tries to learn f : X → {0,1}, f ∈ H (hypothesis class),
from samples of the form (x_1, f(x_1)), (x_2, f(x_2)), …
that arrive one by one.
[Shamir'14], [SVW'15]: Can one prove unconditional, strong lower bounds on the number of samples needed for learning under memory constraints (when samples are viewed one by one)?
Outline of the Talk
• The first example of such a memory-sample tradeoff: parity learning [Raz'16]
• Spark in the field: subsequent results
• Extractor-based approach to lower bounds [G-Raz-Tal'18]
• Shallow dive into the proofs
• Implications for refutation of CSPs [G-Kothari-Raz'20]
Example: Parity Learning
Parity Learning
x ∈_R {0,1}^n is unknown.
A learner tries to learn x = (x_1, x_2, …, x_n) from
(a_1, b_1), (a_2, b_2), …, (a_m, b_m), where for all t,
a_t ∈_R {0,1}^n and b_t = ⟨a_t, x⟩ = Σ_i a_t(i)·x_i mod 2 (inner product mod 2).
In other words, the learner gets random binary linear equations in x_1, x_2, …, x_n, one by one, and needs to solve them.
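To make the sample model concrete, here is a minimal Python sketch (not from the talk; the helper name parity_stream is illustrative) that generates the stream (a_1, b_1), (a_2, b_2), … for a hidden x:

```python
# A minimal sketch (not from the talk): simulating the parity-learning stream.
# The learner receives pairs (a_t, b_t) with b_t = <a_t, x> mod 2, one at a time.
import random

def parity_stream(x, num_samples, rng=random.Random(0)):
    """Yield samples (a_t, b_t) with a_t uniform in {0,1}^n and b_t = <a_t, x> mod 2."""
    n = len(x)
    for _ in range(num_samples):
        a = [rng.randrange(2) for _ in range(n)]
        b = sum(ai * xi for ai, xi in zip(a, x)) % 2
        yield a, b

# Example: the hidden x from the n = 5 example that follows.
if __name__ == "__main__":
    x = [0, 1, 1, 0, 1]
    for a, b in parity_stream(x, 3):
        print(a, b)
```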
Let's Try Learning (n=5)
x_1 + x_3 + x_4 = 1
x_2 + x_4 + x_5 = 0
x_1 + x_2 + x_4 = 1
x_2 + x_3 = 0
x_1 + x_2 + x_4 + x_5 = 0
x_1 + x_3 = 1
x_2 + x_3 + x_4 + x_5 = 1
x_1 + x_2 + x_5 = 0
Solution: x = (0,1,1,0,1)
Parity Learners
• Solve n independent linear equations (Gaussian elimination): O(n) samples and O(n²) memory
• Try all possibilities of x: O(n) memory but an exponential number of samples
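As a sketch of the first learner (illustrative code, not the speaker's), the following streaming Gaussian elimination over F_2 stores at most n reduced equations, i.e., O(n²) bits, and recovers x = (0,1,1,0,1) from the n = 5 example above:

```python
# Streaming Gaussian elimination over F_2: keep at most n reduced rows (O(n^2) bits),
# add each arriving equation after reducing it against the stored pivots, and
# back-substitute once the system has full rank.  Names are illustrative.

def solve_stream(n, stream):
    rows = {}  # pivot index -> (equation as 0/1 list, right-hand side)
    for a, b in stream:
        a, b = a[:], b
        for p in range(n):                      # reduce against stored pivot rows
            if a[p] == 1 and p in rows:
                pa, pb = rows[p]
                a = [ai ^ pi for ai, pi in zip(a, pa)]
                b ^= pb
        pivot = next((p for p in range(n) if a[p] == 1), None)
        if pivot is not None:
            rows[pivot] = (a, b)
        if len(rows) == n:                      # full rank: back-substitute
            x = [0] * n
            for p in sorted(rows, reverse=True):
                pa, pb = rows[p]
                x[p] = pb ^ (sum(pa[j] * x[j] for j in range(p + 1, n)) % 2)
            return x
    return None  # not enough independent equations yet

# The eight equations from the n = 5 example (variables x1..x5):
eqs = [
    ([1, 0, 1, 1, 0], 1),  # x1 + x3 + x4 = 1
    ([0, 1, 0, 1, 1], 0),  # x2 + x4 + x5 = 0
    ([1, 1, 0, 1, 0], 1),  # x1 + x2 + x4 = 1
    ([0, 1, 1, 0, 0], 0),  # x2 + x3 = 0
    ([1, 1, 0, 1, 1], 0),  # x1 + x2 + x4 + x5 = 0
    ([1, 0, 1, 0, 0], 1),  # x1 + x3 = 1
    ([0, 1, 1, 1, 1], 1),  # x2 + x3 + x4 + x5 = 1
    ([1, 1, 0, 0, 1], 0),  # x1 + x2 + x5 = 0
]
print(solve_stream(5, iter(eqs)))  # -> [0, 1, 1, 0, 1]
```

The second learner can be implemented by, e.g., cycling through candidate values of x and discarding a candidate as soon as an arriving equation contradicts it: O(n) bits of memory, but roughly 2^n samples.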
Raz's Breakthrough '16
Any algorithm for parity learning of size n requires either n²/25 bits of memory or an exponential number of samples.
Memory-Sample Tradeoff
Motivations from Other Fields
Bounded Storage Cryptography: [Raz'16, GZ'19, VV'16, KRT'16, TT'18, GZ'19, JT'19, DTZ'20, GZ'21]
Key's length: n; encryption/decryption time: n
Unconditional security, if the attacker's memory size is < n²
Complexity Theory: Time-space lower bounds have been studied in many models [BJS'98, Ajt'99, BSSV'00, For'97, FLvMV'05, Wil'06, …]
Subsequent Results
Subsequent Generalizations
• [KRT'17]: Generalization to learning sparse parities
• [Raz'17, MM'17, MT'17, MM'18, DKS'19, GKLR'21]: Generalization to a larger class of learning problems
[Garg-Raz-Tal'18]
Extractor-based approach to proving memory-sample tradeoffs for a large class of learning problems.
Implied all the previous results + new lower bounds.
Closely follows the proof technique of [Raz'17].
[Beame-Oveis Gharan-Yang'18] independently proved related lower bounds.
Subsequent Generalizations
[Sharan-Sidford-Valiant'19]: Any algorithm performing linear regression over a stream of d-dimensional examples either uses a quadratic amount of memory or exhibits a slower rate of convergence than can be achieved without memory constraints.
First-order methods for learning may have a provably slower rate of convergence.
More Related Work
• [GRT'19] Learning under multiple passes over the stream
• [DS'18, AMN'18, DGKR'19, GKR'20] Distinguishing and testing under memory or communication constraints
• [MT'19, GLM'20] Towards a characterization of memory-bounded learning
• [GRZ'20] Comments on quantum memory-bounded learning
Extractor-Based Approach to Lower Bounds [G-Raz-Tal'18]
Learning Problem as a Matrix
A, X: finite sets; M : A × X → {−1, 1}: a matrix
X: concept class = {0,1}^n
A: possible samples = {0,1}^n
Learning Problem as a Matrix
M : A × X → {−1, 1}: a matrix
x ∈_R X is unknown. A learner tries to learn x from a stream (a_1, b_1), (a_2, b_2), …, (a_m, b_m), where for all t: a_t ∈_R A and b_t = M(a_t, x).
[Figure: the matrix M, with the column indexed by x and rows indexed by a_1, a_2, …; the revealed entries are M(a_1, x), M(a_2, x), …]
Main Theorem
Assume that any submatrix of M with at least 2^{−k}·|A| rows and at least 2^{−ℓ}·|X| columns has bias of at most 2^{−r}. Then, any learning algorithm requires either Ω(k·ℓ) memory bits or 2^{Ω(r)} samples.
[Figure: a submatrix with 2^{−k}·|A| rows and 2^{−ℓ}·|X| columns]
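To illustrate the extractor-style condition in the theorem, here is a toy sketch (my own, with arbitrary submatrix sizes, not from the talk) that builds the ±1 parity matrix M(a, x) = (−1)^{⟨a, x⟩} for a small n and spot-checks the bias of a few random large submatrices. The theorem requires the bias bound for every large submatrix; the code only samples random ones.

```python
# Toy illustration of the extractor condition (assumed setup, small n):
# M(a, x) = (-1)^<a, x> for parity learning.  Sample random submatrices and
# measure their bias |average entry|; for parity, large submatrices have tiny bias.
import itertools, random

n = 8
rng = random.Random(1)
A = list(itertools.product([0, 1], repeat=n))   # possible samples (rows)
X = list(itertools.product([0, 1], repeat=n))   # concept class (columns)

def M(a, x):
    return -1 if sum(ai * xi for ai, xi in zip(a, x)) % 2 else 1

def bias(rows, cols):
    total = sum(M(a, x) for a in rows for x in cols)
    return abs(total) / (len(rows) * len(cols))

for _ in range(5):
    rows = rng.sample(A, len(A) // 4)           # a 2^{-2} fraction of the rows
    cols = rng.sample(X, len(X) // 4)           # a 2^{-2} fraction of the columns
    print(f"bias of random {len(rows)}x{len(cols)} submatrix: {bias(rows, cols):.4f}")
```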
Applications
Low-degree polynomials: A learner tries to learn an n-variate multilinear polynomial p of degree at most d over F_2, from random evaluations of p over F_2^n.
Ω(n^{d+1}) memory or 2^{Ω(n)} samples
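A small sketch of this learning problem (illustrative names, not from the talk): pick a hidden multilinear polynomial of degree at most d over F_2 and stream random evaluations of it.

```python
# Sketch: random evaluations of a hidden n-variate multilinear polynomial of
# degree <= d over F_2.  A polynomial is represented as a list of monomials,
# each monomial being a tuple of variable indices (the empty tuple is the constant 1).
import itertools, random

def random_polynomial(n, d, rng):
    monomials = [m for size in range(d + 1) for m in itertools.combinations(range(n), size)]
    return [m for m in monomials if rng.randrange(2)]   # keep each monomial with prob. 1/2

def evaluate(poly, z):
    return sum(all(z[i] for i in mono) for mono in poly) % 2

def evaluation_stream(poly, n, num_samples, rng):
    for _ in range(num_samples):
        z = tuple(rng.randrange(2) for _ in range(n))
        yield z, evaluate(poly, z)

rng = random.Random(0)
p = random_polynomial(n=6, d=2, rng=rng)
for z, value in evaluation_stream(p, n=6, num_samples=3, rng=rng):
    print(z, value)
```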
Applications
Learning from sparse equations: A learner tries to learn x = (x_1, …, x_n) ∈ {0,1}^n from random sparse linear equations of sparsity ℓ over F_2 (each equation depends on ℓ variables).
Ω(n·ℓ) memory or 2^{Ω(ℓ)} samples
We will use this result for proving sample lower bounds for refuting CSPs under memory constraints.
Proof Overview
Branching Program (length m, width w)
Each layer represents a time step. Each vertex represents a memory state of the learner (w = 2^{memory}). Each non-leaf vertex has 2^{n+1} outgoing edges, one for each (a, b) ∈ {0,1}^n × {−1, 1}.
[Figure: a layered directed graph; edges are labeled by pairs (a_1, b_1), (a_2, b_2), …, (a_m, b_m)]
Branching Program (length m, width w)
The samples (a_1, b_1), …, (a_m, b_m) define a computation path. Each vertex v in the last layer is labeled by some x̃_v ∈ {0,1}^n. The output is the label x̃_v of the vertex reached by the path.
[Figure: the same layered graph, with the computation path defined by the samples highlighted]
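The branching-program view can be phrased as a state machine: a learner with a bounded number of memory bits has at most w = 2^{memory} states, each sample (a, b) drives one transition, and the final state determines the output. A minimal sketch (all names illustrative, not from the talk):

```python
# Sketch of the branching-program view of a bounded-memory learner:
# width w = 2^memory states per layer, one transition per sample (a, b),
# and an output label attached to the final state.

class BoundedMemoryLearner:
    def __init__(self, memory_bits, transition, output_label):
        self.num_states = 2 ** memory_bits      # width w of the branching program
        self.transition = transition            # (state, a, b) -> next state
        self.output_label = output_label        # state -> guess for x

    def run(self, stream):
        state = 0                               # start vertex of the program
        for a, b in stream:                     # the samples define the computation path
            state = self.transition(state, tuple(a), b) % self.num_states
        return self.output_label(state)         # label of the vertex reached

# Example: a (weak) 1-bit learner that only remembers the parity of the labels seen.
learner = BoundedMemoryLearner(
    memory_bits=1,
    transition=lambda s, a, b: s ^ b,
    output_label=lambda s: [s],
)
print(learner.run([([1, 0, 1], 1), ([0, 1, 1], 0), ([1, 1, 0], 1)]))  # -> [0]
```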
Proof Overview
P_{x|v} = distribution of x conditioned on the event that the computation path reaches v
Significant vertices: v s.t. ||P_{x|v}||_2 ≥ 2^ℓ · 2^{−n}
Pr(v) = probability that the path reaches v
We prove that if v is significant, then Pr(v) ≤ 2^{−Ω(k·ℓ)}.
Hence, if there are fewer than 2^{Ω(k·ℓ)} vertices, then with high probability we reach a non-significant vertex.
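As a toy numerical illustration of these definitions (not from the talk), the following tracks the posterior P_{x|v} for a full-memory learner on a tiny parity instance (n = 3) and reports its L2 norm as a multiple of the uniform prior's norm; the norm convention assumed here is ||P||_2 = (E_x[P(x)²])^{1/2}, under which the uniform distribution has norm 2^{−n} and the significance threshold above asks for a factor of 2^ℓ.

```python
# Toy illustration of P_{x|v}: the posterior over x after the computation path has
# seen some equations, for parity learning with n = 3.  Assumed convention:
# ||P||_2 = sqrt(E_x[P(x)^2]), so the uniform prior has norm 2^{-n}.
import itertools, math

n = 3
concepts = list(itertools.product([0, 1], repeat=n))

def posterior(equations):
    """Uniform prior over {0,1}^n conditioned on satisfying every equation (a, b)."""
    consistent = [x for x in concepts
                  if all(sum(ai * xi for ai, xi in zip(a, x)) % 2 == b for a, b in equations)]
    return {x: (1 / len(consistent) if x in consistent else 0.0) for x in concepts}

def l2(P):
    return math.sqrt(sum(p * p for p in P.values()) / len(P))

uniform_norm = 2 ** (-n)
eqs = [((1, 0, 1), 1), ((0, 1, 1), 0)]
for i in range(len(eqs) + 1):
    P = posterior(eqs[:i])
    print(f"after {i} equations: ||P_x|v||_2 = {l2(P):.4f} = {l2(P)/uniform_norm:.1f} x uniform")
```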
Proof Overview
Progress Function [Raz'17]: For layer L_i,
Z_i = Σ_{v ∈ L_i} Pr(v) · ⟨P_{x|v}, P_{x|s}⟩^k
• Z_0 = 2^{−2n·k}
• Z_i is very slowly growing: Z_i ≈ Z_0
• If s ∈ L_i, then Z_i ≥ Pr(s) · 2^{2ℓ·k} · 2^{−2n·k}
Hence: if s is significant, Pr(s) ≤ 2^{−Ω(k·ℓ)}
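The final implication follows by combining the bullets above; spelled out in the same notation (constants suppressed):

```latex
% If s \in L_i is significant, combine the lower bound on Z_i with Z_i \approx Z_0 = 2^{-2nk}:
\[
  \Pr(s)\cdot 2^{2\ell k}\cdot 2^{-2nk}
    \;\le\; Z_i \;\approx\; Z_0 \;=\; 2^{-2nk}
  \quad\Longrightarrow\quad
  \Pr(s) \;\le\; 2^{-2\ell k} \;=\; 2^{-\Omega(k\cdot\ell)} .
\]
```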
Hardness of Refuting CSPs [G-Kothari-Raz'20]
Refuting CSPs under the Streaming Model
• Fix a predicate P : {0,1}^k → {0,1}
• With bounded memory, distinguish between CSPs (with predicate P) where the constraints are randomly generated vs. ones where the constraints are generated using a satisfying assignment
• Constraints arrive one by one in a stream
Refutation is at least as hard as distinguishing between random CSPs and satisfiable ones.
Refuting CSPs under the Streaming Model
• x is chosen uniformly from {0,1}^n
• At each step i, S_i ⊆ [n] is a randomly chosen (ordered) subset of size k
• The distinguisher sees a stream of either (S_i, P(x_{S_i})) or (S_i, b_i), where b_i is chosen uniformly from {0,1}
• x_{S_i}: the projection of x to the coordinates of S_i
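A minimal sketch (illustrative names, not from the talk) of the two streams the distinguisher must tell apart, for an arbitrary predicate P:

```python
# Sketch of the two constraint streams: "planted" constraints generated from a hidden
# satisfying assignment x via the predicate P, versus purely random constraints.
import random

def csp_stream(n, k, P, num_constraints, planted, rng=random.Random(0)):
    x = [rng.randrange(2) for _ in range(n)]        # hidden assignment (used only if planted)
    for _ in range(num_constraints):
        S = rng.sample(range(n), k)                 # random ordered k-subset of [n]
        if planted:
            b = P(tuple(x[i] for i in S))           # P evaluated on x restricted to S
        else:
            b = rng.randrange(2)                    # uniformly random bit
        yield S, b

# Example with the XOR predicate (k-XOR):
xor = lambda bits: sum(bits) % 2
for S, b in csp_stream(n=10, k=3, P=xor, num_constraints=4, planted=True):
    print(S, b)
```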
Reduction from Learning from Sparse Equations
Any algorithm that distinguishes random k-XOR (random CSPs with the XOR predicate) on n variables from k-XOR CSPs with a satisfying assignment (under the streaming model) requires either Ω(n·k) memory or 2^{Ω(k)} constraints.
For k ≫ log n, refutation is hard even for algorithms using super-linear memory, when the number of constraints is < 2^{Ω(k)}.
Better Sample Bounds for Sub-Linear Memory
• Stronger lower bounds on the number of constraints needed to refute
• Even for CSPs with t-resilient predicates P [OW'14, AL'16, KMOW'17]
• P : {0,1}^k → {0,1} is t-resilient if P has zero correlation with every parity function on at most t − 1 bits
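Here is a small brute-force checker for this definition (my own sketch; the predicate names are illustrative): it finds the smallest parity a predicate correlates with, which equals the largest t for which the predicate is t-resilient.

```python
# Brute-force resilience checker: P is t-resilient if it has zero correlation with
# every parity on fewer than t bits.
import itertools

def correlation_with_parity(P, k, T):
    """Correlation of P with the parity of the bits indexed by T, over uniform {0,1}^k."""
    total = 0
    for z in itertools.product([0, 1], repeat=k):
        parity = sum(z[i] for i in T) % 2
        total += (-1) ** (P(z) ^ parity)
    return total / 2 ** k

def resilience(P, k):
    """Size of the smallest parity that P correlates with = largest t such that P is t-resilient."""
    for size in range(k + 1):
        for T in itertools.combinations(range(k), size):
            if abs(correlation_with_parity(P, k, T)) > 1e-12:
                return size
    return k + 1

xor3 = lambda z: sum(z) % 2
maj3 = lambda z: int(sum(z) >= 2)
print(resilience(xor3, 3))  # XOR on 3 bits correlates only with the full parity -> 3-resilient
print(resilience(maj3, 3))  # majority correlates with single bits -> 1-resilient
```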
Better Sample Bounds for Sub-Linear Memory
For every 0 < δ < 1, any algorithm that distinguishes random CSPs (with a t-resilient predicate P) on n variables from satisfiable ones (when the constraints arrive in a streaming fashion) requires either n^δ memory or n^{Ω(t·(1−δ))} constraints.
Thank You!