
Implications of Space-Bounded Computation

Sumegha Garg

a dissertation presented to the faculty of princeton university

in candidacy for the degree of Doctor of Philosophy

recommended for acceptance by the Department of Computer Science

Adviser: Mark Braverman

September 2020


© Copyright by Sumegha Garg, 2020. All rights reserved.


Abstract

The field of computational complexity theory studies the inherent difficulties of performing certain computational tasks with limited resources. While characterizing the minimum time required for a computational task has received more attention, understanding the memory requirements is as fundamental and fascinating. The focus of this thesis is understanding the limits of space-bounded computation and implications for various algorithmic problems.

1. Implications of bounded space for learning algorithms: [SVW16] and [Sha14] started the study of online learning under memory constraints. In a breakthrough result, [Raz16] showed that learning an unknown n-bit vector from random linear equations (in F_2) requires either Ω(n^2) space (memory) or 2^{Ω(n)} samples. Work in this thesis extends these memory-sample tradeoffs to a larger class of learning problems through an extractor-based approach, to the setting where the learner is allowed a second pass over the stream of samples, and to proving security of Goldreich's local pseudorandom generator against memory-bounded adversaries in the streaming model.

2. Implications of bounded space for randomized algorithms: It is largely unknown whether randomization gives space-bounded computation any advantage over deterministic computation. The current best hope of the community is to derandomize randomized log-space computation with one-sided error, that is, to prove RL = L. A work presented in this thesis, as a step towards answering this question, improved upon the state-of-the-art constructions of hitting sets, which are tools for derandomization.

3. Implications of bounded space for mirror games: In this thesis, we show the impossibility of winning the following streaming game under memory constraints. Alice and Bob take turns saying numbers belonging to the set {1, 2, ..., 2N}. A player loses if they repeat a number that has already been said. Bob, who goes second, has a memory-less strategy to avoid losing. Alice, however, needs at least Ω(N) memory to not lose.


Contents

Abstract iii

0 Introduction 1

0.1 Space-Bounded Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . 2

0.2 Space-Bounded Randomized Algorithms . . . . . . . . . . . . . . . . . . . . 6

0.3 Space Requirements of Mirror Game . . . . . . . . . . . . . . . . . . . . . . 10

0.4 Organisation of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1 Preliminaries 12

1.1 Space-Bounded Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.2 Read-Once Branching Programs . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Space-Bounded Learning Algorithms 18

2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2 Overview of the Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.3 Main Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

2.5 Generalization to Non-Product Distributions . . . . . . . . . . . . . . . . . 63

3 Two-Pass Space-Bounded Learning Algorithms 83


3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

3.2 Main Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

3.3 Overview of the Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

3.4 Proof of Main Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

3.5 Probability of Stopping at Significant Vertices . . . . . . . . . . . . . . . . . . 123

4 Security of Goldreich’s PRG against Space-Bounded Adversaries 139

4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

4.2 Overview of the Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

4.3 Time-Space Tradeoff through Reduction to Learning . . . . . . . . . . . . . . 149

4.4 Sample-Memory Tradeoffs for Resilient Local PRGs . . . . . . . . . . . . . . 157

5 Pseudorandom Pseudo-Distributions for ROBPs 172

5.1 Proof Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

5.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

5.3 Pseudorandom Pseudo-Distributions and Main Result . . . . . . . . . . . . 200

5.4 Matrix Bundle Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

5.5 Multiplication Rules for Matrix Bundle Sequences . . . . . . . . . . . . . . . 207

5.6 Leveled Matrix Representations . . . . . . . . . . . . . . . . . . . . . . . 226

5.7 The Family F(A,B) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

5.8 The Multiplication Rule for Leveled Matrix Representations . . . . . . . . . . 246

6 Space Complexity of the Mirror Game 260

6.1 The Mirror Game and Mirror Strategies . . . . . . . . . . . . . . . . . . . 260

6.2 Eventown and Oddtown . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

6.3 Randomized Strategies for Alice . . . . . . . . . . . . . . . . . . . . . . . . . 264

6.4 Deterministic Strategies of Alice Require Linear Space . . . . . . . . . . . . . 265

6.5 The (a, b)-Mirror Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

6.6 Open Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269


7 Conclusion 270

Appendix A Appendix for Chapter 3 272

References 289


Acknowledgments

The journey towards this thesis would not have been as fun and educational without the sup-

port of and interactions with a large number of people.

Firstly, I would like to thank the three professors, with whom I have interacted the most dur-

ing my PhD life. 1. I have been very fortunate to be advised by Mark Braverman. I am grateful

for his invaluable and to-the-point advice throughout my PhD, and his extraordinary insights

in the projects we tackled together. He never shies away from going into the technical nuances

of a problem and I wish to fully incorporate this skill in my research. 2. I feel very fortunate to

have worked with Ran Raz during my PhD. I am honored to have witnessed and learnt from his

focused and detailed approach to problem solving. I am grateful for his advice on academia over

the years. 3. I have been very fortunate to work and interact with Omer Reingold. Among many other things, I admire his approach of applying ideas from playback theatre to research, for example, the ‘yes and’ approach to collaboration. I am grateful to him for giving me the opportunity

to explore the field of algorithmic fairness and for generously introducing me to its community.

Next, I would like to thank my committee members – Mark Braverman, Gillat Kol, Ran Raz,

Matt Weinberg and Mark Zhandry for taking out time and energy for my defense.

I am grateful to have had the opportunity to collaborate with many passionate and com-

passionate academics – Mark Braverman, Gil Cohen, Michael Kim, Pravesh K. Kothari, Ran

Raz, Omer Reingold, Guy Rothblum, Jon Schneider, Ariel Schvartzman, Avishay Tal, David


Woodruff, Gal Yona, Henry Yuen, Wei Zhan and Mark Zhandry. It has been a great pleasure

learning and working with them, and this thesis would not have been possible without the joint

collaborations. I am grateful for the interactions and discussions at the monthly meetings of

RISE (Research Inclusion Social Event). With members of RISE, I felt comfortable voicing my

concerns during the different stages of my PhD life. I am also grateful to have been part of the Princeton WSTEM Leadership Council and I admire the various efforts by its members to foster inclu-

sion at the university. Next, I would like to thank all the Princeton University staff, especially

Nicki Mahler and Mitra Kelly, for their prompt and generous support in various administrative

tasks over the years.

I am also thankful to the Siebel Foundation and Microsoft Research for their scholarship and dissertation grant respectively. My research has been funded by Mark's grants from the Simons Foundation, the Packard Foundation and NSF grants CCF-1525342 and CCF-1933331.

The time spent at the Princeton CS department over the years would not have been as enjoyable

and stimulating without my office-mates and the friends made along the way. I would like to

thank all my office-mates and friends at the department for the numerous discussions and advice

– Matheus, Meryem, Vikram, Uma, Hedyeh, Niran, Cyril, Ariel, Qipeng, Fermi, Karan, Divya,

Nikunj, Mikhail, Raghuvansh, Rafael, Jon, Jieming, Gopi, Ko, Yatin, Naman, among others.

I am really grateful to all my friends for making my stay at Princeton so memorable and mean-

ingful – Akash, AkshayK, AkshayY, Arjun, Diksha, Divya, Niran, Nivedita, Pranav, Sanjay,

Sravya, Visu, Vivek, among others. There’s a famous quote by an unknown person – “there

are friends, there is family, and then there are friends that become family”. Next, I would like to

give special thanks to the friends that became my family over the years. “A friend knows the song in my heart and sings it to me when my memory fails” by Donna Roberts. I would use this quote to describe Niran, with whom I have shared all aspects of my life at Princeton, and I thank her for literally putting up with my bad memory. “’Tis the privilege of friendship to talk nonsense, and to

have her nonsense respected” by Charles Lamb. This one is for Divya and Sravya, for the numer-

ous conversations over the years, some heartwarming, many aimless, all enjoyable. I would like


to use the following quote, “it is not so much our friends’ help that helps us, as the confidence of

their help” by Epicurus, for Akash, AkshayK and Arjun. I am grateful for their constant support

and patience during the lows of my Princeton life. The following quote, “there is nothing better

than a friend, unless it is a friend with chocolate” by Linda Grayson, is apt for AkshayY, and I

thank him for the many delicious eggless baked goods over the years. “True friendship comes

when the silence between two people is comfortable” by David Tyson, reminds me of Sanjay,

with whom my friendship has been filled with homely silences since the start. I appreciate his up-for-anything attitude, especially when it comes to Bollywood movies and travel. The next

quote, “the most beautiful discovery true friends make is that they can grow separately without

growing apart” by Elisabeth Foley, reminds me of my friends from undergrad - Himanshu, Malti,

Misha and Surbhi. I thank them for taking me out of the Princeton bubble from time to time.

Most importantly, I would like to thank my parents - Devinder and Urmila, and my guru, for

all their love and support. I am what I am because of them. I cannot thank my brother (Ankit)

enough for his mentorship over the years. He made my life easier by doing the things I would

want to do but 4 years in advance. It is to my family and my guru that I dedicate this thesis.

I have grown both personally and professionally during the last 5 years. I thank everyone who

interacted with me even if for an hour and shared with me their knowledge and kindness.


The most difficult thing is the decision to

act. The rest is merely tenacity.

Amelia Earhart

0 Introduction

The field of complexity theory studies the inherent limits of a computational model for perform-

ing certain computational tasks. Most natural is to study restricted computational models that

are short of a computational resource, for example, time or memory. While characterizing the

minimum time required for a computational task has received more attention, understanding

the memory requirements is as fundamental and fascinating, owing to the surprising and non-

intuitive results that the field has seen. For example, NL = coNL; that is, for log-space bounded computations and the task of identifying whether an input x has a certain property C, if it is easy to certify that x has the property (x ∈ C), then it is also easy to certify that x ∉ C [Imm88]. Whereas, for time-bounded computations, it is widely believed that NP ≠ coNP, that is, a time-


efficient certification for x ∈ C doesn't imply a time-efficient certification for x ∉ C. The theme

of this thesis is understanding the limits of space-bounded computational models and the impli-

cations for algorithmic problems. In particular, the focus is on implications of bounded space

for learning algorithms, randomized algorithms and streaming games.

0.1 Space-Bounded Learning Algorithms

Machine learning has been a growing field since the 1960s with a wide variety of applications, for

example, in computer vision, speech recognition and recommendation systems. Learning theory

studies the feasibility of learning using a number of training examples under bounded time, such

that the learned algorithms generalize to new examples. With the increasing scale of problems,

it has become both practically and philosophically important to study the feasibility of learning

under bounded space.

Moreover, many classification problems these days are learned using a neural network, where essentially, only the neural network is stored in the memory and the weights of the network are updated with new samples (in small batches). The memory used is (essentially) the size of the neural network. A natural question to ask is whether a hypothesis class can be efficiently learned by such low-memory algorithms. [Sha14, SVW16] started the study of the feasibility of online learning under memory constraints. In a breakthrough result, [Raz16] showed that learning an unknown n-bit vector from random linear equations in F_2 requires 2^{Ω(n)} equations (training samples) when the learning algorithm has at most n^2/25 memory. In contrast, there exists an efficient learning algorithm that uses O(n^2) memory and O(n) equations. Thus, such a result can be interpreted as a

trade-off between the time and space (or memory and samples) required to learn.
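For concreteness, the following is a minimal Python sketch (added here for illustration; it is not taken from the thesis, and names such as learn_parity are hypothetical) of the natural memory-heavy learner mentioned above: it stores a reduced basis of the linear equations seen so far, which takes O(n^2) bits, and recovers x by Gaussian elimination over F_2 after roughly n independent equations.

```python
# Illustrative sketch, not from the thesis: the natural O(n^2)-memory learner
# for parity learning via online Gaussian elimination over F_2.
import random

def parity_sample(x):
    """One random linear equation (a, b) over F_2 with b = <a, x> mod 2."""
    a = [random.randint(0, 1) for _ in range(len(x))]
    return a, sum(ai & xi for ai, xi in zip(a, x)) % 2

def learn_parity(n, sample):
    """Stores a fully reduced basis of seen equations (O(n^2) bits of memory)."""
    basis = {}  # pivot column -> reduced augmented row a + [b]
    while len(basis) < n:
        a, b = sample()
        row = a + [b]
        for p, prow in basis.items():        # reduce the incoming equation
            if row[p] == 1:
                row = [r ^ s for r, s in zip(row, prow)]
        support = [i for i in range(n) if row[i] == 1]
        if not support:
            continue                          # linearly dependent; discard
        p = support[0]
        for q in basis:                       # keep the basis fully reduced
            if basis[q][p] == 1:
                basis[q] = [r ^ s for r, s in zip(basis[q], row)]
        basis[p] = row
    return [basis[i][n] for i in range(n)]    # row for pivot i now reads x_i = b

secret = [random.randint(0, 1) for _ in range(32)]
assert learn_parity(32, lambda: parity_sample(secret)) == secret
```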

Time-space tradeoffs have been extensively studied in complexity theory, such as in [Ajt99,

BSSV03], that construct boolean functions f : {0,1}^n → {0,1}, such that any algorithm requires either n^{1−ε} memory (for a small constant ε > 0) or super-linear time to com-

pute f, whereas f is efficiently computable without the memory constraints. Another line of

work, [For00, FLVMV05, Wil06], shows similar tradeoffs for problems in NP (problems that


are efficiently computable in nondeterministic time). Both these lines of work obtain less-than-quadratic lower bounds on the time needed for computation under memory constraints. In contrast, the field of online learning under memory constraints has seen exponential lower bounds

on the time needed for learning – because, for computing a function, one assumes unrestricted

access to the input, whereas for online learning, once the learner sees a sample, she cannot access

the sample again unless it is stored in the memory.

Starting with [Sha14, SVW16, Raz16], there has been a lot of recent progress [VV16, KRT17,

Raz17, MM17, MT17, MM18, BGY18, GRT18, DS18, SSV19, GRT19, DGKR19, GKR20]

in proving sample lower bounds for memory-bounded learning algorithms. In follow-up

works, memory-sample lower bounds have been generalized to many learning problems such as

sparse parities [KRT17], linear-size DNF formulas [KRT17], low degree polynomials [BGY18,

GRT18], learning from sparse linear equations and low-degree polynomial equations [GRT18],

real-valued learning (linear regression) [SSV19], learning when the learner is allowed a second

pass over the samples [GRT19], distribution testing [DGKR19] and more general distinguish-

ing problems [GKR20].

Contribution 1 The first main contribution of this thesis is an extractor-based approach to

proving memory-sample trade-offs for a large class of learning problems. This is joint work with

Ran Raz and Avishay Tal [GRT18].

As in [Raz17], we represent a learning problem by a matrix. Let X, A be two finite sets of size larger than 1 (where X represents the concept-class (or hypothesis class) that we are trying to learn and A represents the set of possible samples). Let M : A × X → {−1, 1} be a matrix. M corresponds to the following learning problem: an unknown element x ∈ X is chosen uniformly at random. A learner tries to learn x from a stream of samples, (a_1, b_1), (a_2, b_2), . . ., where for every i, a_i ∈ A is chosen uniformly at random and b_i = M(a_i, x).

Assume that k, ℓ, r are such that any submatrix of M of at least 2^{−k} · |A| rows and at least 2^{−ℓ} · |X| columns has a bias of at most 2^{−r}, that is, the average of the entries of the submatrix has absolute value at most 2^{−r}. We show that any learning algorithm for the learning problem corresponding to M requires either Ω(k · ℓ) memory, or at least 2^{Ω(r)} samples. The result holds even if the learner has an exponentially small success probability (of 2^{−Ω(r)}).

In particular, this shows that for a large class of learning problems, any learning algorithm

requires either a memory of size at least Ω ((log |X|) · (log |A|)) or an exponential number of

samples, achieving a tight Ω((log |X|) · (log |A|)) lower bound on the size of the memory, rather than a bound of Ω(min{(log |X|)^2, (log |A|)^2}) obtained in previous works [Raz17, MM18].

The proof builds on [Raz17] that gave a general technique for proving memory-samples

lower bounds. We prove this result in Chapter 2.

In the next result of this thesis, we generalize the memory-sample lower bounds to when

the learner is allowed a second pass over the stream of samples. The only other previous work

that proved memory-samples lower bounds for more than one pass over the stream of samples,

is [DS18], which uses very different techniques from communication complexity and proves at

most a polynomial lower bound on the number of samples.

Contribution 2 The next main contribution of this thesis is proving the first memory-samples lower bound (with a super-linear lower bound on the memory size and super-polynomial lower bound on the number of samples) for learning, when the learner is allowed two passes over the stream of samples. For example, we prove that any two-pass algorithm for parity learning, that is, learning an unknown n-bit vector from random linear equations in F_2, requires either a memory of size Ω(n^{1.5}) or at least 2^{Ω(√n)} samples. This is joint work with Ran Raz and Avishay Tal [GRT19].

As before, we consider the learning problem corresponding to the matrix M. Assume that

k, ℓ, r are such that any submatrix of M of at least 2^{−k} · |A| rows and at least 2^{−ℓ} · |X| columns has a bias of at most 2^{−r}. We show that any two-pass learning algorithm for the learning problem corresponding to M requires either a memory of size at least Ω(k · min{k, √ℓ}), or at least 2^{Ω(min{k, √ℓ, r})} samples.

The proof builds on the works of [Raz17, GRT18]. However, these works are heavily based

on the fact that in the one-pass case, all the samples are independent and hence at each time step,

the learning algorithm has no information about the next sample that it is going to see, which is

not the case for the second pass. We prove this result in Chapter 3.

Another application of such memory-sample trade-offs for learning is in the field of bounded-

storage cryptography. First introduced by [Mau92], bounded-storage cryptography studies pro-

tocols that are secure against memory-bounded adversaries [CM97, AR99, ADR02, Vad03,

DM04, TT18]. Using the trade-off for parity learning, [Raz16] provided an encryption scheme

that uses a private key of length n, takes n time to encrypt and decrypt each bit, and which is unconditionally secure against attackers with memory less than n^2/25 bits, as long as the scheme

is used at most an exponential number of times. This was the first work to prove security, for

a cryptographic protocol, against an attacker that uses super-linear memory (super-linear in the

time needed to encrypt/decrypt). Furthermore, using the same result, [GZ19] constructed new

protocols for two-party key agreement, bit commitment, and oblivious transfer in the bounded-

storage model. In the next result of the thesis, we generalize the memory-sample trade-offs to

distinguishing distributions with new applications to bounded-storage cryptography.

Contribution 3 The next main contribution of this thesis is establishing sample lower bounds for memory-bounded algorithms that distinguish between natural pairs of related dis-

tributions, using samples that arrive in a streaming setting. This is joint work with Pravesh K.

Kothari and Ran Raz [GKR20].

Firstly, we show that any algorithm that distinguishes between the uniform distribution on {0,1}^n and the uniform distribution on an n/2-dimensional linear subspace of {0,1}^n with non-negligible advantage requires either Ω(n^2) memory or 2^{Ω(n)} samples (almost tight).

Then, we prove lower bounds for distinguishing outputs of Goldreich's local pseudorandom generator [Gol00] from the uniform distribution on the output domain. Specifically, Goldreich's pseudorandom generator G fixes a predicate P : {0,1}^k → {0,1} and a collection of subsets S_1, S_2, . . . , S_m ⊆ [n] of size k. For any seed x ∈ {0,1}^n, it outputs P(x_{S_1}), P(x_{S_2}), . . . , P(x_{S_m}), where x_{S_i} is the projection of x to the coordinates in S_i. We consider the streaming version of the pseudorandom generator and prove that, whenever P is t-resilient (all non-zero Fourier coefficients of (−1)^P are of degree t or higher), then no algorithm with < n^ε memory (0 < ε < 1) can distinguish the output of G from the uniform distribution on {0,1}^m, for stretch m < (n/t)^{(1−ε)t/36}. In the streaming model, at each time step i, S_i ⊆ [n] is a randomly chosen (ordered) subset of size k and the distinguisher sees either P(x_{S_i}) or a uniformly random bit, along with S_i.
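The following Python sketch (illustrative only; it is not the thesis's formal definition, and the predicate and names are assumptions made here) shows what a stream in this model looks like: at each step the distinguisher receives a random ordered subset S_i together with either P(x_{S_i}) or a uniformly random bit. The predicate used is the common XOR-AND predicate; no resiliency parameter is asserted for it here.

```python
# Illustrative sketch of the streaming model described above (not from the thesis).
import random

def xor_and_predicate(z):
    """P(z) = z1 xor z2 xor z3 xor (z4 and z5), a common choice of local predicate."""
    return z[0] ^ z[1] ^ z[2] ^ (z[3] & z[4])

def streaming_samples(n, k, m, pseudorandom, predicate=xor_and_predicate):
    """Yield m pairs (S_i, bit); bits come from a hidden seed x when pseudorandom."""
    x = [random.randint(0, 1) for _ in range(n)]
    for _ in range(m):
        s = random.sample(range(n), k)          # random ordered subset of size k
        if pseudorandom:
            bit = predicate([x[j] for j in s])  # P(x_{S_i})
        else:
            bit = random.randint(0, 1)          # truly uniform bit
        yield s, bit

# A memory-bounded distinguisher would read this stream one pair at a time.
for s, b in streaming_samples(n=64, k=5, m=3, pseudorandom=True):
    print(s, b)
```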

Thus, in the streaming model, we rule out memory-bounded attacks on Goldreich's generator, even for super-linear stretch, whenever the predicate P satisfies the t-resiliency property, identified

as “necessary” in various prior works [AL16, KMOW17].

The proofs build on [Raz17, GRT18]. We prove the results in Chapter 4.

0.2 Space-Bounded Randomized Algorithms

In this section, we investigate the implications of bounded space for randomized algorithms –

whether randomization gives space-bounded algorithms any advantage over deterministic com-

putation. Randomness is prevalent across scientific disciplines and understanding the role of

randomness in computation is an important part of complexity theory. While randomness is

provably necessary in many computational settings such as cryptography, distributed comput-

ing, and interactive proofs (see [Wig19]), by now it is widely believed that randomness adds no

computational power to time-bounded nor to space-bounded algorithms. Surprisingly, proving

such a statement for time-bounded algorithms implies circuit lower bounds which seem to be

out of reach of current proof techniques [NW94, IKW02, KI04]. On the other hand, there is no

known barrier for proving such a statement in the space-bounded setting.

While we cannot even rule out that randomness could speed up time-bounded computation ex-

ponentially, the space-bounded setting is much better understood. Savitch’s theorem [Sav70]


already implies that any randomized algorithm with one-sided error (in RL) can be simulated

deterministically with only a quadratic overhead in space. BPL ⊆ L^2 (any O(log n)-space randomized algorithm that correctly accepts or rejects the input with probability 2/3 can be simulated by an O(log^2 n)-space deterministic algorithm) can be proved easily through a variant of Savitch's theorem and also follows from [BCP83]. [Nis92, Nis94] proved that BPL ⊆ DTISP(poly(n), log^2(n)) using pseudorandom generators. The state-of-the-art result was obtained by [SZ99], which builds on Nisan's work to simulate space-s randomized algorithms with two-sided error by deterministic space-O(s^{3/2}) computation, thus establishing that BPL ⊆ L^{3/2}.

Another celebrated result in the field of space-bounded computation is that of [Rei08],

which proved SL = L, that is, connectivity on undirected graphs can be solved by O(log n)-

space deterministic computation. Proving such a statement on directed graphs would imply

NL = L and approximating random walks on directed graphs would imply BPL = L. There

has been much work on the study of derandomizing space-bounded computation (for example,

[NSW92, ATSWZ00, RR99, Tri08, DSTS17, MRSV17, AKM+19, CH20]).

0.2.1 Pseudorandom Distributions for ROBPs

Space-bounded algorithms are typically studied by considering their non-uniform counterparts.

A length n, width w read-once branching program (ROBP) is a directed graph whose nodes,

called states, are partitioned into n layers, each with at most w states, as well as an additional “start”

state. The last layer consists of 2 states called “accept” and “reject”. From every state but for the

latter two, there are two outgoing edges, labeled by 0 and 1, to the following layer. On input

x ∈ {0,1}^n, the computation proceeds by following the edges labelled by the bits of x, starting from the start state (using x_i to go from layer i to layer i + 1). The string x is accepted by the

program if the computation ends in the accept state.
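A loose Python encoding of this model (illustrative; the representation and names are chosen here and are not from the thesis) may help fix ideas: each layer stores, for every state, the two outgoing edges, and evaluation just follows the edges labelled by the bits of x.

```python
# Illustrative encoding of a length-n, width-w ROBP: delta[i][state][bit] is the
# state reached in layer i+1 when reading bit x_i from `state` in layer i.
from typing import List

def run_robp(delta: List[List[List[int]]], x: List[int], start: int = 0) -> int:
    """Follow the edges labelled by the bits of x and return the final state."""
    state = start
    for i, bit in enumerate(x):
        state = delta[i][state][bit]
    return state

# Width-2 example computing the parity of a 3-bit input:
# state 0 = "even so far", state 1 = "odd so far".
parity = [[[0, 1], [1, 0]] for _ in range(3)]
assert run_robp(parity, [1, 0, 1]) == 0   # even parity -> state 0
assert run_robp(parity, [1, 1, 1]) == 1   # odd parity  -> state 1
```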

It is easy to see ([AB09]) that any space-s randomized computation (on a given input) in the Turing model, which uses n = 2^{O(s)} random bits, can be simulated by a length n, width w ROBP with w = 2^{O(s)}. Thus, one approach to derandomize two-sided error space-bounded algorithms


is to construct, in bounded space, a distribution of small support that “looks random” to any such ROBP (and hence can be used to replace the random bits). We say that a distribution D on n-bit strings is (n, w, ε)-pseudorandom if for every length n, width w ROBP, a path (string) that is sampled from D has, up to an additive error ε, the same probability to end in the accept state as a truly random path. A truly random path corresponds to a path picked uniformly at random from the 2^n possible paths. An (n, w, ε)-pseudorandom generator (PRG) is an algorithm ({0,1}^l → {0,1}^n) that takes in l bits and outputs n bits such that the output distribution (on a uniformly random seed) is (n, w, ε)-pseudorandom. The seed length of a PRG is

the number of random bits it requires to generate the distribution, that is, l. Informally, we call

the PRG explicit, if each output bit can be computed, given the input and the index, in O(log n)-

space.

Derandomizing using an explicit pseudorandom distribution is straightforward. By iterating

over all the paths in the support of the distribution (2^l of them) and calculating the fraction that end in the accept state, one obtains an ε-approximation to the true probability of reaching the accept state. For ε < 1/3, such an approximation is enough to distinguish between ≤ 1/3 and

≥ 2/3 acceptance probabilities of the randomized algorithm. The support size being small (or,

equivalently, the seed l being short) allows one to perform such an iteration in space-l.
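A minimal sketch of this naive derandomization, written here for illustration with hypothetical names (expand for the generator and accepts for the ROBP's accept predicate):

```python
# Illustrative sketch (not from the thesis): enumerate all 2^l seeds, expand each
# into an n-bit path, and return the fraction of paths the ROBP accepts. When the
# generated distribution is (n, w, eps)-pseudorandom, this fraction is within eps
# of the true acceptance probability, which for eps < 1/3 separates <=1/3 from >=2/3.
from typing import Callable, List

def estimate_acceptance(l: int,
                        expand: Callable[[int], List[int]],
                        accepts: Callable[[List[int]], bool]) -> float:
    accepted = 0
    for seed in range(2 ** l):               # iterate over the whole support
        if accepts(expand(seed)):
            accepted += 1
    return accepted / (2 ** l)

# Toy usage: the "generator" just writes out the seed's bits (l = n), and the
# "ROBP" accepts strings whose first bit is 1; the estimate is exactly 1/2.
n = l = 4
expand = lambda seed: [(seed >> i) & 1 for i in range(n)]
print(estimate_acceptance(l, expand, lambda path: path[0] == 1))
```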

One can prove the existence of an (n, w, ε)-PRG with seed length O(log(nw/ε)). The proof is non-constructive and hence the PRG isn't explicit. Nisan [Nis92] gave an explicit construction of a PRG with seed length O(log n · log(nw/ε)). Setting n, w = 2^{Θ(s)} and ε to a small constant, the seed length is O(s^2), which yields derandomization with quadratic overhead in space. [SZ99] applied Nisan's generator in a far more sophisticated way than the naive derandomization to obtain their result.

There has been much success in constructing PRGs for restricted types of ROBPs (see,

e.g., [INW94, NZ96, ATSWZ00, RTV06, BPW11, Ste12, BPW12, KNP11, De11,

IMZ12, GMR+12, GMRZ13, RSV13, SVW14, BRRY14, GV17, FK18, MRT19] and refer-

ences therein). However, for unrestricted ROBPs, no improvement over Nisan’s generator has


been made, for the general regime of parameters.

While pseudorandom distributions are suitable for derandomizing two-sided error random-

ized algorithms, hitting sets are suitable for one-sided error. An (n,w, ε)-hitting set is a set of

n-bit strings such that for every length n, width w ROBP, whenever a truly random path ends

in the accept state with probability at least ε, then there exists a path in the set that ends at the

accept state. Hitting sets can be used to derandomize RL. Even for the deceptively simple look-

ing problem of constructing hitting sets for width w = 3 ROBPs, no progress was made for

nearly two decades, until the works of [ŠŽ11, GMR+12]. In particular, [GMR+12] construct

near-optimal hitting sets in that setting. The best known explicit hitting set (up to polyloglog factors), for the general regime of parameters, was in fact Nisan's PRG [Nis92] (with seed length

O(log n · log(nw/ε))) until the following work.

Contribution 4 The next main contribution of this thesis is constructing a hitting set with seed length O(log^2 n + log n · log w + log(1/ε)) (obtaining near-optimal dependence on ε). This is joint work with Mark Braverman and Gil Cohen [BCG18].

The regime of parameters in which our construction strictly improves upon prior works,

namely, log(1/ε) ≫ log n, is also motivated by [SZ99], which uses error-ε pseudorandom gen-

erators for length-n width-w read-once branching programs, such that w, 1/ε = 2^{(log n)^2}, to prove BPL ⊆ L^{3/2}.

In fact, we introduce and construct a new derandomization tool – pseudorandom pseudo-

distributions. Informally, this is a generalization of pseudorandom generators, in which one

may assign negative and unbounded weights to paths as opposed to working with probability

distributions. We show that pseudorandom pseudo-distributions yield hitting sets, and can be

used to derandomize two-sided error algorithms. We give the construction in Chapter 5.

Subsequently, the above work inspired simpler constructions of hitting sets and pseudoran-

dom pseudo-distributions that obtain optimal dependence on error (improving the seed length,


given above, by polyloglog factors). [HZ18] gave a much simpler construction for hitting sets

and [CL20] simplified the above construction for pseudorandom pseudo-distributions.

0.3 Space Requirements of Mirror Game

In this section, we investigate the implications of bounded-space for the following streaming

game between two players Alice and Bob, which we call the mirror game.

Alice and Bob take turns saying numbers belonging to the set {1, 2, . . . ,N}. If either player

says a number that has previously been said, they lose. Otherwise, after N turns, all the numbers

in the set have been spoken aloud, and both players win. Alice says the first number.

If N is even, there is a very simple and low-memory (O(log N) memory bits) strategy that allows Bob to win the game, regardless of Alice's strategy: whenever Alice says x, Bob replies with N + 1 − x. This is an example of a mirror strategy (and for this reason, we refer to the game above as the mirror game). Mirror strategies are an effective tool for figuring out who wins in a

variety of combinatorial games (for example, two-pile Nim [BCG03]).
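A small Python sketch of the mirror strategy (illustrative; not from the thesis) makes the point: Bob's reply depends only on N and Alice's current move, while the simulation also shows that an Alice who forgets Bob's replies can lose.

```python
# Illustrative simulation of the mirror game for even N (not from the thesis).
def bob_mirror_reply(N, x):
    """Bob's memoryless reply: mirror Alice's move within {1, ..., N}."""
    return N + 1 - x

def play(N, alice_moves):
    """Alice's moves are given in order; Bob mirrors after each one."""
    said = set()
    for x in alice_moves:
        if x in said:
            return "Alice loses"
        said.add(x)
        y = bob_mirror_reply(N, x)
        if y in said:
            return "Bob loses"     # cannot happen when N is even
        said.add(y)
    return "all numbers said" if len(said) == N else "game still in progress"

print(play(10, [3, 1, 2, 4, 5]))   # -> all numbers said
print(play(10, [3, 8]))            # a forgetful Alice repeats Bob's reply and loses
```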

This leads to the following natural question: can Alice, with bounded memory, avoid losing when N is even? Since both players have access to the same set of actions, one may be tempted to believe that the answer is yes - in fact, if N is odd, then Alice can start by saying the number N and then adopt the mirror strategy described above (for Bob) for a set of N − 1 elements. However, when N is even, the mirror strategy as stated does not work.

In the next result of the thesis, we answer the question in the negative: any successful, deter-

ministic strategy for Alice requires at least Ω(N) bits of memory.

Contribution 5 The next main contribution of the thesis is proving, for the mirror game

defined above, that if N is even, then any winning strategy for Alice requires at least (log_2 5 − 2)N − o(N) bits of space. This is joint work with Jon Schneider [GS18].

While many tools exist in the computer science literature for showing space lower bounds

(such as communication complexity, information theory), one interesting feature of this


problem, absent from many others, is that any space lower bound such as above must depend

crucially on the parity of N. We prove the lower bound in Chapter 6.

In [GS18], we additionally demonstrate a randomized strategy for Alice that wins with high

probability (1 − 1/N) and requires only O(√N) space (provided that Alice has access to a ran-

dom matching on K_N – a complete graph on N vertices). Subsequently, [Fei19] modified Alice's randomized strategy such that it uses only O(log^3 N) space. There hasn't been much work on

proving space complexity of winning strategies for streaming games. [CC17] proves time-space

tradeoffs for a certain memory game, but the game and the techniques are very different.

0.4 Organisation of the Thesis

Chapter 1 establishes notations and covers the necessary preliminaries. We present the extractor-

based approach to memory-sample trade-offs for learning in Chapter 2. We generalize the

memory-sample trade-offs for two-pass learning in Chapter 3. We present the applications of

memory-sample trade-offs to the security of Goldreich's local pseudorandom generator in Chapter 4. We construct pseudorandom pseudo-distributions for read-once branching programs, with near-optimal dependence on the error parameter, in Chapter 5. We establish the space complexity of

the mirror game in Chapter 6. Finally, we conclude in Chapter 7.


Optimism is the faith that leads to

achievement.

Helen Keller

1 Preliminaries

All logarithms (log) in the thesis are of base 2. We use [n] to denote the set {1, 2, . . . , n}, for

any positive integer n. Throughout the thesis, we use Big O notation to describe the growth

of a quantity (for example, space, samples, seed length). Let f, g : R → R be two real-valued functions. Then, we say f(x) = O(g(x)) if and only if there exists a positive real number M and a real number x_0 such that

f(x) ≤ M · g(x) for all x ≥ x_0.

We say f(x) = Ω(g(x)) if and only if g(x) = O(f(x)). We say f(x) = Θ(g(x)) if and only if

f(x) = O(g(x)) and g(x) = O(f(x)).
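A quick worked instance of this definition (added for illustration):

```latex
% Worked example (illustrative). Take f(x) = 3x^2 + 5x and g(x) = x^2.
% With M = 4 and x_0 = 5,
\[
  f(x) = 3x^2 + 5x \;\le\; 4x^2 = M \cdot g(x) \qquad \text{for all } x \ge x_0 = 5,
\]
% since 5x <= x^2 whenever x >= 5; hence f(x) = O(g(x)). Conversely,
% g(x) <= f(x) for all x >= 0, so g(x) = O(f(x)) and therefore f(x) = Theta(x^2).
```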

Next, we give a brief overview of the complexity classes mentioned in the thesis. A language


L is a subset of binary strings (⊆ {0,1}^*). We say an algorithm identifies L if it outputs 1 on all

strings belonging to L and 0 otherwise. A complexity class is usually defined as a set of languages

that satisfy certain properties, such as follows.

1. P contains all languages that can be identified by a deterministic Turing machine using

a polynomial amount of time, that is, O(n^c) time for every n-bit input string (where c is a

constant independent of n).

2. NP contains all languages that can be identified in polynomial time by a non-deterministic

Turing machine. In other words, a language L is in NP if for all x ∈ L, there exists a proof of the fact that x is in L, which is verifiable in polynomial time. coNP is the set of languages L such that the complement of L ({0,1}^* \ L) is in NP.

3. L is the set of languages that can be identified by a deterministic Turing machine using

O(log n) amount of writable memory space for every n-bit input string (where the input

is stored in a separate read only tape). NL contains all languages that can be identified in

logarithmic space by a non-deterministic Turing machine, and coNL is defined as before.

4. BPL contains all languages L for which there exists a probabilistic Turing machine M that uses O(log n) writing space (and polynomial time), such that if x ∈ L, then M accepts x (outputs 1) with probability at least 2/3 and if x ∉ L, M accepts x with probability at most 1/3. RL contains all languages L that can be identified by an O(log n)-space polynomial-time probabilistic Turing machine M with one-sided error. That is, if x ∈ L, then M accepts with probability at least 1/2 and if x ∉ L, M accepts x with probability 0.

5. DTISP(poly(n), log^2(n)) contains all languages that can be identified by a deterministic Turing machine using polynomial time and O(log^2(n)) writing space.


1.1 Space-Bounded Learning

In this section, we establish certain basic preliminaries and notations for Chapters 2, 3 and 4. We use U_X : X → R^+ to denote the uniform distribution over X. We use (n choose ≤ k) to denote (n choose 0) + (n choose 1) + . . . + (n choose k). For a random variable Z and an event E, we denote by P_Z the distribution of the random variable Z, and we denote by P_{Z|E} the distribution of the random variable Z conditioned on the event E.

1.1.1 Viewing a Learning Problem as a Matrix

Let X, A be two finite sets of size larger than 1. Let n = log_2 |X|. Let M : A × X → {−1, 1} be a matrix. The matrix M corresponds to the following learning problem: There is an unknown element x ∈ X that was chosen uniformly at random. A learner tries to learn x from samples (a, b), where a ∈ A is chosen uniformly at random and b = M(a, x). That is, the learning algorithm is given a stream of samples, (a_1, b_1), (a_2, b_2), . . ., where each a_t is uniformly distributed and for every t, b_t = M(a_t, x).
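As a concrete illustration (not part of the thesis's text; the instantiation and names are chosen here), the parity learning problem fits this template with X = A = {0,1}^n and M(a, x) = (−1)^{⟨a,x⟩ mod 2}; the sketch below generates the corresponding sample stream.

```python
# Illustrative instantiation of the abstract setup above with parity learning.
import random

def M(a, x):
    """Entry of the learning matrix for the parity problem: (-1)^{<a, x> mod 2}."""
    return (-1) ** (sum(ai & xi for ai, xi in zip(a, x)) % 2)

def sample_stream(x, m):
    """Stream of m samples (a_t, b_t) for the unknown x, with a_t uniform over A."""
    n = len(x)
    for _ in range(m):
        a = tuple(random.randint(0, 1) for _ in range(n))
        yield a, M(a, x)

x = tuple(random.randint(0, 1) for _ in range(8))   # unknown element of X
for a, b in sample_stream(x, 3):
    print(a, b)
```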

1.1.2 Norms and Inner Products

Let p ≥ 1. For a function f : X → R, denote by ∥f∥_p the ℓ_p norm of f, with respect to the uniform distribution over X, that is:

∥f∥_p = ( E_{x ∈_R X} [ |f(x)|^p ] )^{1/p}.

For two functions f, g : X → R, define their inner product with respect to the uniform distribution over X as

⟨f, g⟩ = E_{x ∈_R X} [ f(x) · g(x) ].

For a matrix M : A × X → R and a row a ∈ A, we denote by M_a : X → R the function

14

Page 25: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

corresponding to the a-th row of M. Note that for a function f : X → R, we have

⟨M_a, f⟩ = (M · f)_a / |X|.

1.1.3 L_2-Extractors and L_∞-Extractors

We use a certain extractor property of the matrix M to obtain the memory-sample lower bounds

in the thesis.

Definition 1.1.1. L_2-Extractor: Let X, A be two finite sets. A matrix M : A × X → {−1, 1} is a (k, ℓ)-L_2-Extractor with error 2^{−r}, if for every non-negative f : X → R with ∥f∥_2 / ∥f∥_1 ≤ 2^ℓ there are at most 2^{−k} · |A| rows a in A with

|⟨M_a, f⟩| / ∥f∥_1 ≥ 2^{−r}.

Let Ω be a finite set. We denote a distribution over Ω as a function f : Ω → R^+ such that ∑_{x∈Ω} f(x) = 1. We say that a distribution f : Ω → R^+ has min-entropy k if for all x ∈ Ω, we have f(x) ≤ 2^{−k}.

Definition 1.1.2. L_∞-Extractor: Let X, A be two finite sets. A matrix M : A × X → {−1, 1} is a (k, ℓ ∼ r)-L_∞-Extractor if for every distribution p_x : X → R^+ with min-entropy at least (log(|X|) − ℓ) and every distribution p_a : A → R^+ with min-entropy at least (log(|A|) − k),

| ∑_{a′∈A} ∑_{x′∈X} p_a(a′) · p_x(x′) · M(a′, x′) | ≤ 2^{−r}.
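As an informal numeric illustration of these definitions (added here; not from the thesis, and checking random rather than worst-case submatrices), the sketch below estimates the bias of a random large submatrix of the inner-product matrix, which is typically close to 0.

```python
# Illustrative check: for M(a, x) = (-1)^{<a, x>} with n = 8 (|A| = |X| = 256),
# draw random row/column subsets of relative size 2^{-k} and 2^{-l} and print the
# average entry. The extractor definitions above concern all such submatrices,
# not just random ones; this is only a sanity check of the typical behaviour.
import itertools, random

n, k, l = 8, 2, 2
domain = list(itertools.product([0, 1], repeat=n))

def M(a, x):
    return (-1) ** (sum(ai & xi for ai, xi in zip(a, x)) % 2)

rows = random.sample(domain, len(domain) // 2 ** k)   # a 2^{-k} fraction of the rows
cols = random.sample(domain, len(domain) // 2 ** l)   # a 2^{-l} fraction of the columns
bias = sum(M(a, x) for a in rows for x in cols) / (len(rows) * len(cols))
print(f"empirical bias of a random {len(rows)} x {len(cols)} submatrix: {bias:+.4f}")
```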

1.1.4 Branching Program for a Learning Problem

In the following definition, we model the learner for the learning problem that corresponds to

the matrix M, by a branching program.

Definition 1.1.3. Branching Program for a Learning Problem: A branching program of length

m and width d, for learning, is a directed (multi) graph with vertices arranged in m + 1 layers


containing at most d vertices each. In the first layer, that we think of as layer 0, there is only one

vertex, called the start vertex. A vertex of outdegree 0 is called a leaf. All vertices in the last layer are

leaves (but there may be additional leaves). Every non-leaf vertex in the program has 2|A| outgoing

edges, labeled by elements (a, b) ∈ A× {−1, 1}, with exactly one edge labeled by each such (a, b),

and all these edges going into vertices in the next layer. Each leaf v in the program is labeled by an

element x̃(v) ∈ X, that we think of as the output of the program on that leaf.

Computation-Path: The samples (a_1, b_1), . . . , (a_m, b_m) ∈ A × {−1, 1} that are given as input, define a computation-path in the branching program, by starting from the start vertex and following at step t the edge labeled by (a_t, b_t), until reaching a leaf. The program outputs the label x̃(v) of the leaf v reached by the computation-path.

Success Probability: The success probability of the program is the probability that x̃ = x, where x̃ is the element that the program outputs, and the probability is over x, a_1, . . . , a_m (where x is uniformly distributed over X and a_1, . . . , a_m are uniformly distributed over A, and for every t, b_t = M(a_t, x)).
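The following toy Python sketch (illustrative; not the thesis's formal object) shows how a streaming learner given by a memory-update map and an output map induces such a branching program, with the reachable memory states playing the role of the vertices in each layer; it estimates the success probability of a brute-force learner for a tiny instance.

```python
# Illustrative sketch: a streaming learner as (update, output) maps; the reachable
# memory states after t samples correspond to the vertices of layer t. The toy
# learner keeps the set of x's consistent with the samples so far (tiny n only).
import itertools, random

n, m_samples, trials = 4, 8, 2000
X = list(itertools.product([0, 1], repeat=n))        # concept class (also used as A)

def M(a, x):
    return (-1) ** (sum(ai & xi for ai, xi in zip(a, x)) % 2)

def update(state, a, b):
    """Memory update on seeing the sample (a, b): keep the x's still consistent."""
    return frozenset(x for x in state if M(a, x) == b)

def output(state):
    """Label of the leaf reached: any remaining consistent candidate."""
    return next(iter(state))

wins = 0
for _ in range(trials):
    x = random.choice(X)                              # unknown x, uniform over X
    state = frozenset(X)                              # start vertex
    for _ in range(m_samples):                        # follow the computation-path
        a = random.choice(X)
        state = update(state, a, M(a, x))
    wins += (output(state) == x)
print(f"estimated success probability: {wins / trials:.3f}")
```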

1.2 Read-Once Branching Programs

In this section, we establish certain basic preliminaries for Chapter 5 and define pseudorandom

generators and hitting sets for read-once branching programs. For an integer n ≥ 1, we use U_n to denote the uniform distribution over n-bit strings. We next define read-once branching programs (ROBPs), but the reader should not confuse the definition with the branching program (BP) defined in Section 1.1.4 for a learning problem. Although the underlying computational model

is the same, the input domain is very different for the two branching programs and throughout

the thesis, we will distinguish them using the acronyms BP and ROBP.

Definition 1.2.1. Let n,w ≥ 1 be integers. An (n,w)-read-once branching program (ROBP for

short) P is a directed graph on the vertex set V = {s} ∪ ⋃_{i=1}^{n} P_i, where the P_i's are disjoint sets of size w each. We refer to P_i as layer i of the program P. From every node but for those that belong to P_n there are two outgoing edges, labeled by 0 and 1. The pair of edges from s ends in P_1 and for every 1 ≤ i < n and v ∈ P_i, the pair of edges going out of v end in nodes that belong to P_{i+1}. There are no edges leaving P_n. The node s is called the start node of the program P.

Given a string p ∈ {0, 1}^ℓ, with ℓ ≤ n, we denote by P(p) the node that is reached by traversing the ROBP P according to the path p starting at the start node. The set of all (n, w)-ROBPs is denoted by P_{w,n}.

Given any distribution D on {0, 1}^n, we use P(D) to denote the distribution over v ∈ P_n when the input to P is distributed according to D.

Definition 1.2.2 (Hitting sets). A set {p_1, . . . , p_{2^s}} ⊆ {0, 1}^n is an (n, w, ε)-hitting set if for every P ∈ P_{w,n} and node v ∈ P_n for which Pr[P(U_n) = v] ≥ ε, there exists j ∈ [2^s] such that P(p_j) = v.

It is sometimes convenient to address the function that generates the hitting set.

Definition 1.2.3 (Hitting set generators). A function HSG : {0, 1}^s → {0, 1}^n is an (n, w, ε)-hitting set generator (HSG for short) if the image of HSG is an (n, w, ε)-hitting set. We refer to the input of HSG as the seed. Note that 2^s is an upper bound on the size of the hitting set.

Definition 1.2.4 (Pseudorandom distributions). A distribution D over n-bit strings is an (n, w, ε)-pseudorandom distribution if for every P ∈ P_{w,n} and v ∈ P_n,

| Pr[P(U_n) = v] − Pr[P(D) = v] | ≤ ε.

Clearly, the support of every (n,w, ε)-pseudorandom distribution is an (n,w, ε′)-hitting set

for any ε′ > ε. As with hitting sets, it is sometimes convenient to address the function that

generates the pseudorandom distribution.

Definition 1.2.5 (Pseudorandom generators). A function PRG : {0, 1}^s → {0, 1}^n is an (n, w, ε)-pseudorandom generator (PRG for short) if the distribution PRG(U_s) is (n, w, ε)-

pseudorandom. We refer to the input of PRG as the seed.


Do what you feel in your heart to be right -

for you’ll be criticized anyway.

Eleanor Roosevelt

2 Space-Bounded Learning Algorithms

The results in this chapter are based on joint work with Ran Raz and Avishay Tal [GRT18].

Can one prove unconditional lower bounds on the number of samples needed for learning,

under memory constraints? The study of the resources needed for learning under memory constraints was initiated by Shamir [Sha14] and by Steinhardt, Valiant and Wager [SVW16]. While

the main motivation for studying this question comes from learning theory, the problem is also

relevant to computational complexity and cryptography [Raz16, VV16, KRT17].

Steinhardt, Valiant and Wager conjectured that any algorithm for learning parities of size n

requires either a memory of size Ω(n^2) or an exponential number of samples. This conjecture

was proven in [Raz16], showing for the first time a learning problem that is infeasible under


super-linear memory constraints. Building on [Raz16], it was proved in [KRT17] that learning

parities of sparsity ℓ is also infeasible under memory constraints that are super-linear in n, as long

as ℓ ≥ ω(log n/ log log n). Consequently, learning linear-size DNF Formulas, linear-size Deci-

sion Trees and logarithmic-size Juntas were all proved to be infeasible under super-linear mem-

ory constraints [KRT17] (by a reduction from learning sparse parities). Can one prove similar

memory-samples lower bounds for other learning problems?

As in [Raz17], we represent a learning problem by a matrix. Let X, A be two finite sets of size

larger than 1 (where X represents the concept-class that we are trying to learn and A represents

the set of possible samples). Let M : A × X → {−1, 1} be a matrix. The matrix M represents the following learning problem: An unknown element x ∈ X was chosen uniformly at random. A learner tries to learn x from a stream of samples, (a_1, b_1), (a_2, b_2), . . ., where for every i, a_i ∈ A is chosen uniformly at random and b_i = M(a_i, x). Let n = log |X| and n′ = log |A|.

A general technique for proving memory-samples lower bounds was given in [Raz17]. The

main result of [Raz17] shows that if the norm of the matrix M is sufficiently small, then any

learning algorithm for the corresponding learning problem requires either a memory of size at

least Ω((min{n, n′})^2), or an exponential number of samples. This gives a general memory-

samples lower bound that applies for a large class of learning problems.

Independently of [Raz17], Moshkovitz and Moshkovitz also gave a general technique for proving memory-samples lower bounds [MM17]. Their initial result was that if M has a (sufficiently

strong) mixing property then any learning algorithm for the corresponding learning problem

requires either a memory of size at least 1.25 · min{n, n′} or an exponential number of sam-

ples [MM17]. In a recent subsequent work [MM18], they improved their result, and obtained

a theorem that is very similar to the one proved in [Raz17]. (The result of [MM18] is stated in

terms of a combinatorial mixing property, rather than matrix norm. The two notions are closely

related (see in particular Corollary 5.1 and Note 5.1 in [BL06])).


2.0.1 Main Results of this Chapter

The results of [Raz17] and [MM18] gave a lower bound of at most Ω((min{n, n′})^2) on the size of the memory, whereas the best that one could hope for, in the information theoretic setting

(that is, in the setting where the learner’s computational power is unbounded), is a lower bound

of Ω (n · n′), which may be significantly larger in cases where n is significantly larger than n′, or

vice versa.

In this work [GRT18], we build on [Raz17] and obtain a general memory-samples lower

bound that applies for a large class of learning problems and shows that for every problem in that

class, any learning algorithm requires either a memory of size at least Ω (n · n′) or an exponential

number of samples.

The main result is stated in terms of the properties of the matrix M as a two-source extrac-

tor. Two-source extractors, first studied by Santha and Vazirani [SV84] and Chor and Goldre-

ich [CG88], are central objects in the study of randomness and derandomization. We show

that even a relatively weak two-source extractor implies a relatively strong memory-samples lower bound. We note that two-source extractors have been extensively studied in numerous works

and there are known techniques for proving that certain matrices are relatively good two-source

extractors.

The main result can be stated as follows (Corollary 2.3.3): Assume that k, ℓ, r are such that

any submatrix of M of at least 2^{−k} · |A| rows and at least 2^{−ℓ} · |X| columns, has a bias of at most 2^{−r}. Then, any learning algorithm for the learning problem corresponding to M requires either a memory of size at least Ω(k · ℓ), or at least 2^{Ω(r)} samples. The result holds even if the learner has an exponentially small success probability (of 2^{−Ω(r)}). A more detailed result, in terms of the constants involved, is stated in Theorem 2.3.1 in terms of the properties of M as an L_2-Extractor, a new notion that we define in Definition 2.1.1, and is closely related to the notion of two-source

extractor. (The two notions are equivalent up to small changes in the parameters.)
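To see how such a statement is applied, here is a standard back-of-the-envelope instantiation for parity learning via Lindsey's lemma (added for exposition; it is not quoted from the thesis):

```latex
% Illustrative instantiation for parity learning (inner-product matrix).
% Let X = A = {0,1}^n and M(a,x) = (-1)^{\langle a,x\rangle}. Lindsey's lemma gives,
% for any R \subseteq A and S \subseteq X,
\[
  \Bigl|\sum_{a \in R}\sum_{x \in S} (-1)^{\langle a,x\rangle}\Bigr|
  \;\le\; \sqrt{|R|\,|S|\,2^{n}},
  \qquad\text{so the bias is at most } \sqrt{2^{n}/(|R|\,|S|)}.
\]
% With |R| >= 2^{-k}|A| and |S| >= 2^{-l}|X| for k = l = n/4, the bias is at most
% 2^{(k+l-n)/2} = 2^{-n/4}, i.e. r = n/4. The corollary then gives either
% Omega(k * l) = Omega(n^2) memory or 2^{Omega(n)} samples, recovering the parity bound.
```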

All the new results in this chapter (and all applications) hold even if the learner is only required

to weakly learn x, that is, to output a hypothesis h : A → {−1, 1} with a non-negligible correlation with the x-th column of the matrix M. We prove in Theorem 2.3.2 that even if the learner is only required to output a hypothesis that agrees with the x-th column of M on more than a 1/2 + 2^{−Ω(r)} fraction of the rows, the success probability is at most 2^{−Ω(r)}.

As in [Raz16, KRT17, Raz17] and Section 1.1.4, we model the learning algorithm by a branching program. A branching program is the strongest and most general model to use in this context.

Roughly speaking, the model allows a learner with infinite computational power, and bounds

only the memory size of the learner and the number of samples used.

As mentioned above, the result implies all previous memory-samples lower bounds, as well as

new applications. In particular:

1. Parities: A learner tries to learn x = (x_1, . . . , x_n) ∈ {0, 1}^n, from random linear equations over F_2. It was proved in [Raz16] (and follows also from [Raz17]) that any learning algorithm requires either a memory of size Ω(n^2) or an exponential number of samples.

The same result follows by Corollary 2.3.3 and the fact that inner product is a good two-

source extractor [CG88].

2. Sparse parities: A learner tries to learn x = (x_1, . . . , x_n) ∈ {0, 1}^n of sparsity ℓ, from random linear equations over F_2. In Section 2.4.2, we reprove the main results of [KRT17]. In particular, any learning algorithm requires:

(a) Assuming ℓ ≤ n/2: either a memory of size Ω(n · ℓ) or 2^{Ω(ℓ)} samples.

(b) Assuming ℓ ≤ n^{0.9}: either a memory of size Ω(n · ℓ^{0.99}) or ℓ^{Ω(ℓ)} samples.

3. Learning from sparse linear equations: A learner tries to learn x = (x_1, . . . , x_n) ∈ {0, 1}^n from random sparse linear equations, of sparsity ℓ, over F_2. In Section 2.4.3, we prove that any learning algorithm requires:

(a) Assuming ℓ ≤ n/2: either a memory of size Ω(n · ℓ) or 2^{Ω(ℓ)} samples.

(b) Assuming ℓ ≤ n^{0.9}: either a memory of size Ω(n · ℓ^{0.99}) or ℓ^{Ω(ℓ)} samples.

4. Learning from low-degree equations: A learner tries to learn x = (x_1, . . . , x_n) ∈ {0, 1}^n, from random multilinear polynomial equations of degree at most d, over F_2. In Section 2.4.4, we prove that if d ≤ 0.99 · n, any learning algorithm requires either a memory of size Ω(\binom{n}{≤ d} · n/d) or 2^{Ω(n/d)} samples.

5. Low-degree polynomials: A learner tries to learn an n′-variate multilinear polynomial p of degree at most d over F_2, from random evaluations of p over F_2^{n′}. In Section 2.4.5, we prove that if d ≤ 0.99 · n′, any learning algorithm requires either a memory of size Ω(\binom{n′}{≤ d} · n′/d) or 2^{Ω(n′/d)} samples.

6. Error-correcting codes: A learner tries to learn a codeword from random coordinates: Assume that M : A × X → {−1, 1} is such that for some |X|^{-1} ≤ ε < 1, any pair of different columns of M agree on at least ((1−ε)/2) · |A| and at most ((1+ε)/2) · |A| coordinates. In Section 2.4.6, we prove that any learning algorithm for the learning problem corresponding to M requires either a memory of size Ω((log |X|) · (log(1/ε))) or (1/ε)^{Ω(1)} samples. We also point to a relation between the results in this chapter and statistical-query dimension [Kea98, BFJ+94].

7. Random matrices: Let X, A be finite sets, such that |A| ≥ (2 log |X|)^{10} and |X| ≥ (2 log |A|)^{10}. Let M : A × X → {−1, 1} be a random matrix. Fix k = (1/2) log |A| and ℓ = (1/2) log |X|. With very high probability, any submatrix of M of at least 2^{-k} · |A| rows and at least 2^{-ℓ} · |X| columns has a bias of at most 2^{-Ω(min{k,ℓ})}. Thus, by Corollary 2.3.3, any learning algorithm for the learning problem corresponding to M requires either a memory of size Ω((log |X|) · (log |A|)), or (min{|X|, |A|})^{Ω(1)} samples.

We note also that the results about learning from sparse linear equations have applications in bounded-storage cryptography. This is similar to [Raz16, KRT17], but in a different range of the parameters. In particular, for every ω(log n) ≤ ℓ ≤ n, the main result gives an encryption scheme that requires a private key of length n, and time complexity of O(ℓ log n) per encryption/decryption of each bit, using a random access machine. The scheme is provably and unconditionally secure as long as the attacker uses at most o(nℓ) memory bits and the scheme is used at most 2^{o(ℓ)} times.
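To make these parameters concrete, here is a minimal sketch of the kind of bounded-storage scheme referred to above, under the assumption (as in [Raz16, KRT17]) that each message bit is masked by a parity of the key over a fresh random sparse support; the function names and the values of n and ℓ below are illustrative only, not the exact construction of the cited works.

```python
import random

def keygen(n):
    # Private key: a uniformly random n-bit string x.
    return [random.randrange(2) for _ in range(n)]

def sparse_parity(x, support):
    # <a, x> over F_2, where a is the 0/1 indicator vector of `support`.
    return sum(x[i] for i in support) % 2

def encrypt_bit(x, m, ell):
    # Mask the message bit m with a parity of x over a random size-ell support.
    support = random.sample(range(len(x)), ell)
    return support, sparse_parity(x, support) ^ m

def decrypt_bit(x, ciphertext):
    support, masked = ciphertext
    return masked ^ sparse_parity(x, support)

if __name__ == "__main__":
    n, ell = 1024, 64            # illustrative key length and sparsity
    x = keygen(n)
    c = encrypt_bit(x, 1, ell)
    assert decrypt_bit(x, c) == 1
```

Each encryption/decryption touches only ℓ key positions (O(ℓ log n) time on a random access machine), matching the cost stated above; the memory-samples lower bound is what underlies the claimed security against o(nℓ)-memory attackers.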

Generalization to Non-Product Distributions In addition to all these results, we

give in Section 2.5 a generalization of Theorem 2.3.1 to the case where the samples a ∈ A de-

pend on the unknown concept x ∈ X. In this case, b is redundant and the learning problem is described by the joint distribution p : A × X → [0, 1] of the joint random variable (A, X). The joint distribution corresponds to the following learning problem: An unknown element x ∈ X is chosen uniformly at random. A learner tries to learn x from a stream of samples, a_1, a_2, . . ., where for every i, a_i ∈ A is chosen (independently) according to the conditional distribution p_{A|X=x}. We stress that in Section 2.5, the joint distribution p_{A,X} is not a product distribution. We assume for simplicity that the marginal p_X is the uniform distribution over X.

The main result in Section 2.5, Theorem 2.5.1, requires that for some p, the distribution p_{A|X=x} is bounded by 2^p · p_A (that is, for every a′ ∈ A, x′ ∈ X, Pr(A = a′ | X = x′) ≤ 2^p · Pr(A = a′)). We view this assumption as quite natural and general. The assumption limits the information that each sample gives about the concept to at most p bits. In addition, the theorem requires that the matrix M = p_{A|X=x} − p_A (viewed as a matrix M : A × X → [−1, 1]) satisfies an "extractor-like" property that is similar to the ones used in Theorem 2.3.1 and Corollary 2.3.3. Roughly speaking, the property holds if for some k, ℓ, r, any submatrix of M with at least 2^{-k} probability mass of rows (under the distribution p_A) and at least 2^{-ℓ} · |X| columns satisfies the following: in almost every row a′ (of the submatrix), the average of all entries is at most 2^{-r} · p_A(a′) (and in that sense the row is roughly unbiased). Under these assumptions, Theorem 2.5.1 shows that any learning algorithm for the corresponding learning problem requires either a memory of size at least Ω(k · ℓ / p), or at least 2^{Ω(r)} samples. (Intuitively, we lose a factor of p in the bound on the memory size, because each sample a may give up to p bits of information about the concept x.)

We note that besides the obvious motivation of studying non-product distributions of concepts and samples, Theorem 2.5.1 can also be used to handle cases where the output b is longer than one bit and cases where the output's distribution is non-uniform. For example, the theorem implies that for any finite field F, learning a string x ∈ F^n from random linear equations requires either a memory of size Ω(n^2 log |F|), or an exponential number of equations. This bound is tight and can be viewed as a generalization of the memory-samples lower bound for parity learning to general finite fields. (See Section 2.5.5 for more details.)
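To see where a bound of this shape can come from, here is a rough and purely heuristic parameter accounting; the precise argument is the one in Section 2.5.5. Suppose, as one would expect by analogy with the F_2 case, that the relevant extractor parameters for linear equations over F scale as k, ℓ = Ω(n log |F|) (this scaling is an assumption made only for this illustration), and note that each sample reveals at most p = log |F| bits about x (the value b ∈ F of the queried equation). Then the memory bound of Theorem 2.5.1 reads

Ω(k · ℓ / p) = Ω( (n log |F|) · (n log |F|) / log |F| ) = Ω(n^2 log |F|).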

Techniques   The proof follows the lines of the proof of [Raz17] and builds on that proof. The proof of [Raz17] considered the norm of the matrix M, and thus essentially reduced the entire matrix to only one parameter. In our proof, we consider the properties of M as a two-source extractor, and hence we have three parameters (k, ℓ, r), rather than one. Considering these three parameters, rather than one, enables a more refined analysis, resulting in a stronger lower bound with a slightly simpler proof. A proof outline is given in Section 2.2.

Motivation and Discussion Many previous works studied the resources needed for

learning, under certain information, communication or memory constraints (see in particu-

lar [Sha14, SVW16, Raz16, VV16, KRT17, MM17, Raz17, MT17, MM18] and the many ref-

erences given there). A main message of some of these works is that for some learning problems,

access to a relatively large memory is crucial. In other words, in some cases, learning is infeasible,

due to memory constraints. From the point of view of human learning, such results may help

to explain the importance of memory in cognitive processes. From the point of view of machine

learning, these results imply that a large class of learning algorithms cannot learn certain concept

classes. In particular, this applies to any bounded-memory learning algorithm that considers the

samples one by one. In addition, these works are related to computational complexity and have

applications in cryptography.

Independently of [GRT18], Beame, Oveis Gharan and Yang also gave a combinatorial property of a matrix M that holds for a large class of matrices and implies that any learning algorithm for the corresponding learning problem requires either a memory of size Ω((log |X|) · (log |A|)) or an exponential number of samples (when |A| ≤ |X|) [BGY18]. Their property is based on a measure of how matrices amplify the 2-norms of probability distributions that is more refined than the 2-norms of these matrices. Their proof also builds on [Raz17]. They also show, as an application, tight time-space lower bounds for learning low-degree polynomials, as well as other applications.

2.1 Preliminaries

Before proceeding to the main result, we recall certain preliminaries from Section 1.1. For an integer n, [n] represents {1, . . . , n}. We denote by U_X : X → R^+ the uniform distribution over X. We write \binom{n}{≤ k} = \binom{n}{0} + \binom{n}{1} + · · · + \binom{n}{k}. For a, x ∈ X, a · x represents the inner product modulo 2. For a random variable Z and an event E, we denote by P_Z the distribution of the random variable Z, and we denote by P_{Z|E} the distribution of the random variable Z conditioned on the event E.

Let X, A be two finite sets of size larger than 1 (n = log_2 |X|). Let M : A × X → {−1, 1} be a matrix. The matrix M corresponds to the following learning problem, as defined in Section 1.1.1. An unknown element x ∈ X is chosen uniformly at random. A learner tries to learn x from samples (a, b), where a ∈ A is chosen uniformly at random and b = M(a, x). That is, the learning algorithm is given a stream of samples, (a_1, b_1), (a_2, b_2), . . ., where each a_t is uniformly distributed and for every t, b_t = M(a_t, x).
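To make the setup concrete, the following minimal sketch (in Python; the matrix and all names are illustrative) generates the stream of samples that such a learner observes, instantiated with the parity matrix M(a, x) = (−1)^{a·x} on 3-bit strings.

```python
import random

def sample_stream(M, x, num_samples):
    # Yield samples (a, b) with a uniform over the row set A and b = M(a, x).
    A = sorted({a for (a, _) in M})
    for _ in range(num_samples):
        a = random.choice(A)
        yield a, M[(a, x)]

# The parity matrix M(a, x) = (-1)^(a . x) on 3-bit strings.
n = 3
strings = [tuple((v >> i) & 1 for i in range(n)) for v in range(2 ** n)]
M = {(a, x): (-1) ** sum(ai * xi for ai, xi in zip(a, x)) for a in strings for x in strings}
secret = random.choice(strings)
for a, b in sample_stream(M, secret, 5):
    print(a, b)
```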

Norms and Inner Products

Let p ≥ 1. For a function f : X → R, denote by ‖f‖_p the ℓ_p norm of f with respect to the uniform distribution over X, that is:

‖f‖_p = ( E_{x ∈_R X} [ |f(x)|^p ] )^{1/p}.

For two functions f, g : X → R, define their inner product with respect to the uniform distribution over X as

⟨f, g⟩ = E_{x ∈_R X} [ f(x) · g(x) ].

For a matrix M : A × X → R and a row a ∈ A, we denote by M_a : X → R the function corresponding to the a-th row of M. Note that for a function f : X → R, we have

⟨M_a, f⟩ = (M · f)_a / |X|.
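As a quick numerical sanity check of these expectation-normalized definitions (a sketch; the array sizes and the random seed are arbitrary):

```python
import numpy as np

def norm_p(f, p):
    # l_p norm with respect to the uniform distribution over X (an expectation, not a sum).
    return (np.abs(f) ** p).mean() ** (1.0 / p)

def inner(f, g):
    # Inner product with respect to the uniform distribution over X.
    return (f * g).mean()

rng = np.random.default_rng(0)
A_size, X_size = 8, 16
M = rng.choice([-1.0, 1.0], size=(A_size, X_size))
f = rng.random(X_size)
a = 3
# The identity <M_a, f> = (M f)_a / |X| from above.
assert np.isclose(inner(M[a], f), (M @ f)[a] / X_size)
print(norm_p(f, 2) / norm_p(f, 1))  # the l2-to-l1 ratio used throughout this chapter
```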

L2-Extractors and L∞-Extractors

Definition 2.1.1. L2-Extractor: Let X, A be two finite sets. A matrix M : A × X → {−1, 1} is a (k, ℓ)-L2-Extractor with error 2^{-r} if for every non-negative f : X → R with ‖f‖_2 / ‖f‖_1 ≤ 2^ℓ there are at most 2^{-k} · |A| rows a in A with

|⟨M_a, f⟩| / ‖f‖_1 ≥ 2^{-r}.

Let Ω be a finite set. We denote a distribution over Ω as a function f : Ω → R^+ such that Σ_{x∈Ω} f(x) = 1. We say that a distribution f : Ω → R^+ has min-entropy k if for all x ∈ Ω, we have f(x) ≤ 2^{-k}.

Definition 2.1.2. L∞-Extractor: Let X, A be two finite sets. A matrix M : A × X → {−1, 1} is a (k, ℓ ∼ r)-L∞-Extractor if for every distribution p_x : X → R^+ with min-entropy at least log(|X|) − ℓ and every distribution p_a : A → R^+ with min-entropy at least log(|A|) − k,

| Σ_{a′∈A} Σ_{x′∈X} p_a(a′) · p_x(x′) · M(a′, x′) | ≤ 2^{-r}.
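For intuition, the submatrix-bias form of this property (the one used in Corollary 2.3.3) can be checked by brute force on toy-sized matrices; the following sketch is exponential-time and purely illustrative.

```python
import itertools
import numpy as np

def max_bias(M, min_rows, min_cols):
    # Largest bias over all submatrices with at least min_rows rows and
    # min_cols columns (feasible only for tiny matrices).
    n_rows, n_cols = M.shape
    worst = 0.0
    for r in range(min_rows, n_rows + 1):
        for rows in itertools.combinations(range(n_rows), r):
            sub = M[list(rows), :]
            for c in range(min_cols, n_cols + 1):
                for cols in itertools.combinations(range(n_cols), c):
                    worst = max(worst, abs(sub[:, list(cols)].mean()))
    return worst

# Example: the inner-product matrix on 3-bit strings, submatrices of at least half the rows/columns.
n = 3
vecs = [tuple((v >> i) & 1 for i in range(n)) for v in range(2 ** n)]
M = np.array([[(-1.0) ** sum(a * x for a, x in zip(u, w)) for w in vecs] for u in vecs])
print(max_bias(M, min_rows=4, min_cols=4))
```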

Branching Program for a Learning Problem

We model the learner for the learning problem that corresponds to the matrix M by a branching program, as in Section 1.1.4.


Definition 2.1.3. Branching Program for a Learning Problem: A branching program of length m and width d, for learning, is a directed (multi) graph with vertices arranged in m + 1 layers containing at most d vertices each. In the first layer, that we think of as layer 0, there is only one vertex, called the start vertex. A vertex of outdegree 0 is called a leaf. All vertices in the last layer are leaves (but there may be additional leaves). Every non-leaf vertex in the program has 2|A| outgoing edges, labeled by elements (a, b) ∈ A × {−1, 1}, with exactly one edge labeled by each such (a, b), and all these edges going into vertices in the next layer. Each leaf v in the program is labeled by an element x(v) ∈ X, that we think of as the output of the program on that leaf.

Computation-Path: The samples (a_1, b_1), . . . , (a_m, b_m) ∈ A × {−1, 1} that are given as input define a computation-path in the branching program, by starting from the start vertex and following at step t the edge labeled by (a_t, b_t), until reaching a leaf. The program outputs the label x(v) of the leaf v reached by the computation-path.

Success Probability: The success probability of the program is the probability that the element output by the program equals x, where the probability is over x, a_1, . . . , a_m (x is uniformly distributed over X, a_1, . . . , a_m are uniformly distributed over A, and for every t, b_t = M(a_t, x)).
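A minimal sketch of how the computation-path is followed in this model (the dictionary representation, vertex names, and the toy program below are made up purely for illustration):

```python
def follow_computation_path(program, start, samples):
    # `program` maps a non-leaf vertex to a dict from samples (a, b) to the next
    # vertex; vertices absent from `program` are leaves. Returns the vertex at
    # which the path stops; its label x(v) would be the program's output.
    v = start
    for a, b in samples:
        if v not in program:
            break
        v = program[v][(a, b)]
    return v

# A toy width-2, length-1 program over A = {0, 1} (leaf labels omitted).
toy = {"start": {(0, -1): "u", (0, 1): "w", (1, -1): "u", (1, 1): "w"}}
print(follow_computation_path(toy, "start", [(1, 1)]))   # reaches leaf "w"
```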

2.2 Overview of the Proof

The proof follows the lines of the proof of [Raz17] and builds on that proof.

Assume that M is a (k, ℓ)-L2-extractor with error 2^{-r′}, and let r = min{k, ℓ, r′}. Let B be a branching program for the learning problem that corresponds to the matrix M. Assume for a contradiction that B is of length m = 2^{εr} and width d = 2^{εkℓ}, where ε is a small constant.

We define the truncated-path, T, to be the same as the computation-path of B, except that it sometimes stops before reaching a leaf. Roughly speaking, T stops before reaching a leaf if certain "bad" events occur. Nevertheless, we show that the probability that T stops before reaching a leaf is negligible, so we can think of T as almost identical to the computation-path.

For a vertex v of B, we denote by E_v the event that T reaches the vertex v. We denote by Pr(v) = Pr(E_v) the probability of E_v (where the probability is over x, a_1, . . . , a_m), and we denote by P_{x|v} = P_{x|E_v} the distribution of the random variable x conditioned on the event E_v. Similarly, for an edge e of the branching program B, let E_e be the event that T traverses the edge e. Denote Pr(e) = Pr(E_e), and P_{x|e} = P_{x|E_e}.

A vertex v of B is called significant if

‖P_{x|v}‖_2 > 2^ℓ · 2^{-n}.

Roughly speaking, this means that conditioning on the event that T reaches the vertex v, a non-

negligible amount of information is known about x. In order to guess x with a non-negligible

success probability, T must reach a significant vertex. Lemma 2.3.1 shows that the probability

that T reaches any significant vertex is negligible, and thus the main result follows.

To prove Lemma 2.3.1, we show that for every fixed significant vertex s, the probability that T reaches s is at most 2^{-Ω(kℓ)} (which is smaller than one over the number of vertices in B). Hence, we can use a union bound to prove the lemma.

The proof that the probability that T reaches s is extremely small is the main part of the proof. To that end, we use the following functions to measure the progress made by the branching program towards reaching s.

Let L_i be the set of vertices v in layer-i of B such that Pr(v) > 0. Let Γ_i be the set of edges e from layer-(i−1) of B to layer-i of B such that Pr(e) > 0. Let

Z_i = Σ_{v ∈ L_i} Pr(v) · ⟨P_{x|v}, P_{x|s}⟩^k,

Z′_i = Σ_{e ∈ Γ_i} Pr(e) · ⟨P_{x|e}, P_{x|s}⟩^k.

We think of Z_i, Z′_i as measuring the progress made by the branching program towards reaching a state with distribution similar to P_{x|s}.

We show that each Z_i may only be negligibly larger than Z_{i−1}. Hence, since it is easy to calculate that Z_0 = 2^{-2nk}, it follows that Z_i is close to 2^{-2nk}, for every i. On the other hand, if s is in layer-i then Z_i is at least Pr(s) · ⟨P_{x|s}, P_{x|s}⟩^k. Thus, Pr(s) · ⟨P_{x|s}, P_{x|s}⟩^k cannot be much larger than 2^{-2nk}. Since s is significant, ⟨P_{x|s}, P_{x|s}⟩^k > 2^{2ℓk} · 2^{-2nk}, and hence Pr(s) is at most 2^{-Ω(kℓ)}.
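To spell this last step out (the precise computation appears in Section 2.3.3): combining the upper bound Z_i ≤ 2^{-2nk + O(k) + O(r)} with the lower bound Z_i ≥ Pr(s) · 2^{2ℓk} · 2^{-2nk} gives

Pr(s) ≤ 2^{O(k)+O(r)} · 2^{-2ℓk} = 2^{-Ω(kℓ)},

which is small enough to survive a union bound over the roughly 2^{εr} · 2^{εkℓ} vertices of B.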

The proof that Z_i may only be negligibly larger than Z_{i−1} is done in two steps: Claim 2.3.11 shows by a simple convexity argument that Z_i ≤ Z′_i. The hard part, done in Claim 2.3.9 and Claim 2.3.10, is to prove that Z′_i may only be negligibly larger than Z_{i−1}.

For this proof, we define for every vertex v the set of edges Γ_out(v) that are going out of v such that Pr(e) > 0. Claim 2.3.9 shows that for every vertex v,

Σ_{e ∈ Γ_out(v)} Pr(e) · ⟨P_{x|e}, P_{x|s}⟩^k

may only be negligibly higher than

Pr(v) · ⟨P_{x|v}, P_{x|s}⟩^k.

For the proof of Claim 2.3.9, which is the hardest proof in the chapter, and the most important place where the proof (in this chapter) deviates from (and simplifies) the proof of [Raz17], we consider the function P_{x|v} · P_{x|s}. We first show how to bound ‖P_{x|v} · P_{x|s}‖_2. We then consider two cases: If ‖P_{x|v} · P_{x|s}‖_1 is negligible, then ⟨P_{x|v}, P_{x|s}⟩^k is negligible and doesn't contribute much, and we show that for every e ∈ Γ_out(v), ⟨P_{x|e}, P_{x|s}⟩^k is also negligible and doesn't contribute much. If ‖P_{x|v} · P_{x|s}‖_1 is non-negligible, we use the bound on ‖P_{x|v} · P_{x|s}‖_2 and the assumption that M is a (k, ℓ)-L2-extractor to show that for almost all edges e ∈ Γ_out(v), ⟨P_{x|e}, P_{x|s}⟩^k is very close to ⟨P_{x|v}, P_{x|s}⟩^k. Only an exponentially small (2^{-k}) fraction of edges are "bad" and give a significantly larger ⟨P_{x|e}, P_{x|s}⟩^k.

The reason that in the definitions of Z_i and Z′_i we raised ⟨P_{x|v}, P_{x|s}⟩ and ⟨P_{x|e}, P_{x|s}⟩ to the power of k is that this is the largest power for which the contribution of the "bad" edges is still small (as their fraction is 2^{-k}).

This outline oversimplifies many details. Let us briefly mention two of them. First, it is not so easy to bound ‖P_{x|v} · P_{x|s}‖_2. We do that by bounding ‖P_{x|s}‖_2 and ‖P_{x|v}‖_∞. In order to bound ‖P_{x|s}‖_2, we force T to stop whenever it reaches a significant vertex (and thus we are able to bound ‖P_{x|v}‖_2 for every vertex reached by T). In order to bound ‖P_{x|v}‖_∞, we force T to stop whenever P_{x|v}(x) is large, which allows us to consider only the "bounded" part of P_{x|v}. (This is related to the technique of flattening a distribution that was used in [KR13].) Second, some edges are so "bad" that their contribution to Z′_i is huge, so they cannot be ignored. We force T to stop before traversing any such edge. (This is related to an idea that was used in [KRT17] of analyzing separately paths that traverse "bad" edges.) We show that the total probability that T stops before reaching a leaf is negligible.

2.3 Main Result

Theorem 2.3.1. Let 1/100 < c < 2/3. Fix γ to be such that 3c/2 < γ^2 < 1. Let X, A be two finite sets. Let n = log_2 |X|. Let M : A × X → {−1, 1} be a matrix which is a (k′, ℓ′)-L2-extractor with error 2^{-r′}, for sufficiently large k′, ℓ′ and r′ (that is, larger than some constant that depends on γ), where ℓ′ ≤ n. Let

r := min{ r′/2 , (1−γ)k′/2 , (1−γ)ℓ′/2 − 1 }.    (2.1)

Let B be a branching program of length at most 2^r and width at most 2^{c·k′·ℓ′} for the learning problem that corresponds to the matrix M. Then, the success probability of B is at most O(2^{-r}).

Proof. Let

k := γk′ and ℓ := γℓ′/3.    (2.2)

Note that by the assumption that k′, ℓ′ and r′ are sufficiently large, we get that k, ℓ and r are also sufficiently large. Since ℓ′ ≤ n, we have ℓ + r ≤ γℓ′/3 + (1−γ)ℓ′/2 < ℓ′/2 ≤ n/2. Thus,

r < n/2 − ℓ.    (2.3)

Let B be a branching program of length m = 2^r and width d = 2^{c·k′·ℓ′} for the learning problem that corresponds to the matrix M. We will show that the success probability of B is at most O(2^{-r}).

2.3.1 The Truncated-Path and Additional Definitions and Notation

We will define the truncated-path, T, to be the same as the computation-path of B, except that it sometimes stops before reaching a leaf. Formally, we define T, together with several other definitions and notations, by induction on the layers of the branching program B.

Assume that we already defined the truncated-path T until it reaches layer-i of B. For a vertex v in layer-i of B, let E_v be the event that T reaches the vertex v. For simplicity, we denote by Pr(v) = Pr(E_v) the probability of E_v (where the probability is over x, a_1, . . . , a_m), and we denote by P_{x|v} = P_{x|E_v} the distribution of the random variable x conditioned on the event E_v.

There will be three cases in which the truncated-path T stops on a non-leaf v:

1. If v is a so-called significant vertex, where the ℓ_2 norm of P_{x|v} is non-negligible. (Intuitively, this means that conditioned on the event that T reaches v, a non-negligible amount of information is known about x.)

2. If P_{x|v}(x) is non-negligible. (Intuitively, this means that conditioned on the event that T reaches v, the correct element x could have been guessed with a non-negligible probability.)

3. If (M · P_{x|v})(a_{i+1}) is non-negligible. (Intuitively, this means that T is about to traverse a "bad" edge, which is traversed with a non-negligibly higher or lower probability than other edges.)

Next, we describe these three cases more formally.


Significant Vertices

We say that a vertex v in layer-i of B is significant if

‖P_{x|v}‖_2 > 2^ℓ · 2^{-n}.

Significant Values

Even if v is not significant, P_{x|v} may have relatively large values. For a vertex v in layer-i of B, denote by Sig(v) the set of all x′ ∈ X such that

P_{x|v}(x′) > 2^{2ℓ+2r} · 2^{-n}.

Bad Edges

For a vertex v in layer-i of B, denote by Bad(v) the set of all α ∈ A such that

|(M · P_{x|v})(α)| ≥ 2^{-r′}.

The Truncated-Path T

We define T by induction on the layers of the branching program B. Assume that we already

defined T until it reaches a vertex v in layer-i of B. The path T stops on v if (at least) one of the

following occurs:

1. v is significant.

2. x ∈ Sig(v).

3. a_{i+1} ∈ Bad(v).

4. v is a leaf.


Otherwise, T proceeds by following the edge labeled by (a_{i+1}, b_{i+1}) (same as the computation-path).

2.3.2 Proof of Theorem 2.3.1

Since T follows the computation-path of B, except that it sometimes stops before reaching a

leaf, the success probability of B is bounded (from above) by the probability that T stops before

reaching a leaf, plus the probability that T reaches a leaf v and x(v) = x.

The main lemma needed for the proof of Theorem 2.3.1 is Lemma 2.3.1, which shows that the probability that T reaches a significant vertex is at most O(2^{-r}).

Lemma 2.3.1. The probability that T reaches a significant vertex is at most O(2^{-r}).

Lemma 2.3.1 is proved in Section 2.3.3. We will now show how the proof of Theorem 2.3.1

follows from that lemma.

Lemma 2.3.1 shows that the probability that T stops on a non-leaf vertex because of the first reason (i.e., that the vertex is significant) is small. The next two claims imply that the probabilities that T stops on a non-leaf vertex because of the second and third reasons are also small.

Claim 2.3.1. If v is a non-significant vertex of B, then

Pr_x[x ∈ Sig(v) | E_v] ≤ 2^{-2r}.

Proof. Since v is not significant,

E_{x′ ∼ P_{x|v}} [ P_{x|v}(x′) ] = Σ_{x′ ∈ X} P_{x|v}(x′)^2 = 2^n · E_{x′ ∈_R X} [ P_{x|v}(x′)^2 ] ≤ 2^{2ℓ} · 2^{-n}.

Hence, by Markov's inequality,

Pr_{x′ ∼ P_{x|v}} [ P_{x|v}(x′) > 2^{2r} · 2^{2ℓ} · 2^{-n} ] ≤ 2^{-2r}.

Since conditioned on E_v the distribution of x is P_{x|v}, we obtain

Pr_x [ x ∈ Sig(v) | E_v ] = Pr_x [ P_{x|v}(x) > 2^{2r} · 2^{2ℓ} · 2^{-n} | E_v ] ≤ 2^{-2r}.

Claim 2.3.2. If v is a non-significant vertex of B, then

Pr_{a_{i+1}} [ a_{i+1} ∈ Bad(v) ] ≤ 2^{-2r}.

Proof. Since v is not significant, ‖P_{x|v}‖_2 ≤ 2^ℓ · 2^{-n}. Since P_{x|v} is a distribution, ‖P_{x|v}‖_1 = 2^{-n}. Thus,

‖P_{x|v}‖_2 / ‖P_{x|v}‖_1 ≤ 2^ℓ ≤ 2^{ℓ′}.

Since M is a (k′, ℓ′)-L2-extractor with error 2^{-r′}, there are at most 2^{-k′} · |A| elements α ∈ A with

|⟨M_α, P_{x|v}⟩| ≥ 2^{-r′} · ‖P_{x|v}‖_1 = 2^{-r′} · 2^{-n}.

The claim follows since a_{i+1} is uniformly distributed over A and since k′ ≥ 2r (Equation (2.1)).

We can now use Lemma 2.3.1, Claim 2.3.1 and Claim 2.3.2 to prove that the probability that T stops before reaching a leaf is at most O(2^{-r}). Lemma 2.3.1 shows that the probability that T reaches a significant vertex, and hence stops because of the first reason, is at most O(2^{-r}). Assuming that T doesn't reach any significant vertex (in which case it would have stopped because of the first reason), Claim 2.3.1 shows that in each step, the probability that T stops because of the second reason is at most 2^{-2r}. Taking a union bound over the m = 2^r steps, the total probability that T stops because of the second reason is at most 2^{-r}. In the same way, assuming that T doesn't reach any significant vertex (in which case it would have stopped because of the first reason), Claim 2.3.2 shows that in each step, the probability that T stops because of the third reason is at most 2^{-2r}. Again, taking a union bound over the 2^r steps, the total probability that T stops because of the third reason is at most 2^{-r}. Thus, the total probability that T stops (for any reason) before reaching a leaf is at most O(2^{-r}).

Recall that if T doesn’t stop before reaching a leaf, it just follows the computation-path of B.

Recall also that byLemma2.3.1, theprobability thatT reaches a significant leaf is atmostO(2−r).

Thus, to bound (from above) the success probability of B by O(2−r), it remains to bound the

probability that T reaches a non-significant leaf v and x(v) = x. Claim 2.3.3 shows that for any

non-significant leaf v, conditioned on the event that T reaches v, the probability for x(v) = x is

at most 2−r, which completes the proof of Theorem 2.3.1.

Claim 2.3.3. If v is a non-significant leaf of B then

Pr[x(v) = x | Ev] ≤ 2−r.

Proof. Since v is not significant,

E_{x′ ∈_R X} [ P_{x|v}(x′)^2 ] ≤ 2^{2ℓ} · 2^{-2n}.

Hence, for every x′ ∈ X,

Pr[x = x′ | E_v] = P_{x|v}(x′) ≤ 2^ℓ · 2^{-n/2} ≤ 2^{-r},

since r ≤ n/2 − ℓ (Equation (2.3)). In particular,

Pr[x(v) = x | E_v] ≤ 2^{-r}.

This completes the proof of Theorem 2.3.1.


2.3.3 Proof of Lemma 2.3.1

Proof. We need to prove that the probability that T reaches any significant vertex is at most O(2^{-r}). Let s be a significant vertex of B. We will bound from above the probability that T reaches s, and then use a union bound over all significant vertices of B. Interestingly, the upper bound on the width of B is used only in the union bound.

The Distributions P_{x|v} and P_{x|e}

Recall that for a vertex v of B, we denote by E_v the event that T reaches the vertex v. For simplicity, we denote by Pr(v) = Pr(E_v) the probability of E_v (where the probability is over x, a_1, . . . , a_m), and we denote by P_{x|v} = P_{x|E_v} the distribution of the random variable x conditioned on the event E_v.

Similarly, for an edge e of the branching program B, let E_e be the event that T traverses the edge e. Denote Pr(e) = Pr(E_e) (where the probability is over x, a_1, . . . , a_m), and P_{x|e} = P_{x|E_e}.

Claim 2.3.4. For any edge e = (v, u) of B, labeled by (a, b), such that Pr(e) > 0, for any x′ ∈ X,

P_{x|e}(x′) = 0                          if x′ ∈ Sig(v) or M(a, x′) ≠ b,
P_{x|e}(x′) = P_{x|v}(x′) · c_e^{-1}     if x′ ∉ Sig(v) and M(a, x′) = b,

where c_e is a normalization factor that satisfies

c_e ≥ 1/2 − 2 · 2^{-2r}.

Proof. Let e = (v, u) be an edge of B, labeled by (a, b), such that Pr(e) > 0. Since Pr(e) > 0, the vertex v is not significant (as otherwise T always stops on v and hence Pr(e) = 0). Also, since Pr(e) > 0, we know that a ∉ Bad(v) (as otherwise T never traverses e and hence Pr(e) = 0).

If T reaches v, it traverses the edge e if and only if: x ∉ Sig(v) (as otherwise T stops on v) and M(a, x) = b and a_{i+1} = a. Therefore, for any x′ ∈ X,

P_{x|e}(x′) = 0                          if x′ ∈ Sig(v) or M(a, x′) ≠ b,
P_{x|e}(x′) = P_{x|v}(x′) · c_e^{-1}     if x′ ∉ Sig(v) and M(a, x′) = b,

where c_e is a normalization factor, given by

c_e = Σ_{x′ : x′ ∉ Sig(v) ∧ M(a,x′)=b} P_{x|v}(x′) = Pr_x[(x ∉ Sig(v)) ∧ (M(a, x) = b) | E_v].

Since v is not significant, by Claim 2.3.1,

Pr_x[x ∈ Sig(v) | E_v] ≤ 2^{-2r}.

Since a ∉ Bad(v),

| Pr_x[M(a, x) = 1 | E_v] − Pr_x[M(a, x) = −1 | E_v] | = |(M · P_{x|v})(a)| ≤ 2^{-r′},

and hence

Pr_x[M(a, x) = b | E_v] ≤ 1/2 + 2^{-r′}.

Hence, by the union bound,

c_e = Pr_x[(x ∉ Sig(v)) ∧ (M(a, x) = b) | E_v] ≥ 1/2 − 2^{-r′} − 2^{-2r} ≥ 1/2 − 2 · 2^{-2r}

(where the last inequality follows since r ≤ r′/2, by Equation (2.1)).

Bounding the Norm of P_{x|s}

We will show that ‖P_{x|s}‖_2 cannot be too large. Towards this, we will first prove that for every edge e of B that is traversed by T with probability larger than zero, ‖P_{x|e}‖_2 cannot be too large.


Claim 2.3.5. For any edge e of B such that Pr(e) > 0,

‖P_{x|e}‖_2 ≤ 4 · 2^ℓ · 2^{-n}.

Proof. Let e = (v, u) be an edge of B, labeled by (a, b), such that Pr(e) > 0. Since Pr(e) > 0, the vertex v is not significant (as otherwise T always stops on v and hence Pr(e) = 0). Thus,

‖P_{x|v}‖_2 ≤ 2^ℓ · 2^{-n}.

By Claim 2.3.4, for any x′ ∈ X,

P_{x|e}(x′) = 0                          if x′ ∈ Sig(v) or M(a, x′) ≠ b,
P_{x|e}(x′) = P_{x|v}(x′) · c_e^{-1}     if x′ ∉ Sig(v) and M(a, x′) = b,

where c_e satisfies

c_e ≥ 1/2 − 2 · 2^{-2r} > 1/4

(where the last inequality holds because we assume that k′, ℓ′, r′ and thus r are sufficiently large). Thus,

‖P_{x|e}‖_2 ≤ c_e^{-1} · ‖P_{x|v}‖_2 ≤ 4 · 2^ℓ · 2^{-n}.

Claim 2.3.6.

‖P_{x|s}‖_2 ≤ 4 · 2^ℓ · 2^{-n}.

Proof. Let Γ_in(s) be the set of all edges e of B that are going into s, such that Pr(e) > 0. Note that

Σ_{e ∈ Γ_in(s)} Pr(e) = Pr(s).

By the law of total probability, for every x′ ∈ X,

P_{x|s}(x′) = Σ_{e ∈ Γ_in(s)} (Pr(e)/Pr(s)) · P_{x|e}(x′),

and hence by Jensen's inequality,

P_{x|s}(x′)^2 ≤ Σ_{e ∈ Γ_in(s)} (Pr(e)/Pr(s)) · P_{x|e}(x′)^2.

Summing over x′ ∈ X, we obtain

‖P_{x|s}‖_2^2 ≤ Σ_{e ∈ Γ_in(s)} (Pr(e)/Pr(s)) · ‖P_{x|e}‖_2^2.

By Claim 2.3.5, for any e ∈ Γ_in(s),

‖P_{x|e}‖_2^2 ≤ (4 · 2^ℓ · 2^{-n})^2.

Hence,

‖P_{x|s}‖_2^2 ≤ (4 · 2^ℓ · 2^{-n})^2.

Similarity to a Target Distribution

Recall that for two functions f, g : X → R^+, we defined

⟨f, g⟩ = E_{z ∈_R X} [f(z) · g(z)].

We think of ⟨f, g⟩ as a measure of the similarity between a function f and a target function g. Typically f, g will be distributions.

Claim 2.3.7.

⟨P_{x|s}, P_{x|s}⟩ > 2^{2ℓ} · 2^{-2n}.

Proof. Since s is significant,

⟨P_{x|s}, P_{x|s}⟩ = ‖P_{x|s}‖_2^2 > 2^{2ℓ} · 2^{-2n}.

Claim 2.3.8.

⟨U_X, P_{x|s}⟩ = 2^{-2n},

where U_X is the uniform distribution over X.

Proof. Since P_{x|s} is a distribution,

⟨U_X, P_{x|s}⟩ = 2^{-2n} · Σ_{z∈X} P_{x|s}(z) = 2^{-2n}.

Measuring the Progress

For i ∈ {0, . . . , m}, let L_i be the set of vertices v in layer-i of B such that Pr(v) > 0. For i ∈ {1, . . . , m}, let Γ_i be the set of edges e from layer-(i−1) of B to layer-i of B such that Pr(e) > 0. Recall that k = γk′ (Equation (2.2)).

For i ∈ {0, . . . , m}, let

Z_i = Σ_{v ∈ L_i} Pr(v) · ⟨P_{x|v}, P_{x|s}⟩^k.

For i ∈ {1, . . . , m}, let

Z′_i = Σ_{e ∈ Γ_i} Pr(e) · ⟨P_{x|e}, P_{x|s}⟩^k.

We think of Z_i, Z′_i as measuring the progress made by the branching program towards reaching a state with distribution similar to P_{x|s}.

For a vertex v of B, let Γ_out(v) be the set of all edges e of B that are going out of v, such that Pr(e) > 0. Note that

Σ_{e ∈ Γ_out(v)} Pr(e) ≤ Pr(v).

(We don't always have an equality here, since sometimes T stops on v.)


The next four claims show that the progress made by the branching program is slow.

Claim 2.3.9. For every vertex v of B such that Pr(v) > 0,

Σ_{e ∈ Γ_out(v)} (Pr(e)/Pr(v)) · ⟨P_{x|e}, P_{x|s}⟩^k ≤ ⟨P_{x|v}, P_{x|s}⟩^k · (1 + 2^{-r})^k + (2^{-2n+2})^k.

Proof. If v is significant or v is a leaf, then T always stops on v, hence Γ_out(v) is empty, the left hand side is equal to zero and the right hand side is positive, so the claim follows trivially. Thus, we can assume that v is not significant and is not a leaf.

Define P : X → R^+ as follows. For any x′ ∈ X,

P(x′) = 0                if x′ ∈ Sig(v),
P(x′) = P_{x|v}(x′)      if x′ ∉ Sig(v).

Note that by the definition of Sig(v), for any x′ ∈ X,

P(x′) ≤ 2^{2ℓ+2r} · 2^{-n}.    (2.4)

Define f : X → R^+ as follows. For any x′ ∈ X,

f(x′) = P(x′) · P_{x|s}(x′).

By Claim 2.3.6 and Equation (2.4),

‖f‖_2 ≤ 2^{2ℓ+2r} · 2^{-n} · ‖P_{x|s}‖_2 ≤ 2^{2ℓ+2r} · 2^{-n} · 4 · 2^ℓ · 2^{-n} = 2^{3ℓ+2r+2} · 2^{-2n}.    (2.5)

By Claim 2.3.4, for any edge e ∈ Γ_out(v), labeled by (a, b), for any x′ ∈ X,

P_{x|e}(x′) = 0                   if M(a, x′) ≠ b,
P_{x|e}(x′) = P(x′) · c_e^{-1}    if M(a, x′) = b,

where c_e satisfies

c_e ≥ 1/2 − 2 · 2^{-2r}.

Therefore, for any edge e ∈ Γ_out(v), labeled by (a, b), for any x′ ∈ X,

P_{x|e}(x′) · P_{x|s}(x′) = 0                  if M(a, x′) ≠ b,
P_{x|e}(x′) · P_{x|s}(x′) = f(x′) · c_e^{-1}   if M(a, x′) = b,

and hence we have

⟨P_{x|e}, P_{x|s}⟩ = E_{x′ ∈_R X}[P_{x|e}(x′) · P_{x|s}(x′)] = E_{x′ ∈_R X}[f(x′) · c_e^{-1} · 1_{{x′ ∈ X : M(a,x′)=b}}]
    = E_{x′ ∈_R X}[f(x′) · c_e^{-1} · (1 + b·M(a,x′))/2] = (‖f‖_1 + b · ⟨M_a, f⟩) · (2c_e)^{-1}
    < (‖f‖_1 + |⟨M_a, f⟩|) · (1 + 2^{-2r+3})    (2.6)

(where the last inequality holds by the bound that we have on c_e, because we assume that k′, ℓ′, r′ and thus r are sufficiently large).

We will now consider two cases:

Case I: ‖f‖_1 < 2^{-2n}

In this case, we bound |⟨M_a, f⟩| ≤ ‖f‖_1 (since f is non-negative and the entries of M are in {−1, 1}) and (1 + 2^{-2r+3}) < 2 (since we assume that k′, ℓ′, r′ and thus r are sufficiently large) and obtain for any edge e ∈ Γ_out(v),

⟨P_{x|e}, P_{x|s}⟩ < 4 · 2^{-2n}.

Since Σ_{e ∈ Γ_out(v)} Pr(e)/Pr(v) ≤ 1, Claim 2.3.9 follows, as the left hand side of the claim is smaller than the second term on the right hand side.


Case II: ‖f‖_1 ≥ 2^{-2n}

For every a ∈ A, define

t(a) = |⟨M_a, f⟩| / ‖f‖_1.

By Equation (2.6),

⟨P_{x|e}, P_{x|s}⟩^k < ‖f‖_1^k · (1 + t(a))^k · (1 + 2^{-2r+3})^k.    (2.7)

Note that by the definitions of P and f,

‖f‖_1 = E_{x′ ∈_R X}[f(x′)] = ⟨P, P_{x|s}⟩ ≤ ⟨P_{x|v}, P_{x|s}⟩.

Note also that for every a ∈ A, there is at most one edge e_{(a,1)} ∈ Γ_out(v), labeled by (a, 1), and at most one edge e_{(a,−1)} ∈ Γ_out(v), labeled by (a, −1), and we have

Pr(e_{(a,1)})/Pr(v) + Pr(e_{(a,−1)})/Pr(v) ≤ 1/|A|,

since 1/|A| is the probability that the next sample read by the program is a. Thus, summing over all e ∈ Γ_out(v), by Equation (2.7),

Σ_{e ∈ Γ_out(v)} (Pr(e)/Pr(v)) · ⟨P_{x|e}, P_{x|s}⟩^k < ⟨P_{x|v}, P_{x|s}⟩^k · E_{a ∈_R A}[(1 + t(a))^k] · (1 + 2^{-2r+3})^k.    (2.8)

It remains to bound

E_{a ∈_R A}[(1 + t(a))^k],    (2.9)

using the properties of the matrix M and the bounds on the ℓ_2 versus ℓ_1 norms of f.

By Equation (2.5), the assumption that ‖f‖_1 ≥ 2^{-2n}, Equation (2.1) and Equation (2.2), we get

‖f‖_2 / ‖f‖_1 ≤ 2^{3ℓ+2r+2} ≤ 2^{ℓ′}.


Since M is a (k′, ℓ′)-L2-extractor with error 2^{-r′}, there are at most 2^{-k′} · |A| rows a ∈ A with t(a) = |⟨M_a, f⟩| / ‖f‖_1 ≥ 2^{-r′}. We bound the expectation in Equation (2.9) by splitting it into two sums:

E_{a ∈_R A}[(1 + t(a))^k] = (1/|A|) · Σ_{a : t(a) ≤ 2^{-r′}} (1 + t(a))^k + (1/|A|) · Σ_{a : t(a) > 2^{-r′}} (1 + t(a))^k.    (2.10)

We bound the first sum in Equation (2.10) by (1 + 2^{-r′})^k. As for the second sum in Equation (2.10), we know that it is a sum of at most 2^{-k′} · |A| elements, and since for every a ∈ A we have t(a) ≤ 1, we have

(1/|A|) · Σ_{a : t(a) > 2^{-r′}} (1 + t(a))^k ≤ 2^{-k′} · 2^k ≤ 2^{-2r}

(where in the last inequality we used Equations (2.1) and (2.2)). Overall, using Equation (2.1) again, we get

E_{a ∈_R A}[(1 + t(a))^k] ≤ (1 + 2^{-r′})^k + 2^{-2r} ≤ (1 + 2^{-2r})^{k+1}.    (2.11)

Substituting Equation (2.11) into Equation (2.8), we obtain

Σ_{e ∈ Γ_out(v)} (Pr(e)/Pr(v)) · ⟨P_{x|e}, P_{x|s}⟩^k < ⟨P_{x|v}, P_{x|s}⟩^k · (1 + 2^{-2r})^{k+1} · (1 + 2^{-2r+3})^k
                                                  < ⟨P_{x|v}, P_{x|s}⟩^k · (1 + 2^{-r})^k

(where the last inequality uses the assumption that r is sufficiently large). This completes the proof of Claim 2.3.9.

Claim 2.3.10. For every i ∈ {1, . . . , m},

Z′_i ≤ Z_{i−1} · (1 + 2^{-r})^k + (2^{-2n+2})^k.

Proof. By Claim 2.3.9,

Z′_i = Σ_{e ∈ Γ_i} Pr(e) · ⟨P_{x|e}, P_{x|s}⟩^k = Σ_{v ∈ L_{i−1}} Pr(v) · Σ_{e ∈ Γ_out(v)} (Pr(e)/Pr(v)) · ⟨P_{x|e}, P_{x|s}⟩^k
    ≤ Σ_{v ∈ L_{i−1}} Pr(v) · ( ⟨P_{x|v}, P_{x|s}⟩^k · (1 + 2^{-r})^k + (2^{-2n+2})^k )
    = Z_{i−1} · (1 + 2^{-r})^k + Σ_{v ∈ L_{i−1}} Pr(v) · (2^{-2n+2})^k
    ≤ Z_{i−1} · (1 + 2^{-r})^k + (2^{-2n+2})^k.

Claim 2.3.11. For every i ∈ {1, . . . , m},

Z_i ≤ Z′_i.

Proof. For any v ∈ L_i, let Γ_in(v) be the set of all edges e ∈ Γ_i that are going into v. Note that

Σ_{e ∈ Γ_in(v)} Pr(e) = Pr(v).

By the law of total probability, for every v ∈ L_i and every x′ ∈ X,

P_{x|v}(x′) = Σ_{e ∈ Γ_in(v)} (Pr(e)/Pr(v)) · P_{x|e}(x′),

and hence

⟨P_{x|v}, P_{x|s}⟩ = Σ_{e ∈ Γ_in(v)} (Pr(e)/Pr(v)) · ⟨P_{x|e}, P_{x|s}⟩.

Thus, by Jensen's inequality,

⟨P_{x|v}, P_{x|s}⟩^k ≤ Σ_{e ∈ Γ_in(v)} (Pr(e)/Pr(v)) · ⟨P_{x|e}, P_{x|s}⟩^k.

Summing over all v ∈ L_i, we get

Z_i = Σ_{v ∈ L_i} Pr(v) · ⟨P_{x|v}, P_{x|s}⟩^k ≤ Σ_{v ∈ L_i} Pr(v) · Σ_{e ∈ Γ_in(v)} (Pr(e)/Pr(v)) · ⟨P_{x|e}, P_{x|s}⟩^k
    = Σ_{e ∈ Γ_i} Pr(e) · ⟨P_{x|e}, P_{x|s}⟩^k = Z′_i.

Claim 2.3.12. For every i ∈ {1, . . . , m},

Z_i ≤ 2^{4k+2r} · 2^{-2k·n}.

Proof. By Claim 2.3.8, Z_0 = (2^{-2n})^k. By Claim 2.3.10 and Claim 2.3.11, for every i ∈ {1, . . . , m},

Z_i ≤ Z_{i−1} · (1 + 2^{-r})^k + (2^{-2n+2})^k.

Hence, for every i ∈ {1, . . . , m},

Z_i ≤ (2^{-2n+2})^k · m · (1 + 2^{-r})^{km}.

Since m = 2^r,

Z_i ≤ 2^{-2k·n} · 2^{2k} · 2^r · e^k ≤ 2^{-2k·n} · 2^{4k+2r}.

Proof of Lemma 2.3.1

We can now complete the proof of Lemma 2.3.1. Assume that s is in layer-i of B. By Claim 2.3.7,

Z_i ≥ Pr(s) · ⟨P_{x|s}, P_{x|s}⟩^k > Pr(s) · (2^{2ℓ} · 2^{-2n})^k = Pr(s) · 2^{2ℓ·k} · 2^{-2k·n}.

On the other hand, by Claim 2.3.12,

Z_i ≤ 2^{4k+2r} · 2^{-2k·n}.

Thus, using Equation (2.1) and Equation (2.2), we get

Pr(s) ≤ 2^{4k+2r} · 2^{-2ℓ·k} ≤ 2^{4k′} · 2^{-(2γ^2/3)·(k′ℓ′)}.

Recall that we assumed that the width of B is at most 2^{ck′ℓ′} for some constant c < 2/3, and that the length of B is at most 2^r. Recall that we fixed γ such that 2γ^2/3 > c. Taking a union bound over at most 2^r · 2^{ck′ℓ′} ≤ 2^{k′} · 2^{ck′ℓ′} significant vertices of B, we conclude that the probability that T reaches any significant vertex is at most 2^{-Ω(k′ℓ′)}. Since we assume that k′ and ℓ′ are sufficiently large, 2^{-Ω(k′ℓ′)} is certainly at most 2^{-k′}, which is at most 2^{-r}.

2.3.4 Lower Bounds for Weak Learning

In this section, we show that under the same conditions of Theorem 2.3.1, the branching program cannot even weakly learn the function. That is, we show that the branching program cannot output a hypothesis h : A → {−1, 1} with a non-negligible correlation with the function defined by the true unknown x. We change the definition of the branching program and associate with each leaf v a hypothesis h_v : A → {−1, 1}. We measure the success as the correlation between h_v and the function defined by the true unknown x.

Formally, for any x ∈ X, let M(x) : A → {−1, 1} be the function corresponding to the x-th column of M. We define the value of the program as E[ |⟨h_v, M(x)⟩| ], where the expectation is over x, a_1, . . . , a_m (recall that x is uniformly distributed over X and a_1, . . . , a_m are uniformly distributed over A, and for every t, b_t = M(a_t, x)). The following claim bounds the expected correlation between h_v and M(x), conditioned on reaching a non-significant leaf.

Claim 2.3.13. If v is a non-significant leaf, then

E_x[ |⟨h_v, M(x)⟩| | E_v ] ≤ O(2^{-r/2}).

Proof. We expand the expected correlation between h_v and M(x), squared:

E_x[ |⟨h_v, M(x)⟩| | E_v ]^2 ≤ E_x[ ⟨h_v, M(x)⟩^2 | E_v ] = Σ_{x′∈X} P_{x|v}(x′) · ⟨h_v, M(x′)⟩^2
    = Σ_{x′∈X} P_{x|v}(x′) · E_{a,a′ ∈_R A}[ h_v(a) · M(a, x′) · h_v(a′) · M(a′, x′) ]
    = E_{a,a′ ∈_R A}[ h_v(a) · h_v(a′) · Σ_{x′∈X} P_{x|v}(x′) · M(a, x′) · M(a′, x′) ]
    ≤ E_{a,a′ ∈_R A}[ | Σ_{x′∈X} P_{x|v}(x′) · M(a, x′) · M(a′, x′) | ]
    = E_{a ∈_R A}[ E_{a′ ∈_R A}[ | Σ_{x′∈X} P_{x|v}(x′) · M(a, x′) · M(a′, x′) | ] ].

Next, we show that E_{a′ ∈_R A}[ | Σ_{x′∈X} P_{x|v}(x′) · M(a, x′) · M(a′, x′) | ] ≤ 4 · 2^{-r} for any a ∈ A.

Fix a ∈ A. Let q_a : X → R be the function defined by q_a(x′) = P_{x|v}(x′) · M(a, x′) for x′ ∈ X. Since |q_a(x′)| = |P_{x|v}(x′)| for any x′ ∈ X and since v is a non-significant vertex, we get

‖q_a‖_2 = ‖P_{x|v}‖_2 ≤ 2^ℓ · 2^{-n}  and  ‖q_a‖_1 = ‖P_{x|v}‖_1 = 2^{-n}.

Hence, ‖q_a‖_2 / ‖q_a‖_1 ≤ 2^ℓ. We would like to use the fact that M is a (k′, ℓ′)-L2-extractor with error 2^{-r′} to show that there aren't many rows of M with a large inner product with q_a. However, q_a can get negative values and the definition of L2-extractors only handles non-negative functions f : X → R^+. To solve this issue, we use the following lemma, proved in Section 2.4.1.

Lemma 2.3.2. Suppose that M : A × X → {−1, 1} is a (k′, ℓ′)-L2-extractor with error at most 2^{-r}. Let f : X → R be any function (i.e., f can get negative values) with ‖f‖_2 / ‖f‖_1 ≤ 2^{ℓ′−r}. Then, there are at most 2 · 2^{-k′} · |A| rows a ∈ A with |⟨M_a, f⟩| / ‖f‖_1 ≥ 2 · 2^{-r}.

Since M is a (k′, ℓ′)-L2-extractor with error at most 2^{-r′}, and since r < r′, M is also a (k′, ℓ′)-L2-extractor with error at most 2^{-r}. Since ‖q_a‖_2 / ‖q_a‖_1 ≤ 2^ℓ ≤ 2^{ℓ′−r}, we can apply Lemma 2.3.2 with f = q_a and error 2^{-r}. We get that there are at most 2 · 2^{-k′} · |A| rows a′ ∈ A with |⟨q_a, M_{a′}⟩| / ‖q_a‖_1 ≥ 2 · 2^{-r}. Thus,

E_{a′ ∈_R A}[ | Σ_{x′∈X} q_a(x′) · M(a′, x′) | ] = E_{a′ ∈_R A}[ |⟨q_a, M_{a′}⟩| / ‖q_a‖_1 ] ≤ 2 · 2^{-k′} + 2 · 2^{-r} ≤ 4 · 2^{-r}.

Overall, we get that E_x[ |⟨h_v, M(x)⟩| | E_v ]^2 ≤ 4 · 2^{-r}. Taking square roots of both sides of the last inequality completes the proof.

Lemma 2.3.1, Claim 2.3.1 and Claim 2.3.2 show that the probability that T stops before reaching a leaf is at most O(2^{-r}). Combining this with Claim 2.3.13, we get that (under the same conditions of Theorem 2.3.1)

E[ |⟨h_v, M(x)⟩| ] ≤ Pr[T stops] + O(2^{-r/2}) ≤ O(2^{-r/2}),

where the expectation and probability are taken over x ∈_R X and a_1, . . . , a_m ∈_R A. We get the following theorem as a conclusion.

Theorem 2.3.2. Let 1/100 < c < 2/3. Fix γ to be such that 3c/2 < γ^2 < 1.

Let X, A be two finite sets. Let n = log_2 |X|. Let M : A × X → {−1, 1} be a matrix which is a (k′, ℓ′)-L2-extractor with error 2^{-r′}, for sufficiently large k′, ℓ′ and r′ (that is, larger than some constant that depends on γ), where ℓ′ ≤ n. Let

r := min{ r′/2 , (1−γ)k′/2 , (1−γ)ℓ′/2 − 1 }.

Let B be a branching program of length at most 2^r and width at most 2^{c·k′·ℓ′} for the learning problem that corresponds to the matrix M. Then,

E[ |⟨h_v, M(x)⟩| ] ≤ O(2^{-r/2}).

In particular, the probability that the hypothesis agrees with the function defined by the true unknown x on more than 1/2 + 2^{-r/4} of the inputs is at most O(2^{-r/4}).

2.3.5 Main Corollary

Corollary 2.3.3. There exists a sufficiently small constant c > 0 such that the following holds. Let X, A be two finite sets. Let M : A × X → {−1, 1} be a matrix. Assume that k, ℓ, r ∈ N are such that any submatrix of M of at least 2^{-k} · |A| rows and at least 2^{-ℓ} · |X| columns has a bias of at most 2^{-r}.

Let B be a branching program of length at most 2^{c·r} and width at most 2^{c·k·ℓ} for the learning problem that corresponds to the matrix M. Then, the success probability of B is at most 2^{-Ω(r)}.

Proof. By Lemma 2.4.2 (stated and proved below), there exist k′ = k + Ω(r), ℓ′ = ℓ + Ω(r), and r′ = Ω(r), such that any submatrix of M of at least 2^{-k′} · |A| rows and at least 2^{-ℓ′} · |X| columns has a bias of at most 2^{-r′}.

By Lemma 2.4.4 (stated and proved below), M is an (Ω(k) + Ω(r), Ω(ℓ) + Ω(r))-L2-extractor with error 2^{-Ω(r)}.

The corollary follows by Theorem 2.3.1.
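As a worked instantiation, the random-matrix example from Section 2.0.1 plugs in directly: there, with very high probability, one may take k = (1/2) log |A|, ℓ = (1/2) log |X| and r = Ω(min{k, ℓ}), and Corollary 2.3.3 then gives that any learner for the corresponding problem requires either a memory of size Ω((log |A|) · (log |X|)) or (min{|A|, |X|})^{Ω(1)} samples.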

2.4 Applications

Before we dive into the applications of Theorem 2.3.1 and Corollary 2.3.3, we establish certain useful lemmas in the next section.

2.4.1 Some Useful Lemmas

Handling Negative Functions   In the following lemma, we show that, up to a small loss in parameters, an L2-extractor has similar guarantees for any function f : X → R with bounded ℓ_2-vs-ℓ_1 norm, regardless of whether or not f is non-negative.

Lemma 2.4.1. Suppose that M : A × X → {−1, 1} is a (k′, ℓ′)-L2-extractor with error at most 2^{-r}. Let f : X → R be any function with ‖f‖_2 / ‖f‖_1 ≤ 2^{ℓ′−r}. Then, there are at most 2 · 2^{-k′} · |A| rows a ∈ A with |⟨M_a, f⟩| / ‖f‖_1 ≥ 2 · 2^{-r}.


Proof. Let f_+, f_− : X → R^+ be the non-negative functions defined by

f_+(x) = f(x) if f(x) > 0, and 0 otherwise;    f_−(x) = |f(x)| if f(x) < 0, and 0 otherwise,

for x ∈ X. We have f(x) = f_+(x) − f_−(x) for all x ∈ X. We split into two cases:

1. If ‖f_+‖_1 < 2^{-r} · ‖f‖_1, then |⟨M_a, f_+⟩| ≤ ‖f_+‖_1 < 2^{-r} · ‖f‖_1 for all a ∈ A.

2. If ‖f_+‖_1 ≥ 2^{-r} · ‖f‖_1, then f_+ is a non-negative function with

‖f_+‖_2 / ‖f_+‖_1 ≤ ‖f‖_2 / (‖f‖_1 · 2^{-r}) ≤ 2^{ℓ′}.

Thus, we may use the assumption that M is an L2-extractor to deduce that there are at most 2^{-k′} · |A| rows a ∈ A with |⟨M_a, f_+⟩| ≥ ‖f_+‖_1 · 2^{-r}.

In both cases, there are at most 2^{-k′} · |A| rows a ∈ A with |⟨M_a, f_+⟩| ≥ ‖f‖_1 · 2^{-r}. Similarly, there are at most 2^{-k′} · |A| rows a ∈ A with |⟨M_a, f_−⟩| ≥ ‖f‖_1 · 2^{-r}. Thus, for all but at most 2 · 2^{-k′} · |A| of the rows a ∈ A we have

|⟨M_a, f⟩| ≤ |⟨M_a, f_+⟩| + |⟨M_a, f_−⟩| < 2 · ‖f‖_1 · 2^{-r}.

Error vs. Min-Entropy

Lemma 2.4.2. Let M : A × X → {−1, 1} be a matrix. Let k, ℓ, r be such that any submatrix of M of at least 2^{-k} · |A| rows and at least 2^{-ℓ} · |X| columns has a bias of at most 2^{-r}.

Then, there exist k′ = k + Ω(r), ℓ′ = ℓ + Ω(r), and r′ = Ω(r), such that any submatrix of M of at least 2^{-k′} · |A| rows and at least 2^{-ℓ′} · |X| columns has a bias of at most 2^{-r′}.

Proof. Assume without loss of generality that k, ℓ, r are larger than some sufficiently large absolute constant.

We will show that there exists k′ = k + Ω(r) such that any submatrix of M of at least 2^{-k′} · |A| rows and at least 2^{-ℓ} · |X| columns has a bias of at most 2^{-Ω(r)}. The proof of the lemma then follows by applying the same claim again on the transposed matrix.

Let k′ = k + r/10. Assume for a contradiction that there exist T ⊆ A of size at least 2^{-k′} · |A| and S ⊆ X of size at least 2^{-ℓ} · |X|, such that the bias of T × S is larger than, say, 2^{-r/2}. By the assumption of the lemma, |T| < 2^{-k} · |A|.

Let T′ be an arbitrary set of 2^{-k} · |A| rows in A \ T. By the assumption of the lemma, the bias of T′ × S is at most 2^{-r}. Therefore, the bias of (T′ ∪ T) × S is at least

(|T| / |T′ ∪ T|) · 2^{-r/2} − (|T′| / |T′ ∪ T|) · 2^{-r} ≥ (1/2) · 2^{-r/10} · 2^{-r/2} − 2^{-r} > 2^{-r}.

Thus, (T′ ∪ T) × S contradicts the assumption of the lemma.

L2-Extractors and L∞-Extractors   We will show that M being an L2-Extractor is equivalent to M being an L∞-Extractor (up to constants).

Lemma 2.4.3. If a matrix M : A × X → {−1, 1} is a (k, ℓ)-L2-Extractor with error 2^{-r}, then M is also a (k − ξ, 2ℓ ∼ (min{r, ξ} − 1))-L∞-Extractor, for all 0 < ξ < k.

Taking ξ = k/2, we get that if M is a (k, ℓ)-L2-Extractor with error 2^{-r}, then M is also an (Ω(k), Ω(ℓ) ∼ Ω(min{r, k}))-L∞-Extractor.

Proof. We pick a ξ with 0 < ξ < k. To prove that M is a (k − ξ, 2ℓ ∼ (min{r, ξ} − 1))-L∞-Extractor, it suffices to prove the statement of the L∞-Extractor for any two uniform distributions over subsets A_1 ⊆ A and X_1 ⊆ X of size at least |A|/2^{k−ξ} and |X|/2^{2ℓ} respectively. This follows from the fact that any distribution with min-entropy at least h can be written as a convex combination of uniform distributions on sets of size at least 2^h [CG88].

For a distribution p_x which is uniform over a subset X_1 ⊆ X of size at least |X|/2^{2ℓ},

‖p_x‖_2 / ‖p_x‖_1 = (|X| / |X_1|)^{1/2} ≤ 2^ℓ.


Using the fact that M is a (k, ℓ)-L2-Extractor with error 2^{-r}, we know that there are at most |A|/2^k rows a with |(M · p_x)_a| ≥ 2^{-r}. Using the fact that p_a is a uniform distribution over a set A_1 of size at least |A|/2^{k−ξ}, we get

| Σ_{a′∈A} Σ_{x′∈X} p_a(a′) · p_x(x′) · M(a′, x′) | ≤ (1/|A_1|) · Σ_{a′∈A_1} |(M · p_x)_{a′}|
    ≤ (1/|A_1|) · ( |A|/2^k + |A_1| · 2^{-r} )
    ≤ 2^{-ξ} + 2^{-r}.

This proves that M is a (k − ξ, 2ℓ ∼ (min{r, ξ} − 1))-L∞-Extractor, for all 0 < ξ < k.

Lemma 2.4.4. If a matrix M : A × X → {−1, 1} is a (k, ℓ ∼ r)-L∞-Extractor, then M is also a (k − 1, (ℓ − ξ − 1)/2)-L2-Extractor with error 2^{-r} + 2^{-ξ+1}, for all 1 ≤ ξ ≤ ℓ − 1.

Taking ξ = ℓ/2, we get that if M is a (k, ℓ ∼ r)-L∞-Extractor, then M is also an (Ω(k), Ω(ℓ))-L2-Extractor with error 2^{-Ω(min{r,ℓ})}.

In this proof, we use the following notation. For two non-negative functions P, Q : X → R, we denote by dist(P, Q) the ℓ_1-distance between the two functions, that is,

dist(P, Q) = Σ_{x∈X} |P(x) − Q(x)|.

Note that dist(P, Q) = ‖P − Q‖_1 · |X|.

Proof. We want to prove that for any 1 ≤ ξ ≤ ℓ − 1 and any non-negative function f : X → R with ‖f‖_2 / ‖f‖_1 ≤ 2^{(ℓ−ξ−1)/2}, there are at most 2 · 2^{-k} · |A| rows a ∈ A with |⟨M_a, f⟩| / ‖f‖_1 ≥ 2^{-r} + 2^{-ξ+1}.

Let's assume that there exists a non-negative function f : X → R for which the last statement is not true. Let f_p be the probability distribution on X defined by f_p(x) = f(x) / Σ_x f(x) = f(x) / (|X| · ‖f‖_1). Then,

‖f_p‖_2 = ‖f‖_2 / (|X| · ‖f‖_1) ≤ 2^{(ℓ−ξ−1)/2} / |X|

⟹ ( Σ_x f_p(x)^2 / |X| )^{1/2} ≤ 2^{(ℓ−ξ−1)/2} / |X|

⟹ Σ_x f_p(x)^2 ≤ 2^{ℓ−ξ−1−log(|X|)}.

Thus, there is strictly less than 2^{-ξ} probability mass on elements x with f_p(x) > 2^{ℓ−log(|X|)−1}. Let f̄_p : X → R be the trimmed function that takes the value f_p(x) at x when f_p(x) ≤ 2^{ℓ−log(|X|)−1}, and 0 otherwise. We define a new probability distribution p_x : X → [0, 1] as

p_x(x′) = f̄_p(x′) + (1 − Σ_{x′} f̄_p(x′)) / |X|.

Informally, we are just redistributing the probability mass removed from f_p. It is easy to see that the new probability distribution p_x has min-entropy at least log(|X|) − ℓ, and

dist(p_x, f_p) < 2^{-ξ+1},    (2.12)

as dist(p_x, f_p) ≤ dist(p_x, f̄_p) + dist(f̄_p, f_p) < 2^{-ξ} + 2^{-ξ}.

Let A_bad be the set of rows a ∈ A with |⟨M_a, f⟩| / ‖f‖_1 = |(M · f_p)_a| ≥ 2^{-r} + 2^{-ξ+1}. By our assumption, |A_bad| ≥ 2 · 2^{-k} · |A|. Let A_1 and A_2 be the sets of rows a with (M · f_p)_a ≥ 2^{-r} + 2^{-ξ+1} and (M · f_p)_a ≤ −(2^{-r} + 2^{-ξ+1}), respectively. As A_bad = A_1 ∪ A_2, w.l.o.g. |A_1| ≥ |A_bad|/2 ≥ 2^{-k} · |A| (else we can work with A_2 and the rest of the argument follows similarly). Let p_a be the uniform probability distribution over the set A_1. Clearly p_a has min-entropy at least log(|A|) − k.

As (M · f_p)_a ≥ 2^{-r} + 2^{-ξ+1} for the entire support of p_a, we get

| E_{a ∈_R A_1} [(M · f_p)_a] | ≥ 2^{-r} + 2^{-ξ+1}.    (2.13)

As the entries of M have magnitude at most 1, we have

| E_{a ∈_R A_1} [(M · (p_x − f_p))_a] | ≤ E_{a ∈_R A_1} [ Σ_{x′∈X} |p_x(x′) − f_p(x′)| ] = dist(p_x, f_p).    (2.14)

Combining Equations (2.12), (2.13) and (2.14) gives

| E_{a ∈_R A_1} [(M · p_x)_a] | ≥ 2^{-r} + 2^{-ξ+1} − dist(p_x, f_p) > 2^{-r}.

Thus, we have two distributions p_a and p_x with min-entropy at least log(|A|) − k and log(|X|) − ℓ respectively, contradicting the fact that M is a (k, ℓ ∼ r)-L∞-Extractor. Hence no such f exists, and M is a (k − 1, (ℓ − ξ − 1)/2)-L2-Extractor with error 2^{-r} + 2^{-ξ+1}.

Transpose

Lemma 2.4.5. If a matrix M : A × X → {−1, 1} is a (k, ℓ)-L2-Extractor with error 2^{-r}, then the transposed matrix M^t is an (Ω(ℓ), Ω(k))-L2-Extractor with error 2^{-Ω(min{r,k})}.

Proof. As M is a (k, ℓ)-L2-Extractor with error 2^{-r}, using Lemma 2.4.3, M is also an (Ω(k), Ω(ℓ) ∼ Ω(min{r, k}))-L∞-Extractor. The definition of an L∞-Extractor is symmetric in its rows and columns, and hence M^t is also an (Ω(ℓ), Ω(k) ∼ Ω(min{r, k}))-L∞-Extractor. Now, using Lemma 2.4.4 on M^t, we get that M^t is also an (Ω(ℓ), Ω(k))-L2-Extractor with error 2^{-Ω(min{r,k})}.

Lower Bounds for Almost Orthogonal Vectors   We show that a matrix M : A × X → {−1, 1} whose rows are almost orthogonal is a good L2-extractor. A similar technique was used in many previous works (see for example [GS71, CG88, Alo95, Raz05]). Motivated by the applications (e.g., learning sparse parities and learning from low-degree equations) in which some pairs of rows are not almost orthogonal, we relax this notion and only require that almost all pairs of rows are almost orthogonal. We formalize this in the definition of (ε, δ)-almost orthogonal vectors.

Definition 2.4.1. (ε, δ)-almost orthogonal vectors: Vectors v_1, . . . , v_m ∈ {−1, 1}^X are (ε, δ)-almost orthogonal if for any i ∈ [m] there are at most δ · m indices j ∈ [m] with |⟨v_i, v_j⟩| > ε.

Definition 2.4.1 generalizes the definition of an (ε, δ)-biased set from [KRT17].


Definition 2.4.2. $(\varepsilon,\delta)$-biased set ([KRT17]): A set $T \subseteq \{0,1\}^n$ is $(\varepsilon,\delta)$-biased if there are at most $\delta \cdot 2^n$ elements $a \in \{0,1\}^n$ with $\left|\mathop{\mathbb{E}}_{x \in_R T}[(-1)^{a \cdot x}]\right| > \varepsilon$ (where $a \cdot x$ denotes the inner product of $a$ and $x$, modulo 2).

Definition 2.4.2 is a special case of Definition 2.4.1, where the vectors corresponding to a set $T \subseteq \{0,1\}^n$ are defined as follows. With every $a \in \{0,1\}^n$, we associate the vector $v_a$ of length $|T|$ whose $x$-th entry equals $(-1)^{a \cdot x}$ for any $x \in T$. Indeed, $T$ is $(\varepsilon,\delta)$-biased iff the vectors $\{v_a : a \in \{0,1\}^n\}$ are $(\varepsilon,\delta)$-almost orthogonal.
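To make the correspondence concrete, the following small Python sketch (an illustration only, not part of the formal development; the set $T$, the parameter values and the helper names are chosen for illustration) computes the bias profile of a set $T \subseteq \{0,1\}^n$ by brute force and checks the condition of Definition 2.4.2.

```python
# Illustrative sketch: checking whether a set T in {0,1}^n is (eps, delta)-biased,
# by enumerating all a in {0,1}^n. The inner product <v_a, v_{a'}> of the
# associated vectors (normalized by |T|) equals the bias of a XOR a', which is
# why Definitions 2.4.1 and 2.4.2 coincide.
from itertools import product

def inner_mod2(a, x):
    return sum(ai & xi for ai, xi in zip(a, x)) % 2

def bias_profile(T, n):
    """For every a in {0,1}^n, return E_{x in T}[(-1)^{a.x}]."""
    return {a: sum((-1) ** inner_mod2(a, x) for x in T) / len(T)
            for a in product((0, 1), repeat=n)}

def is_eps_delta_biased(T, n, eps, delta):
    """At most delta * 2^n vectors a may have |bias| exceeding eps."""
    bad = sum(1 for b in bias_profile(T, n).values() if abs(b) > eps)
    return bad <= delta * (2 ** n)

# Example: the slice of Hamming weight 2 (a toy instance of the sets T_l below).
n = 6
T = [x for x in product((0, 1), repeat=n) if sum(x) == 2]
print(is_eps_delta_biased(T, n, eps=0.5, delta=0.1))
```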

Lemma 2.4.6 (Generalized Johnson's Bound). Let $M \in \{-1,1\}^{A \times X}$ be a matrix. Assume that $\{M_a\}_{a \in A}$ are $(\varepsilon,\delta)$-almost orthogonal vectors. Then, for any $\gamma > \sqrt{\varepsilon}$ and any non-negative function $f : X \to \mathbb{R}^+$, we have at most $\left(\frac{\delta}{\gamma^2 - \varepsilon}\right) \cdot |A|$ rows $a \in A$ with
$$|\langle M_a, f\rangle| \ge \gamma \cdot \|f\|_2.$$
In particular, fixing $\gamma = \sqrt{\varepsilon + \delta^{1/2}}$, we have that $M$ is a $(k,\ell)$-$L_2$-extractor with error $2^{-r}$, for $k = \frac{1}{2}\log(1/\delta)$ and $\ell = r = \Omega\!\left(\min\{\log(1/\varepsilon), \log(1/\delta)\}\right)$.

Proof. Fix $\gamma > \sqrt{\varepsilon}$. Let $I^+$ (respectively, $I^-$) be the rows in $A$ with high correlation (respectively, anti-correlation) with $f$. More precisely:
$$I^+ := \{i \in A : \langle M_i, f\rangle > \gamma \cdot \|f\|_2\},$$
$$I^- := \{i \in A : -\langle M_i, f\rangle > \gamma \cdot \|f\|_2\}.$$
Let $I = I^+ \cup I^-$. Define $z = \sum_{i \in I^+} M_i - \sum_{i \in I^-} M_i$. We consider the inner product of $f$ and $z$. We have
$$(|I| \cdot \gamma \cdot \|f\|_2)^2 < \langle f, z\rangle^2 = \left(\mathop{\mathbb{E}}_{x \in_R X}\left[f(x) \cdot \Big(\sum_{i \in I^+} M_{i,x} - \sum_{i \in I^-} M_{i,x}\Big)\right]\right)^2$$
$$\le \mathop{\mathbb{E}}_{x \in_R X}\left[f(x)^2\right] \cdot \mathop{\mathbb{E}}_{x \in_R X}\left[\Big(\sum_{i \in I^+} M_{i,x} - \sum_{i \in I^-} M_{i,x}\Big)^2\right] \qquad \text{(Cauchy-Schwarz)}$$
$$\le \|f\|_2^2 \cdot \sum_{i \in I}\sum_{i' \in I} |\langle M_i, M_{i'}\rangle|.$$
For any fixed $i \in I$, we break the inner sum $\sum_{i' \in I} |\langle M_i, M_{i'}\rangle|$ according to whether or not $|\langle M_i, M_{i'}\rangle| > \varepsilon$. By the assumption on $M$, there are at most $\delta \cdot |A|$ rows $i'$ for which the inner product is larger than $\varepsilon$. For these rows, the inner product is at most 1. Thus, we get
$$(|I| \cdot \gamma \cdot \|f\|_2)^2 < \|f\|_2^2 \cdot \sum_{i \in I}\sum_{i' \in I} |\langle M_i, M_{i'}\rangle| \le \|f\|_2^2 \cdot |I| \cdot (|A| \cdot \delta + \varepsilon \cdot |I|).$$
That is,
$$|I| \cdot \gamma^2 < |A| \cdot \delta + \varepsilon \cdot |I|.$$
Rearranging gives
$$|I| < \left(\frac{\delta}{\gamma^2 - \varepsilon}\right) \cdot |A|,$$
which completes the first part of the proof.

We turn to the "in particular" part. Assume that $\frac{\|f\|_2}{\|f\|_1} \le 2^{\ell}$. Thus, we proved that there are at most $\left(\frac{\delta}{\gamma^2-\varepsilon}\right) \cdot |A|$ rows $a \in A$ such that
$$|\langle M_a, f\rangle| \ge \gamma \cdot 2^{\ell} \cdot \|f\|_1.$$
Fixing $\gamma = \sqrt{\varepsilon + \delta^{1/2}}$, $k = \log(1/\delta^{1/2})$, and $\ell = r = \frac{1}{2}\log(1/\gamma)$, we get that $M$ is a $(k,\ell)$-$L_2$-extractor with error $2^{-r}$ (Definition 2.1.1). Finally, note that $\ell = r = \Omega\!\left(\min\{\log(1/\delta), \log(1/\varepsilon)\}\right)$, which completes the proof.
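The bound in Lemma 2.4.6 can also be sanity-checked numerically. The sketch below (an illustration only; the random matrix, the parameter values and the test function are assumptions made for the example) estimates $(\varepsilon,\delta)$ for the rows of a random sign matrix and compares the number of rows with large correlation against $\left(\frac{\delta}{\gamma^2-\varepsilon}\right)\cdot|A|$.

```python
# Illustrative numeric sanity check of Lemma 2.4.6. Norms and inner products
# are taken w.r.t. the uniform distribution over X, matching the conventions
# used throughout this chapter.
import numpy as np

rng = np.random.default_rng(0)
A_size, X_size = 2000, 256
M = rng.choice([-1, 1], size=(A_size, X_size))   # random rows are nearly orthogonal w.h.p.
f = rng.random(X_size)                           # an arbitrary non-negative test function

# Estimate (eps, delta) for the rows of M: for each i, the fraction of j
# (including j = i, as in Definition 2.4.1) with |<M_i, M_j>| > eps.
eps = 0.2
pair_corr = np.abs(M @ M.T) / X_size             # |<M_i, M_j>| = |E_x[M_{i,x} M_{j,x}]|
delta = (pair_corr > eps).sum(axis=1).max() / A_size

gamma = 0.5                                      # any gamma > sqrt(eps)
f_norm2 = np.sqrt(np.mean(f ** 2))
heavy_rows = np.sum(np.abs(M @ f) / X_size >= gamma * f_norm2)
print(heavy_rows, "<=", (delta / (gamma ** 2 - eps)) * A_size)
```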

2.4.2 Learning Sparse Parities

As an application of Lemma 2.4.6 and Theorem 2.3.1, we reprove the main result in [KRT17].

Lemma 2.4.7. Let $T \subseteq \{0,1\}^n$ be an $(\varepsilon,\delta)$-biased set, with $\varepsilon \ge \delta$. Define the matrix $M : \{0,1\}^n \times T \to \{-1,1\}$ by $M(a,x) = (-1)^{a \cdot x}$. Then, the learning task associated with $M$ ("parity learning over $T$") requires either at least $\Omega(\log(1/\varepsilon) \cdot \log(1/\delta))$ memory bits or at least $\mathrm{poly}(1/\varepsilon)$ samples.

Proof. The rows $\{M_a\}_{a \in \{0,1\}^n}$ are $(\varepsilon,\delta)$-almost orthogonal vectors. Thus, by Lemma 2.4.6, we get that $M$ is a $(k,\ell)$-$L_2$-extractor with error $2^{-r}$, for $k = \Omega(\log(1/\delta))$ and $r = \ell = \Omega(\log(1/\varepsilon))$ (assuming $\varepsilon \ge \delta$). By Theorem 2.3.1, we get the required memory-samples lower bound.
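As a concrete illustration of the learning task in Lemma 2.4.7, the following short Python sketch (an illustration only; the parameters, the choice $T = T_\ell$ and the helper names are assumptions made for the example) generates the sample stream seen by the learner: a secret $x$ drawn uniformly from $T$, and samples $a$ drawn uniformly from $\{0,1\}^n$ with label $(-1)^{a\cdot x}$.

```python
# Illustrative sketch of the sample stream for "parity learning over T":
# the secret x is uniform in T, and each sample is a uniform a in {0,1}^n
# together with the label M(a, x) = (-1)^{a.x}.
import random
from itertools import combinations

def weight_slice(n, l):
    """T_l: all n-bit vectors of Hamming weight exactly l."""
    vectors = []
    for support in combinations(range(n), l):
        v = [0] * n
        for i in support:
            v[i] = 1
        vectors.append(tuple(v))
    return vectors

def sample_stream(T, n, num_samples, seed=0):
    rng = random.Random(seed)
    x = rng.choice(T)                                   # the unknown sparse parity
    for _ in range(num_samples):
        a = tuple(rng.randrange(2) for _ in range(n))   # uniform sample
        b = (-1) ** (sum(ai & xi for ai, xi in zip(a, x)) % 2)
        yield a, b

n, l = 12, 3
for a, b in sample_stream(weight_slice(n, l), n, num_samples=3):
    print(a, b)
```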

Lemma 2.4.8 ([KRT17]). There exists a (sufficiently small) constant $c > 0$ such that the following holds. Let $T_\ell = \{x \in \{0,1\}^n : \sum_i x_i = \ell\}$. For any $\varepsilon > (8\ell/n)^{\ell/2}$, $T_\ell$ is an $(\varepsilon,\delta)$-biased set for $\delta = 2 \cdot e^{-\varepsilon^{2/\ell} \cdot n/8}$. In particular, $T_\ell$ is an $(\varepsilon,\delta)$-biased set for
1. $\varepsilon = 2^{-c\ell}$, $\delta = 2^{-cn}$, assuming $\ell \le cn$.
2. $\varepsilon = \ell^{-c\ell}$, $\delta = 2^{-cn/\ell^{0.01}}$, assuming $\ell \le n^{0.9}$.

Let $c > 0$ be the constant mentioned in Lemma 2.4.8. The following lemma complements Lemma 2.4.8 in the range of parameters $cn \le \ell \le n/2$. It shows that $T_\ell$ is $(2^{-\Omega(n)}, 2^{-\Omega(n)})$-biased in this case. The proof is a simple application of Parseval's identity (see [KRT17]).

Lemma 2.4.9 ([KRT17, Lemma 4.1]). Let $T \subseteq \{0,1\}^n$ be any set. Then, $T$ is an $(\varepsilon,\delta)$-biased set for $\delta = \frac{1}{|T| \cdot \varepsilon^2}$. In particular, $T$ is $(|T|^{-1/3}, |T|^{-1/3})$-biased.

We get the following as an immediate corollary.

Corollary 2.4.10. Let $T_\ell = \{x \in \{0,1\}^n : \sum_i x_i = \ell\}$.
1. Assuming $\ell \le n/2$, parity learning over $T_\ell$ requires either at least $\Omega(n \cdot \ell)$ memory bits or at least $2^{\Omega(\ell)}$ samples.
2. Assuming $\ell \le n^{0.9}$, parity learning over $T_\ell$ requires either at least $\Omega(n \cdot \ell^{0.99})$ memory bits or at least $\ell^{\Omega(\ell)}$ samples.


2.4.3 Learning from Sparse Linear Equations

Lemma 2.4.5 and the proof of Lemma 2.4.7 give the following immediate corollary.

Lemma 2.4.11. Let $T \subseteq \{0,1\}^n$ be an $(\varepsilon,\delta)$-biased set, with $\varepsilon \ge \delta$. Then, the matrix $M : T \times \{0,1\}^n \to \{-1,1\}$ defined by $M(a,x) = (-1)^{a \cdot x}$ is a $(k,\ell)$-$L_2$-extractor with error $2^{-r}$, for $\ell = \Omega(\log(1/\delta))$ and $k = r = \Omega(\log(1/\varepsilon))$.

Thus, the learning task associated with $M$ ("learning from equations in $T$") requires either at least $\Omega(\log(1/\varepsilon) \cdot \log(1/\delta))$ memory bits or at least $\mathrm{poly}(1/\varepsilon)$ samples.

We get the following as an immediate corollary of Lemmas 2.4.8, 2.4.9 and 2.4.11.

Corollary 2.4.12. Let $T_\ell = \{x \in \{0,1\}^n : \sum_i x_i = \ell\}$.
1. Assuming $\ell \le n/2$, learning from equations in $T_\ell$ requires either at least $\Omega(n \cdot \ell)$ memory bits or at least $2^{\Omega(\ell)}$ samples.
2. Assuming $\ell \le n^{0.9}$, learning from equations in $T_\ell$ requires either at least $\Omega(n \cdot \ell^{0.99})$ memory bits or at least $\ell^{\Omega(\ell)}$ samples.

2.4.4 Learning from Low Degree Equations

In the following, we consider multilinear polynomials in $\mathbb{F}_2[x_1, \ldots, x_n]$ of degree at most $d$. We denote by $P_d$ the linear space of all such polynomials. We denote the bias of a polynomial $p \in \mathbb{F}_2[x_1, \ldots, x_n]$ by
$$\mathrm{bias}(p) := \mathop{\mathbb{E}}_{x \in \mathbb{F}_2^n}\left[(-1)^{p(x)}\right].$$

We rely on the following result of Ben-Eliezer, Hod and Lovett [BEHL12], showing that random low-degree polynomials have very small bias with very high probability.

Lemma 2.4.13 ([BEHL12, Lemma 2]). Let $d \le 0.99 \cdot n$. Then,
$$\Pr_{p \in_R P_d}\left[|\mathrm{bias}(p)| > 2^{-c_1 \cdot n/d}\right] \le 2^{-c_2 \cdot \binom{n}{\le d}},$$
where $0 < c_1, c_2 < 1$ are absolute constants.

Corollary 2.4.14. Let $d, n \in \mathbb{N}$, with $d \le 0.99 \cdot n$. Let $M : P_d \times \mathbb{F}_2^n \to \{-1,1\}$ be the matrix defined by $M(p,x) = (-1)^{p(x)}$ for any $p \in P_d$ and $x \in \mathbb{F}_2^n$. Then, the vectors $\{M_p : p \in P_d\}$ are $(\varepsilon,\delta)$-almost orthogonal, for $\varepsilon = 2^{-c_1 n/d}$ and $\delta = 2^{-c_2 \binom{n}{\le d}}$ (where $0 < c_1, c_2 < 1$ are absolute constants). In particular, $M$ is a $(k,\ell)$-$L_2$-extractor with error $2^{-r}$, for $k = \Omega\!\left(\binom{n}{\le d}\right)$ and $r = \ell = \Omega(n/d)$.

Thus, the learning task associated with $M$ ("learning from degree-$d$ equations") requires either at least $\Omega\!\left(\binom{n}{\le d} \cdot n/d\right) \ge \Omega\!\left((n/d)^{d+1}\right)$ memory bits or at least $2^{\Omega(n/d)}$ samples.

Proof. We reinterpret [BEHL12, Lemma 2]. Since $P_d$ is a linear subspace, for any fixed $p \in P_d$ and a uniformly random $q \in_R P_d$, we have that $p + q$ is a uniformly random polynomial in $P_d$. Thus, for any fixed $p \in P_d$, at most a $2^{-c_2 \cdot \binom{n}{\le d}}$ fraction of the polynomials $q \in P_d$ have
$$|\mathrm{bias}(p+q)| \ge 2^{-c_1 \cdot n/d}.$$
In other words, since $\langle M_p, M_q\rangle = \mathbb{E}_x[(-1)^{p(x)+q(x)}] = \mathrm{bias}(p+q)$, we get that $\{M_p : p \in P_d\}$ are $(\varepsilon,\delta)$-almost orthogonal vectors for $\varepsilon = 2^{-c_1 \cdot n/d}$ and $\delta = 2^{-c_2 \cdot \binom{n}{\le d}}$. We apply Lemma 2.4.6 to get the "in particular" part, noting that in our case $\Omega\!\left(\min\{\log(1/\varepsilon), \log(1/\delta)\}\right) = \Omega(n/d)$. We apply Theorem 2.3.1 to get the "thus" part.
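The following small experiment (an illustration only; $n$, $d$ and the helper names are assumptions made for the example) samples a random multilinear polynomial of degree at most $d$ over $\mathbb{F}_2$ and computes its bias by exact enumeration, illustrating the behavior described in Lemma 2.4.13.

```python
# Illustrative experiment for Lemma 2.4.13: a random polynomial p in P_d is
# sampled by choosing a random coefficient for every monomial of degree <= d,
# and bias(p) = E_x[(-1)^{p(x)}] is computed exactly (n is kept small).
import random
from itertools import combinations, product

def random_poly(n, d, rng):
    """Return the set of monomials (tuples of variable indices) with coefficient 1."""
    monomials = [m for k in range(d + 1) for m in combinations(range(n), k)]
    return {m for m in monomials if rng.randrange(2)}

def evaluate(poly, x):
    return sum(all(x[i] for i in mono) for mono in poly) % 2

def bias(poly, n):
    values = [(-1) ** evaluate(poly, x) for x in product((0, 1), repeat=n)]
    return sum(values) / len(values)

rng = random.Random(1)
n, d = 12, 2
p = random_poly(n, d, rng)
print(abs(bias(p)))   # typically exponentially small in n/d for a random p
```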

2.4.5 Learning Low Degree Polynomials

Lemma 2.4.5 and Corollary 2.4.14 give the following immediate corollary.

Corollary 2.4.15. Let $d, n \in \mathbb{N}$, with $d \le 0.99 \cdot n$. Let $M : \mathbb{F}_2^n \times P_d \to \{-1,1\}$ be the matrix defined by $M(a,p) = (-1)^{p(a)}$ for any $p \in P_d$ and $a \in \mathbb{F}_2^n$. Then, $M$ is a $(k,\ell)$-$L_2$-extractor with error $2^{-r}$, for $\ell = \Omega\!\left(\binom{n}{\le d}\right)$ and $k = r = \Omega(n/d)$.

Thus, the learning task associated with $M$ ("learning degree-$d$ polynomials") requires either at least $\Omega\!\left(\binom{n}{\le d} \cdot n/d\right) \ge \Omega\!\left((n/d)^{d+1}\right)$ memory bits or at least $2^{\Omega(n/d)}$ samples.

60

Page 71: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

2.4.6 Relation to Statistical-Query-Dimension

Let $C$ be a class of functions mapping $A$ to $\{-1,1\}$. The Statistical-Query-Dimension of $C$, denoted $\mathrm{SQdim}(C)$, is defined to be the maximal $m$ such that there exist functions $f_1, \ldots, f_m \in C$ with $|\langle f_i, f_j\rangle| \le 1/m$ for all $i \neq j$ [Kea98, BFJ+94]. As a corollary of Lemma 2.4.5 and Lemma 2.4.6, we get the following.

Corollary 2.4.16. Let $C$ be a class of functions mapping $A$ to $\{-1,1\}$. Let $\mathrm{SQdim}(C) = m$. Let $f_1, \ldots, f_m \in C$ with $|\langle f_i, f_j\rangle| \le 1/m$ for any $i \neq j$. Define the matrix $M : A \times [m] \to \{-1,1\}$ whose columns are the vectors $f_1, \ldots, f_m$. Then, $M$ is a $(k,\ell)$-$L_2$-extractor with error $2^{-r}$ for $k = \ell = r = \Omega(\log m)$.

Thus, the learning task associated with $M$ requires either at least $\Omega(\log^2 m)$ memory bits or at least $m^{\Omega(1)}$ samples.

Proof. Consider the rows of the matrix $M^t$. By our assumption, the rows of $M^t$ are $(1/m, 1/m)$-almost orthogonal. Thus, by Lemma 2.4.6, $M^t$ is a $(k,\ell)$-$L_2$-extractor with error $2^{-r}$, for $k = \ell = r = \Omega(\log m)$. By Lemma 2.4.5, $M$ is a $(k,\ell)$-$L_2$-extractor with error $2^{-r}$ for $k = \ell = r = \Omega(\log m)$. We apply Theorem 2.3.1 to get the "thus" part.

In fact, we get the following (slight) generalization. Suppose that there are $m' \ge m$ functions $f_1, \ldots, f_{m'}$ mapping $A$ to $\{-1,1\}$ with $|\langle f_i, f_j\rangle| \le 1/m$ for all $i \neq j$. Then, the learning task associated with the matrix whose columns are $f_1, \ldots, f_{m'}$ requires either at least $\Omega(\log(m) \cdot \log(m'))$ memory bits or at least $m^{\Omega(1)}$ samples.
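As a toy illustration of this setup, the sketch below (an illustration only; the chosen function class and parameter values are assumptions made for the example) arranges a family of pairwise-uncorrelated functions as the columns of $M$ and verifies the pairwise-correlation condition $|\langle f_i, f_j\rangle| \le 1/m$ used above. Parity functions over $A = \{0,1\}^n$ are the classical example of a class with exponentially large statistical-query dimension.

```python
# Illustrative example of the matrix in Corollary 2.4.16: columns are parity
# functions over A = {0,1}^n, which are pairwise orthogonal, so m = 2^n
# functions witness SQdim >= 2^n.
import numpy as np
from itertools import product

n = 4
A = np.array(list(product((0, 1), repeat=n)))      # the sample space {0,1}^n
S = np.array(list(product((0, 1), repeat=n)))      # one parity function per subset S
M = (-1.0) ** ((A @ S.T) % 2)                      # M[a, j] = (-1)^{S_j . a}; columns are f_j

corr = (M.T @ M) / len(A)                          # <f_i, f_j> w.r.t. a uniform sample a
np.fill_diagonal(corr, 0.0)                        # only i != j is constrained
m = M.shape[1]
print(np.abs(corr).max() <= 1.0 / m)               # True: the columns satisfy the SQdim condition
```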

2.4.7 Comparison with [Raz17]

Small Matrix Norm Implies $L_2$-Extractor. This result generalizes the result of [Raz17] that if a matrix $M : A \times X \to \{-1,1\}$ is such that the largest singular value of $M$, $\sigma_{\max}(M)$, is at most $|A|^{\frac{1}{2}} |X|^{\frac{1}{2}-\varepsilon}$, then the learning problem represented by $M$ requires either a memory of size at least $\Omega((\varepsilon n)^2)$ or at least $2^{\Omega(\varepsilon n)}$ samples, where $n = \log_2 |X|$. We use the following lemma:

61

Page 72: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Lemma 2.4.17. If a matrix $M : A \times X \to \{-1,1\}$ satisfies $\sigma_{\max}(M) \le |A|^{\frac{1}{2}} \cdot |X|^{\frac{1}{2}-\varepsilon}$, then $M$ is a $(k,\ell)$-$L_2$-Extractor with error $2^{-r}$ for every $k, \ell, r > 0$ such that $k + 2\ell + 2r \le 2\varepsilon n$ (where $n = \log_2(|X|)$).

Theorem 2.3.1 and Lemma 2.4.17 with $k = \varepsilon n$, $\ell = r = \frac{\varepsilon n}{4}$ imply the main result of [Raz17].

Proof. As $\sigma_{\max}(M) \le |A|^{\frac{1}{2}} |X|^{\frac{1}{2}-\varepsilon}$, for a non-negative function $f : X \to \mathbb{R}$, $\|M \cdot f\|_2 \le |X|^{1-\varepsilon} \cdot \|f\|_2$. In other words,
$$\left(\mathop{\mathbb{E}}_{a \in_R A}\left[|(M \cdot f)_a|^2\right]\right)^{1/2} \le |X|^{1-\varepsilon} \cdot \|f\|_2$$
$$\implies \left(\mathop{\mathbb{E}}_{a \in_R A}\left[|\langle M_a, f\rangle|^2\right]\right)^{1/2} \le |X|^{-\varepsilon} \cdot \|f\|_2$$
$$\implies \left(\mathop{\mathbb{E}}_{a \in_R A}\left[\left(\frac{|\langle M_a, f\rangle|}{\|f\|_1}\right)^2\right]\right)^{1/2} \le 2^{-\varepsilon n} \cdot \frac{\|f\|_2}{\|f\|_1}.$$
Now if $\frac{\|f\|_2}{\|f\|_1} \le 2^{\ell}$ for some $\ell > 0$, then
$$\mathop{\mathbb{E}}_{a \in_R A}\left[\left(\frac{|\langle M_a, f\rangle|}{\|f\|_1}\right)^2\right] \le 2^{-2\varepsilon n + 2\ell}.$$
Applying Markov's inequality, we get that there are at most $2^{-2\varepsilon n + 2\ell + 2r} \cdot |A|$ rows $a \in A$ with $\frac{|\langle M_a, f\rangle|}{\|f\|_1} \ge 2^{-r}$.
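For intuition on the hypothesis of Lemma 2.4.17, the snippet below (an illustration only; the choice of matrix and the parameter $n$ are assumptions made for the example) computes $\sigma_{\max}$ for the parity-learning matrix $M(a,x) = (-1)^{a\cdot x}$ on $A = X = \{0,1\}^n$, whose singular values are all $2^{n/2}$, i.e., $\sigma_{\max}(M) = |A|^{1/2}|X|^{1/2-\varepsilon}$ with $\varepsilon = 1/2$.

```python
# Illustrative check: the 2^n x 2^n parity matrix is a Hadamard matrix, so every
# singular value equals 2^{n/2}. This is the extreme case of the hypothesis
# sigma_max(M) <= |A|^{1/2} |X|^{1/2 - eps} (eps = 1/2).
import numpy as np
from itertools import product

n = 6
pts = np.array(list(product((0, 1), repeat=n)))
M = (-1.0) ** ((pts @ pts.T) % 2)                 # M[a, x] = (-1)^{a.x}
sigma_max = np.linalg.svd(M, compute_uv=False).max()
print(sigma_max, 2 ** (n / 2))                    # both equal 2^{n/2}
```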

2.4.8 Comparison with [MM18]

We will now show that the results in this chapter subsume the results of [MM18]. Moshkovitz and Moshkovitz [MM18] consider matrices $M : A \times X \to \{-1,1\}$, and a parameter $d$, with the property that for any $A' \subseteq A$ and $X' \subseteq X$ the bias of the submatrix $M_{A' \times X'}$ is at most $\frac{d}{\sqrt{|A'| \cdot |X'|}}$. They define $m = \frac{|A| \cdot |X|}{d^2}$ and prove that any learning algorithm for the corresponding learning problem requires either a memory of size $\Omega((\log m)^2)$ or $m^{\Omega(1)}$ samples. We note that this is essentially the same result as the one proved in [Raz17], and since it is always true that $d^2 \ge \max\{|X|, |A|\}$, the bound obtained on the memory is at most $\Omega\!\left(\min\{(\log |X|)^2, (\log |A|)^2\}\right)$.


Note that if $M$ satisfies the property required by [MM18], then, in particular, any submatrix $A' \times X'$ of $M$ with at least $m^{-1/4} \cdot |A|$ rows and at least $m^{-1/4} \cdot |X|$ columns has a bias of at most
$$\frac{d}{\sqrt{|A'| \cdot |X'|}} = \frac{d}{\sqrt{|A| \cdot |X|}} \cdot \frac{\sqrt{|A| \cdot |X|}}{\sqrt{|A'| \cdot |X'|}} \le m^{-1/2} \cdot m^{1/4} = m^{-1/4}.$$
Thus, we can apply Corollary 2.3.3, with $k = \ell = r = \frac{1}{4}\log(m)$, to obtain the same result.

2.5 Generalization to Non-Product Distributions

We state time-space lower bounds for a general joint distribution $p_{A,X} : A \times X \to [0,1]$ of a joint random variable $(\mathcal{A},\mathcal{X})$ over the space of samples $a$ and secrets $x$, respectively. We denote by $p_{A,X} : A \times X \to [0,1]$ the joint distribution of $(\mathcal{A},\mathcal{X})$, by $p_A : A \to [0,1]$ the marginal distribution of $\mathcal{A}$, i.e., $p_A(a') = \sum_{x'} p_{A,X}(a', x')$, by $p_X : X \to [0,1]$ the marginal distribution of $\mathcal{X}$, i.e., $p_X(x') = \sum_{a'} p_{A,X}(a', x')$, and by $p_{A|X=x} : A \to [0,1]$ the conditional distribution of $\mathcal{A}$ given $\mathcal{X} = x$, i.e., $p_{A|X=x}(a') = \frac{p_{A,X}(a', x)}{p_X(x)}$.

Viewing the Learning Problem as a Matrix

Let $X$, $A$ be two finite sets of size larger than 1. Let $n = \log_2 |X|$.

Let $M : A \times X \to [0,1]$ be the matrix corresponding to a joint probability distribution $p_{A,X}$ over the space $A \times X$, defined by $M(a,x) = p_{A|X=x}(a)$. We assume that the marginal distribution $p_X$ is the uniform distribution over $X$. Thus, $M(a,x) = p_{A|X=x}(a) = 2^n \cdot p_{A,X}(a,x)$. Also, w.l.o.g. assume that $p_A(a) > 0$ for all $a \in A$.

The matrix $M$ corresponds to the following learning problem: There is an unknown element $x \in X$ that was chosen uniformly at random. A learner tries to learn $x$ from samples $a$, where $a \in A$ is chosen at random with probability $M(a,x)$. That is, the learning algorithm is given a stream of (independently chosen) samples $a_1, a_2, \ldots$, where each $a_t$ has the probability distribution $p_{A|X=x}$.

Let $\overline{M} : A \times X \to [-1,1]$ be the normalized matrix corresponding to $M$, defined by $\overline{M}(a,x) = M(a,x) - p_A(a)$.
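The following short sketch (an illustration only; the toy joint distribution and dimensions are assumptions made for the example) builds these objects numerically: a matrix $M(a,x) = p_{A|X=x}(a)$ from a joint distribution with uniform marginal on $X$, the marginal $p_A$, and the normalized matrix $\overline{M}$.

```python
# Illustrative sketch of the matrices defined above.
import numpy as np

rng = np.random.default_rng(0)
A_size, X_size = 5, 4

# Pick an arbitrary conditional distribution p_{A|X=x} for every x; the marginal
# on X is uniform by construction, so p_{A,X}(a, x) = (1/|X|) * p_{A|X=x}(a).
cond = rng.random((A_size, X_size))
cond /= cond.sum(axis=0, keepdims=True)        # columns sum to 1: cond[:, x] = p_{A|X=x}
joint = cond / X_size                          # the joint distribution p_{A,X}

M = cond                                       # M(a, x) = p_{A|X=x}(a)
p_A = joint.sum(axis=1)                        # marginal distribution of A
M_bar = M - p_A[:, None]                       # normalized matrix

print(np.isclose(joint.sum(), 1.0), np.allclose(M_bar.sum(axis=1), 0.0))
```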

New Definition of $L_2$-Extractors for the Generalized Matrix

Definition 2.5.1. Let $X$, $A$ be two finite sets and let $M : A \times X \to [0,1]$ be associated with a probability distribution $p_{A,X}$ (as defined above). $M : A \times X \to [0,1]$ is a $(k, \ell, p)$-$L_2$-Extractor with error $2^{-r}$ if for every non-negative $f : X \to \mathbb{R}$ with $\frac{\|f\|_2}{\|f\|_1} \le 2^{\ell}$, the set of rows $a$ in $A$ with
$$\frac{|\langle \overline{M}_a, f\rangle|}{\|f\|_1} \ge 2^{-r} \cdot p_A(a)$$
has probability mass at most $2^{-k}$ under the distribution $p_A$, and if
$$\forall a \in A,\ x \in X : \quad \frac{M(a,x)}{p_A(a)} = \frac{p_{A|X=x}(a)}{p_A(a)} \le 2^p.$$

We prove that if the learning matrix $M$ is a $(k, \ell, p)$-$L_2$-Extractor with error $2^{-r}$, then the learning problem associated with $M$ requires either $\Omega\!\left(\frac{k\ell}{p}\right)$ memory size or $2^{\Omega(r)}$ samples. The proof is similar to the proof of Theorem 2.3.1.

Branching Program for the Generalized Learning Problem

In the following definition, we model the learner for the learning problem that corresponds to

the matrixM, by a branching program.

Definition 2.5.2. Branching Program for the Learning Problem: A branching program of

length m and width d, for learning, is a directed (multi) graph with vertices arranged in m + 1

layers containing at most d vertices each. In the first layer, that we think of as layer 0, there is only

one vertex, called the start vertex. A vertex of outdegree 0 is called a leaf. All vertices in the last

layer are leaves (but there may be additional leaves). Every non-leaf vertex in the program has |A|

outgoing edges, labeled by elements a ∈ A, with exactly one edge labeled by each such a, and all

these edges going into vertices in the next layer. Each leaf $v$ in the program is labeled by an element $\tilde{x}(v) \in X$, that we think of as the output of the program on that leaf.


Computation-Path: The samples $a_1, \ldots, a_m \in A$ that are given as input define a computation-path in the branching program, by starting from the start vertex and following at step $t$ the edge labeled by $a_t$, until reaching a leaf. The program outputs the label $\tilde{x}(v)$ of the leaf $v$ reached by the computation-path.

Success Probability: The success probability of the program is the probability that $\tilde{x} = x$, where $\tilde{x}$ is the element that the program outputs, and the probability is over $x, a_1, \ldots, a_m$ (where $x$ is uniformly distributed over $X$ and $a_1, \ldots, a_m$ have the probability distribution $p_{A|X=x}$, i.e., $p_{A|X=x}(a) = M(a,x)$).
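The sketch below (an illustration only; the toy transition tables and names are assumptions made for the example) simulates the computation-path of such a branching program on a stream of samples: starting from the start vertex, at each step it follows the outgoing edge labeled by the current sample, and it returns the label of the leaf it reaches.

```python
# Illustrative simulation of a computation-path. A program is given as a
# transition table edges[(vertex, a)] -> next vertex; a vertex with no outgoing
# edges is a leaf, labeled by leaf_label[vertex].
def run_computation_path(start, edges, leaf_label, samples):
    v = start
    for a in samples:
        if (v, a) not in edges:      # reached a leaf before exhausting the stream
            break
        v = edges[(v, a)]
    return leaf_label.get(v)         # the program's guess for x (None if unlabeled)

# Toy width-2 program over samples A = {0, 1} "learning" a one-bit secret.
edges = {("start", 0): "u0", ("start", 1): "u1"}
leaf_label = {"u0": "x=0", "u1": "x=1"}
print(run_computation_path("start", edges, leaf_label, samples=[1, 0]))
```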

2.5.1 Main Theorem

Theorem 2.5.1. Let $\frac{1}{100} < c < \frac{2}{3}$. Fix $\gamma$ to be such that $\frac{3c}{2} < \gamma^2 < 1$.

Let $X$, $A$ be two finite sets. Let $n = \log_2 |X|$. Let $M : A \times X \to [0,1]$ be a matrix associated with a probability distribution $p_{A,X}$ which is a $(k', \ell', p)$-$L_2$-extractor with error $2^{-r'}$, for $p \ge 1$. Letting $C_{\gamma,c} > \max\left\{\frac{6}{2\gamma^2/3 - c}, \frac{1}{1-\gamma}\right\}$ be some sufficiently large constant, we assume that $r' \ge C_{\gamma,c}$ and $\ell', k' \ge C_{\gamma,c} \cdot p$. Let
$$r := \min\left\{\frac{r'}{2},\ \frac{(1-\gamma)k' - p}{2},\ \frac{(1-\gamma)\ell' - p - 1}{2}\right\}. \tag{2.15}$$
Let $B$ be a branching program of length at most $2^r$ and width at most $2^{ck'\ell'/p}$ for the learning problem that corresponds to the matrix $M$. Then, the success probability of $B$ is at most $O(2^{-r})$.

Proof. Let
$$k := \frac{\gamma k'}{p} \quad \text{and} \quad \ell := \gamma\ell'/3. \tag{2.16}$$
Note that by the assumption that $k'$, $\ell'$ and $r'$ are sufficiently large, we get that $k$, $\ell$ and $r$ are also sufficiently large. Since $\ell' \le n$, we have $\ell + r \le \frac{\gamma\ell'}{3} + \frac{(1-\gamma)\ell'}{2} < \frac{\ell'}{2} \le \frac{n}{2}$. Thus,
$$r < n/2 - \ell. \tag{2.17}$$


Let $B$ be a branching program of length $m = 2^r$ and width $d = 2^{ck'\ell'/p}$ for the learning problem that corresponds to the matrix $M$. We will show that the success probability of $B$ is at most $O(2^{-r})$.

2.5.2 The Truncated-Path

Again, we will define the truncated-path, $T$, to be the same as the computation-path of $B$, except that it sometimes stops before reaching a leaf, as defined below.

Recall that for a vertex $v$ in layer-$i$ of $B$, we denote by $E_v$ the event that $T$ reaches the vertex $v$. We denote by $\Pr(v) = \Pr(E_v)$ the probability of $E_v$ (where the probability is over $x, a_1, \ldots, a_m$), and we denote by $P_{x|v} = P_{x|E_v}$ the distribution of the random variable $x$ conditioned on the event $E_v$. Similarly, we denote by $P_{a_{i+1}|v} = P_{a_{i+1}|E_v}$ the distribution of the random variable $a_{i+1}$ conditioned on the event $E_v$.

The following are the three cases in which the truncated-path $T$ stops on a non-leaf $v$, in layer-$i$ of $B$:
1. If $v$ is a, so called, significant vertex, where the $\ell_2$ norm of $P_{x|v}$ is non-negligible.
2. If $P_{x|v}(x)$ is non-negligible.
3. If $\frac{(\overline{M} \cdot P_{x|v})(a_{i+1})}{p_A(a_{i+1})}$ is non-negligible. (Intuitively, this means that $T$ is about to traverse a "bad" edge, which is traversed with a non-negligibly higher or lower probability than its probability under $p_A$ and hence might give a lot of new information about $x$.)

Next, we describe the three cases more formally. The definitions of significant vertices and

values remain the same (as in Section 2.3). We define them again just for convenience.

Significant Vertices

We say that a vertex $v$ in layer-$i$ of $B$ is significant if
$$\|P_{x|v}\|_2 > 2^{\ell} \cdot 2^{-n}.$$


Significant Values

Even if $v$ is not significant, $P_{x|v}$ may have relatively large values. For a vertex $v$ in layer-$i$ of $B$, denote by $\mathrm{Sig}(v)$ the set of all $x' \in X$ such that
$$P_{x|v}(x') > 2^{2\ell+2r} \cdot 2^{-n}.$$

Bad Edges

For a vertex $v$ in layer-$i$ of $B$, denote by $\mathrm{Bad}(v)$ the set of all $\alpha \in A$ such that
$$\left|(\overline{M} \cdot P_{x|v})(\alpha)\right| \ge 2^{-r'} \cdot p_A(\alpha).$$

We define the truncated-path T again just for completeness.

The Truncated-Path T

We define T by induction on the layers of the branching program B. Assume that we already

defined T until it reaches a vertex v in layer-i of B. The path T stops on v if (at least) one of the

following occurs:

1. v is significant.

2. x ∈ Sig(v).

3. ai+1 ∈ Bad(v).

4. v is a leaf.

Otherwise, $T$ proceeds by following the edge labeled by $a_{i+1}$ (same as the computation-path).

2.5.3 Proof of Theorem 2.5.1

Since $T$ follows the computation-path of $B$, except that it sometimes stops before reaching a leaf, the success probability of $B$ is bounded (from above) by the probability that $T$ stops before reaching a leaf, plus the probability that $T$ reaches a leaf $v$ with $\tilde{x}(v) = x$.

The main lemma needed for the proof of Theorem 2.5.1 is Lemma 2.5.1, which shows that the probability that $T$ reaches a significant vertex is at most $O(2^{-r})$.

Lemma 2.5.1. The probability that T reaches a significant vertex is at most O(2−r).

Lemma 2.5.1 is proved in Section 2.5.4. We will now show how the proof of Theorem 2.5.1

follows from that lemma.

Lemma 2.5.1 shows that the probability that T stops on a non-leaf vertex, because of the first

reason (i.e., that the vertex is significant), is small. The next two lemmas imply that the probabil-

ities that T stops on a non-leaf vertex, because of the second and third reasons, are also small.

Claim 2.5.1. If v is a non-significant vertex of B then

Prx[x ∈ Sig(v) | Ev] ≤ 2−2r.

The proof is exactly the same as that of Claim 2.3.1.

Claim 2.5.2. If v is a non-significant vertex, in layer-i of B, then

Prai+1

[ai+1 ∈ Bad(v) | Ev] ≤ 2−2r.

Proof. Since $v$ is not significant, $\|P_{x|v}\|_2 \le 2^{\ell} \cdot 2^{-n}$. Since $P_{x|v}$ is a distribution, $\|P_{x|v}\|_1 = 2^{-n}$. Thus,
$$\frac{\|P_{x|v}\|_2}{\|P_{x|v}\|_1} \le 2^{\ell} \le 2^{\ell'}.$$
Since $M$ is a $(k', \ell', p)$-$L_2$-extractor with error $2^{-r'}$, there is at most $2^{-k'}$ probability mass on $\alpha \in A$ (under $p_A$) with
$$\frac{\left|\langle \overline{M}_\alpha, P_{x|v}\rangle\right|}{\|P_{x|v}\|_1} = |(\overline{M} \cdot P_{x|v})_\alpha| \ge 2^{-r'} \cdot p_A(\alpha).$$

However, in the statement of the claim, $a_{i+1}$ has the probability distribution $P_{a_{i+1}|v}$, which is defined as follows:
$$P_{a_{i+1}|v}(a') := \Pr_{a_{i+1}}[a_{i+1} = a' \mid E_v] = \sum_{x'} \Pr_{a_{i+1},x}[a_{i+1} = a',\ x = x' \mid E_v]$$
$$= \sum_{x'} \Pr_x[x = x' \mid E_v] \cdot \Pr_{a_{i+1}}[a_{i+1} = a' \mid x = x', E_v]$$
$$= \sum_{x'} \Pr_x[x = x' \mid E_v] \cdot \Pr_{a_{i+1}}[a_{i+1} = a' \mid x = x'].$$
The last equality follows because given $x'$, the sample $a_{i+1}$ is chosen independently of the previous samples. Therefore,
$$P_{a_{i+1}|v}(a) = \sum_{x'} P_{x|v}(x') \cdot P_{a_{i+1}|x=x'}(a) = \sum_{x'} P_{x|v}(x') \cdot p_{A|X=x'}(a) \le 2^p \cdot p_A(a), \tag{2.18}$$
where the last inequality follows from the assumption that for all $a, x'$, $p_{A|X=x'}(a) \le 2^p \cdot p_A(a)$. We get
$$\Pr_{a_{i+1}}[a_{i+1} \in \mathrm{Bad}(v) \mid E_v] = \sum_{a \in \mathrm{Bad}(v)} P_{a_{i+1}|v}(a) \le 2^p \cdot \sum_{a \in \mathrm{Bad}(v)} p_A(a) \le 2^{-k'+p},$$
which completes the proof since $k' - p \ge 2r$ (Equation (2.15)).

We can now use Lemma 2.5.1, Claim 2.5.1 and Claim 2.5.2 to prove that the probability that $T$ stops before reaching a leaf is at most $O(2^{-r})$. Lemma 2.5.1 shows that the probability that $T$ reaches a significant vertex, and hence stops because of the first reason, is at most $O(2^{-r})$. Assuming that $T$ doesn't reach any significant vertex (in which case it would have stopped because of the first reason), Claim 2.5.1 shows that in each step, the probability that $T$ stops because of the second reason is at most $2^{-2r}$. Taking a union bound over the $m = 2^r$ steps, the total probability that $T$ stops because of the second reason is at most $2^{-r}$. In the same way, assuming that $T$ doesn't reach any significant vertex (in which case it would have stopped because of the first reason), Claim 2.5.2 shows that in each step, the probability that $T$ stops because of the third reason is at most $2^{-2r}$. Again, taking a union bound over the $2^r$ steps, the total probability that $T$ stops because of the third reason is at most $2^{-r}$. Thus, the total probability that $T$ stops (for any reason) before reaching a leaf is at most $O(2^{-r})$.

Recall that if $T$ doesn't stop before reaching a leaf, it just follows the computation-path of $B$. Recall also that by Lemma 2.5.1, the probability that $T$ reaches a significant leaf is at most $O(2^{-r})$. Thus, to bound (from above) the success probability of $B$ by $O(2^{-r})$, it remains to bound the probability that $T$ reaches a non-significant leaf $v$ with $\tilde{x}(v) = x$. Claim 2.5.3 shows that for any non-significant leaf $v$, conditioned on the event that $T$ reaches $v$, the probability that $\tilde{x}(v) = x$ is at most $2^{-r}$, which completes the proof of Theorem 2.5.1.

Claim 2.5.3. If $v$ is a non-significant leaf of $B$ then
$$\Pr[\tilde{x}(v) = x \mid E_v] \le 2^{-r}.$$
The proof is the same as that of Claim 2.3.3.

This completes the proof of Theorem 2.5.1.

2.5.4 Proof of Lemma 2.5.1

Proof. We need to prove that the probability that T reaches any significant vertex is at most

O(2−r). Let s be a significant vertex of B. We will bound from above the probability that T

reaches s, and then use a union bound over all significant vertices of B. Interestingly, the upper

bound on the width of B is used only in the union bound.

The Distributions Px|v and Px|e

Recall that for a vertex v of B in layer-i, we denote by Ev the event that T reaches the vertex v.

For simplicity, we denote by Pr(v) = Pr(Ev) the probability for Ev (where the probability is

over x, a1, . . . , am), and we denote by Px|v = Px|Ev the distribution of the random variable x

conditioned on the event Ev. Similarly, we denote by Pai+1|v = Pai+1|Ev the distribution of the

random variable ai+1 conditioned on the event Ev.


Similarly, for an edge e of the branching program B, let Ee be the event that T traverses the

edge e. Denote, Pr(e) = Pr(Ee) (where the probability is over x, a1, . . . , am), and Px|e = Px|Ee .

Similarly, for a pair (v, a) where v is a vertex of B in layer-i and a ∈ A is a possible sample we

denote by Px|a,v(x′) the probability Pr[x = x′|Ev and (ai+1 = a)].

Claim 2.5.4. For any edge $e = (v,u)$ of $B$, labeled by $a$, such that $\Pr(e) > 0$, for any $x' \in X$,
$$P_{x|e}(x') = \begin{cases} 0 & \text{if } x' \in \mathrm{Sig}(v) \\ \frac{p_{A|X=x'}(a) \cdot P_{x|v}(x')}{p_A(a) \cdot c_e} & \text{if } x' \notin \mathrm{Sig}(v) \end{cases}$$
where $c_e$ is a normalization factor that satisfies
$$c_e \ge 1 - 2 \cdot 2^{-2r} > \tfrac{1}{2}.$$

Proof. Let $e = (v,u)$ be an edge of $B$, labeled by $a$, such that $\Pr(e) > 0$. Since $\Pr(e) > 0$, the vertex $v$ is not significant (as otherwise $T$ always stops on $v$ and hence $\Pr(e) = 0$). Also, since $\Pr(e) > 0$, we know that $a \notin \mathrm{Bad}(v)$ (as otherwise $T$ never traverses $e$ and hence $\Pr(e) = 0$). Assume that $v$ is in layer-$i$ of $B$.

If $T$ reaches $v$, it traverses the edge $e$ if and only if $x \notin \mathrm{Sig}(v)$ (as otherwise $T$ stops on $v$) and $a_{i+1} = a$. As an edge $e$ is equivalent to the pair $(v,a)$, for any $x' \in X$,
$$P_{x|e}(x') = \begin{cases} 0 & \text{if } x' \in \mathrm{Sig}(v) \\ P_{x|a,v}(x') \cdot c_1^{-1} & \text{if } x' \notin \mathrm{Sig}(v) \end{cases}$$
where $c_1$ is a normalization factor, given by $c_1 = \Pr_x[x \notin \mathrm{Sig}(v) \mid E_v]$. Since $v$ is not significant, by Claim 2.5.1, $c_1 \ge 1 - 2^{-2r}$.

We rewrite $P_{x|a,v}(x')$:
$$P_{x|a,v}(x') = \frac{\Pr[a_{i+1} = a \mid x = x', E_v] \cdot P_{x|v}(x')}{P_{a_{i+1}|v}(a)} = \frac{p_{A|X=x'}(a) \cdot P_{x|v}(x')}{P_{a_{i+1}|v}(a)}.$$
Thus, for the non-significant values $x'$,
$$P_{x|e}(x') = \frac{p_{A|X=x'}(a) \cdot P_{x|v}(x')}{c_1 \cdot P_{a_{i+1}|v}(a)}.$$
Rearranging, we get that for any $x' \in X$,
$$P_{x|e}(x') = \begin{cases} 0 & \text{if } x' \in \mathrm{Sig}(v) \\ \frac{p_{A|X=x'}(a) \cdot P_{x|v}(x')}{p_A(a) \cdot c_e} & \text{if } x' \notin \mathrm{Sig}(v) \end{cases}$$
where $c_e$ is the normalization factor, defined as
$$c_e = c_1 \cdot \frac{P_{a_{i+1}|v}(a)}{p_A(a)}.$$
Next, we show that $c_e \ge 1 - 2 \cdot 2^{-2r}$. Recall that, by Equation (2.18),
$$P_{a_{i+1}|v}(a) = \sum_{x'} P_{x|v}(x')\, p_{A|X=x'}(a) = \sum_{x'} P_{x|v}(x')\, M(a,x') = (M \cdot P_{x|v})_a = p_A(a) + (\overline{M} \cdot P_{x|v})_a.$$
Since $a \notin \mathrm{Bad}(v)$,
$$\frac{P_{a_{i+1}|v}(a)}{p_A(a)} = \frac{p_A(a) + (\overline{M} \cdot P_{x|v})_a}{p_A(a)} \ge 1 - 2^{-r'} \ge 1 - 2^{-2r}$$
(where the last inequality follows since $r \le \frac{r'}{2}$, by Equation (2.15)). Using $c_1 \ge 1 - 2^{-2r}$ we get
$$c_e = c_1 \cdot \frac{P_{a_{i+1}|v}(a)}{p_A(a)} \ge 1 - 2 \cdot 2^{-2r}.$$


Bounding the Norm of $P_{x|s}$

We will show that $\|P_{x|s}\|_2$ cannot be too large. Towards this, we will first prove that for every edge $e$ of $B$ that is traversed by $T$ with probability larger than zero, $\|P_{x|e}\|_2$ cannot be too large.

Claim 2.5.5. For any edge $e$ of $B$, such that $\Pr(e) > 0$,
$$\|P_{x|e}\|_2 \le 2^{p+1} \cdot 2^{\ell} \cdot 2^{-n}.$$

Proof. Let $e = (v,u)$ be an edge of $B$, labeled by $a$, such that $\Pr(e) > 0$. Since $\Pr(e) > 0$, the vertex $v$ is not significant (as otherwise $T$ always stops on $v$ and hence $\Pr(e) = 0$). Thus,
$$\|P_{x|v}\|_2 \le 2^{\ell} \cdot 2^{-n}.$$
By Claim 2.5.4, for any $x' \in X$,
$$P_{x|e}(x') = \begin{cases} 0 & \text{if } x' \in \mathrm{Sig}(v) \\ \frac{p_{A|X=x'}(a)\, P_{x|v}(x')}{p_A(a)\, c_e} & \text{if } x' \notin \mathrm{Sig}(v) \end{cases}$$
where $c_e$ satisfies
$$c_e \ge 1 - 2 \cdot 2^{-2r} > \tfrac{1}{2}.$$
Thus,
$$\|P_{x|e}\|_2 \le c_e^{-1} \cdot \|P_{x|v}\|_2 \cdot \max_x \frac{p_{A|X=x}(a)}{p_A(a)} \le 2 \cdot 2^{\ell} \cdot 2^{-n} \cdot 2^p.$$

Claim 2.5.6.
$$\|P_{x|s}\|_2 \le 2^{p+1} \cdot 2^{\ell} \cdot 2^{-n}.$$
Proof. Let $\Gamma_{\mathrm{in}}(s)$ be the set of all edges $e$ of $B$ that are going into $s$, such that $\Pr(e) > 0$. Note that
$$\sum_{e \in \Gamma_{\mathrm{in}}(s)} \Pr(e) = \Pr(s).$$
By the law of total probability, for every $x' \in X$,
$$P_{x|s}(x') = \sum_{e \in \Gamma_{\mathrm{in}}(s)} \frac{\Pr(e)}{\Pr(s)} \cdot P_{x|e}(x'),$$
and hence by Jensen's inequality,
$$P_{x|s}(x')^2 \le \sum_{e \in \Gamma_{\mathrm{in}}(s)} \frac{\Pr(e)}{\Pr(s)} \cdot P_{x|e}(x')^2.$$
Summing over $x' \in X$, we obtain
$$\|P_{x|s}\|_2^2 \le \sum_{e \in \Gamma_{\mathrm{in}}(s)} \frac{\Pr(e)}{\Pr(s)} \cdot \|P_{x|e}\|_2^2.$$
By Claim 2.5.5, for any $e \in \Gamma_{\mathrm{in}}(s)$,
$$\|P_{x|e}\|_2^2 \le \left(2^{p+1} \cdot 2^{\ell} \cdot 2^{-n}\right)^2.$$
Hence,
$$\|P_{x|s}\|_2^2 \le \left(2^{p+1} \cdot 2^{\ell} \cdot 2^{-n}\right)^2.$$

Similarity to a Target Distribution

Recall that for two functions $f, g : X \to \mathbb{R}^+$, we defined
$$\langle f, g\rangle = \mathop{\mathbb{E}}_{z \in_R X}[f(z) \cdot g(z)].$$
We think of $\langle f, g\rangle$ as a measure of the similarity between a function $f$ and a target function $g$. Typically $f, g$ will be distributions.

Claim 2.5.7.
$$\langle P_{x|s}, P_{x|s}\rangle > 2^{2\ell} \cdot 2^{-2n}.$$


Proof. Since $s$ is significant,
$$\langle P_{x|s}, P_{x|s}\rangle = \|P_{x|s}\|_2^2 > 2^{2\ell} \cdot 2^{-2n}.$$
Claim 2.5.8.
$$\langle U_X, P_{x|s}\rangle = 2^{-2n},$$
where $U_X$ is the uniform distribution over $X$.

Proof. Since $P_{x|s}$ is a distribution,
$$\langle U_X, P_{x|s}\rangle = 2^{-2n} \cdot \sum_{z \in X} P_{x|s}(z) = 2^{-2n}.$$

Measuring the Progress

For $i \in \{0, \ldots, m\}$, let $L_i$ be the set of vertices $v$ in layer-$i$ of $B$ such that $\Pr(v) > 0$. For $i \in \{1, \ldots, m\}$, let $\Gamma_i$ be the set of edges $e$ from layer-$(i-1)$ of $B$ to layer-$i$ of $B$ such that $\Pr(e) > 0$. Recall that $k = \gamma k'/p$ (Equation (2.16)).

For $i \in \{0, \ldots, m\}$, let
$$Z_i = \sum_{v \in L_i} \Pr(v) \cdot \langle P_{x|v}, P_{x|s}\rangle^k.$$
For $i \in \{1, \ldots, m\}$, let
$$Z'_i = \sum_{e \in \Gamma_i} \Pr(e) \cdot \langle P_{x|e}, P_{x|s}\rangle^k.$$
We think of $Z_i, Z'_i$ as measuring the progress made by the branching program towards reaching a state with distribution similar to $P_{x|s}$.

For a vertex $v$ of $B$, let $\Gamma_{\mathrm{out}}(v)$ be the set of all edges $e$ of $B$ that are going out of $v$, such that $\Pr(e) > 0$. Note that
$$\sum_{e \in \Gamma_{\mathrm{out}}(v)} \Pr(e) \le \Pr(v).$$
(We don't always have an equality here, since sometimes $T$ stops on $v$.)


The next four claims show that the progress made by the branching program is slow.

Claim 2.5.9. For every vertex $v$ of $B$, such that $\Pr(v) > 0$,
$$\sum_{e \in \Gamma_{\mathrm{out}}(v)} \frac{\Pr(e)}{\Pr(v)} \cdot \langle P_{x|e}, P_{x|s}\rangle^k \le \langle P_{x|v}, P_{x|s}\rangle^k \cdot \left(1 + 2^{-r}\right)^k + \left(2^{-2n+p+1}\right)^k.$$

Proof. If $v$ is significant or $v$ is a leaf, then $T$ always stops on $v$ and hence $\Gamma_{\mathrm{out}}(v)$ is empty; thus the left hand side is equal to zero and the right hand side is positive, so the claim follows trivially. Thus, we can assume that $v$ is not significant and is not a leaf, and is in layer-$i$ of $B$.

Define $P : X \to \mathbb{R}^+$ as follows. For any $x' \in X$,
$$P(x') = \begin{cases} 0 & \text{if } x' \in \mathrm{Sig}(v) \\ P_{x|v}(x') & \text{if } x' \notin \mathrm{Sig}(v) \end{cases}$$
Note that by the definition of $\mathrm{Sig}(v)$, for any $x' \in X$,
$$P(x') \le 2^{2\ell+2r} \cdot 2^{-n}. \tag{2.19}$$
Define $f : X \to \mathbb{R}^+$ as follows. For any $x' \in X$,
$$f(x') = P(x') \cdot P_{x|s}(x').$$
By Claim 2.5.6 and Equation (2.19),
$$\|f\|_2 \le 2^{2\ell+2r} \cdot 2^{-n} \cdot \|P_{x|s}\|_2 \le 2^{2\ell+2r} \cdot 2^{-n} \cdot 2^{p+1} \cdot 2^{\ell} \cdot 2^{-n} = 2^{3\ell+2r+p+1} \cdot 2^{-2n}. \tag{2.20}$$

By Claim 2.5.4, for any edge $e \in \Gamma_{\mathrm{out}}(v)$, labeled by $a$, for any $x' \in X$,
$$P_{x|e}(x') = \begin{cases} 0 & \text{if } x' \in \mathrm{Sig}(v) \\ \frac{p_{A|X=x'}(a) \cdot P_{x|v}(x')}{p_A(a) \cdot c_e} & \text{if } x' \notin \mathrm{Sig}(v) \end{cases}$$
where $c_e$ satisfies
$$c_e \ge 1 - 2 \cdot 2^{-2r}.$$

Therefore, for any edge $e \in \Gamma_{\mathrm{out}}(v)$, labeled by $a$, for any $x' \in X$,
$$P_{x|e}(x') \cdot P_{x|s}(x') = f(x') \cdot \frac{p_{A|X=x'}(a)}{p_A(a)} \cdot c_e^{-1},$$
and hence we have
$$\langle P_{x|e}, P_{x|s}\rangle = \mathop{\mathbb{E}}_{x' \in_R X}\left[P_{x|e}(x') \cdot P_{x|s}(x')\right] = \mathop{\mathbb{E}}_{x' \in_R X}\left[f(x') \cdot c_e^{-1} \cdot \frac{p_{A|X=x'}(a)}{p_A(a)}\right]$$
$$= \mathop{\mathbb{E}}_{x' \in_R X}\left[f(x') \cdot c_e^{-1} \cdot \left(1 + \frac{\overline{M}(a,x')}{p_A(a)}\right)\right] = \left(\|f\|_1 + \frac{\langle \overline{M}_a, f\rangle}{p_A(a)}\right) \cdot c_e^{-1}$$
$$< \left(\|f\|_1 + \frac{|\langle \overline{M}_a, f\rangle|}{p_A(a)}\right) \cdot \left(1 + 2^{-2r+2}\right) \tag{2.21}$$
(where the last inequality holds by the bound that we have on $c_e$, because we assume that $k', \ell', r'$, and thus $r$, are sufficiently large).

We will now consider two cases:

Case I: $\|f\|_1 < 2^{-2n}$.

In this case, we bound $\frac{|\langle \overline{M}_a, f\rangle|}{p_A(a)} \le (2^p - 1) \cdot \|f\|_1$ (since $f$ is non-negative and $\frac{\overline{M}(a,x)}{p_A(a)} \in [-1, 2^p - 1]$ for all $a \in A$, $x \in X$), and since $(1 + 2^{-2r+2}) < 2$ (as we assume that $k', \ell', r'$, and thus $r$, are sufficiently large), we obtain for any edge $e \in \Gamma_{\mathrm{out}}(v)$,
$$\langle P_{x|e}, P_{x|s}\rangle < 2^{p+1} \cdot 2^{-2n}.$$
Since $\sum_{e \in \Gamma_{\mathrm{out}}(v)} \frac{\Pr(e)}{\Pr(v)} \le 1$, Claim 2.5.9 follows, as the left hand side of the claim is smaller than the second term on the right hand side.


Case II: $\|f\|_1 \ge 2^{-2n}$.

For every $a \in A$, define
$$t(a) = \frac{|\langle \overline{M}_a, f\rangle|}{p_A(a)\,\|f\|_1}.$$
By Equation (2.21),
$$\langle P_{x|e}, P_{x|s}\rangle^k < \|f\|_1^k \cdot (1 + t(a))^k \cdot \left(1 + 2^{-2r+2}\right)^k. \tag{2.22}$$
Note that by the definitions of $P$ and $f$,
$$\|f\|_1 = \mathop{\mathbb{E}}_{x' \in_R X}[f(x')] = \langle P, P_{x|s}\rangle \le \langle P_{x|v}, P_{x|s}\rangle.$$
Thus, summing over all $e \in \Gamma_{\mathrm{out}}(v)$, by Equation (2.22),
$$\sum_{e \in \Gamma_{\mathrm{out}}(v)} \frac{\Pr(e)}{\Pr(v)} \cdot \langle P_{x|e}, P_{x|s}\rangle^k < \langle P_{x|v}, P_{x|s}\rangle^k \cdot \mathop{\mathbb{E}}_{a \sim P_{a_{i+1}|v}}\left[(1 + t(a))^k\right] \cdot \left(1 + 2^{-2r+2}\right)^k. \tag{2.23}$$
It remains to bound
$$\mathop{\mathbb{E}}_{a \sim P_{a_{i+1}|v}}\left[(1 + t(a))^k\right], \tag{2.24}$$
using the properties of the matrix $M$ and the bounds on the $\ell_2$ versus $\ell_1$ norms of $f$.

By Equation (2.20), the assumption that $\|f\|_1 \ge 2^{-2n}$, Equation (2.15) and Equation (2.16), we get
$$\frac{\|f\|_2}{\|f\|_1} \le 2^{3\ell+2r+p+1} \le 2^{\ell'}.$$
Since $M$ is a $(k', \ell', p)$-$L_2$-extractor with error $2^{-r'}$, there is at most $2^{-k'}$ probability mass on rows $a \in A$ with $t(a) = \frac{|\langle \overline{M}_a, f\rangle|}{p_A(a)\,\|f\|_1} \ge 2^{-r'}$ under the probability distribution $p_A$. Now, by Equation (2.18), for all $a \in A$ we have $P_{a_{i+1}|v}(a) \le 2^p \cdot p_A(a)$, thus there is at most $2^{-k'+p}$ probability mass according to the distribution $P_{a_{i+1}|v}$ on rows $a$ with $t(a) \ge 2^{-r'}$.


We bound the expectation in Equation (2.24) by splitting it into two sums:
$$\mathop{\mathbb{E}}_{a \sim P_{a_{i+1}|v}}\left[(1 + t(a))^k\right] = \sum_{a\,:\,t(a) \le 2^{-r'}} P_{a_{i+1}|v}(a) \cdot (1 + t(a))^k + \sum_{a\,:\,t(a) > 2^{-r'}} P_{a_{i+1}|v}(a) \cdot (1 + t(a))^k. \tag{2.25}$$
We bound the first sum in Equation (2.25) by $(1 + 2^{-r'})^k$. As for the second sum in Equation (2.25), we know that it is a sum over at most $2^{-k'+p}$ probability mass, and since for every $a \in A$ we have $t(a) \le 2^p - 1$, we have
$$\sum_{a\,:\,t(a) > 2^{-r'}} P_{a_{i+1}|v}(a) \cdot (1 + t(a))^k \le 2^{-k'+p} \cdot 2^{pk} \le 2^{-2r}$$
(where in the last inequality we used Equations (2.15) and (2.16)). Overall, using Equation (2.15) again, we get
$$\mathop{\mathbb{E}}_{a \sim P_{a_{i+1}|v}}\left[(1 + t(a))^k\right] \le (1 + 2^{-r'})^k + 2^{-2r} \le (1 + 2^{-2r})^{k+1}. \tag{2.26}$$
Substituting Equation (2.26) into Equation (2.23), we obtain
$$\sum_{e \in \Gamma_{\mathrm{out}}(v)} \frac{\Pr(e)}{\Pr(v)} \cdot \langle P_{x|e}, P_{x|s}\rangle^k < \langle P_{x|v}, P_{x|s}\rangle^k \cdot \left(1 + 2^{-2r}\right)^{k+1} \cdot \left(1 + 2^{-2r+2}\right)^k < \langle P_{x|v}, P_{x|s}\rangle^k \cdot \left(1 + 2^{-r}\right)^k$$
(where the last inequality uses the assumption that $r$ is sufficiently large). This completes the proof of Claim 2.5.9.

Claim 2.5.10. For every $i \in \{1, \ldots, m\}$,
$$Z'_i \le Z_{i-1} \cdot \left(1 + 2^{-r}\right)^k + \left(2^{-2n+p+1}\right)^k.$$
Proof. By Claim 2.5.9,
$$Z'_i = \sum_{e \in \Gamma_i} \Pr(e) \cdot \langle P_{x|e}, P_{x|s}\rangle^k = \sum_{v \in L_{i-1}} \Pr(v) \cdot \sum_{e \in \Gamma_{\mathrm{out}}(v)} \frac{\Pr(e)}{\Pr(v)} \cdot \langle P_{x|e}, P_{x|s}\rangle^k$$
$$\le \sum_{v \in L_{i-1}} \Pr(v) \cdot \left(\langle P_{x|v}, P_{x|s}\rangle^k \cdot \left(1 + 2^{-r}\right)^k + \left(2^{-2n+p+1}\right)^k\right)$$
$$= Z_{i-1} \cdot \left(1 + 2^{-r}\right)^k + \sum_{v \in L_{i-1}} \Pr(v) \cdot \left(2^{-2n+p+1}\right)^k$$
$$\le Z_{i-1} \cdot \left(1 + 2^{-r}\right)^k + \left(2^{-2n+p+1}\right)^k.$$

Claim 2.5.11. For every $i \in \{1, \ldots, m\}$,
$$Z_i \le Z'_i.$$
The proof is the same as the proof of Claim 2.3.11.

Claim 2.5.12. For every $i \in \{1, \ldots, m\}$,
$$Z_i \le 2^{(p+3)k+2r} \cdot 2^{-2k \cdot n}.$$

Proof. By Claim 2.5.8, $Z_0 = (2^{-2n})^k$. By Claim 2.5.10 and Claim 2.5.11, for every $i \in \{1, \ldots, m\}$,
$$Z_i \le Z_{i-1} \cdot \left(1 + 2^{-r}\right)^k + \left(2^{-2n+p+1}\right)^k.$$
Hence, for every $i \in \{1, \ldots, m\}$,
$$Z_i \le \left(2^{-2n+p+1}\right)^k \cdot m \cdot \left(1 + 2^{-r}\right)^{km}.$$
Since $m = 2^r$,
$$Z_i \le 2^{-2k \cdot n} \cdot 2^{(p+1)k} \cdot 2^r \cdot e^k \le 2^{-2k \cdot n} \cdot 2^{(p+3)k+2r}.$$


Proof of Lemma 2.5.1

We can now complete the proof of Lemma 2.5.1. Assume that $s$ is in layer-$i$ of $B$. By Claim 2.5.7,
$$Z_i \ge \Pr(s) \cdot \langle P_{x|s}, P_{x|s}\rangle^k > \Pr(s) \cdot \left(2^{2\ell} \cdot 2^{-2n}\right)^k = \Pr(s) \cdot 2^{2\ell \cdot k} \cdot 2^{-2k \cdot n}.$$
On the other hand, by Claim 2.5.12,
$$Z_i \le 2^{(p+3)k+2r} \cdot 2^{-2k \cdot n}.$$
Thus, using Equation (2.15), Equation (2.16) and the assumption $p \ge 1$, we get
$$\Pr(s) \le 2^{(p+3)k+2r} \cdot 2^{-2\ell \cdot k} \le 2^{4k'} \cdot 2^{-(2\gamma^2/3) \cdot \frac{k'}{p} \cdot \ell'}.$$
Recall that we assumed that the width of $B$ is at most $2^{ck'\ell'/p}$ for some constant $c < 2/3$, and that the length of $B$ is at most $2^r$. Recall that we fixed $\gamma$ such that $2\gamma^2/3 > c$. Taking a union bound over at most $2^r \cdot 2^{ck'\ell'/p} \le 2^{k'} \cdot 2^{ck'\ell'/p}$ significant vertices of $B$, we conclude that the probability that $T$ reaches any significant vertex is at most
$$2^{-(2\gamma^2/3) \cdot k'\ell'/p} \cdot 2^{ck'\ell'/p} \cdot 2^{5k'} \le 2^{-k'} \le 2^{-r},$$
where the first inequality uses the assumption $\ell' \ge p \cdot \frac{6}{(2\gamma^2/3) - c}$. This completes the proof of the lemma.

2.5.5 Applications

The generalized model allows us to prove time-space lower bounds for learning problems with longer outputs. [BGY18] proves memory-sample lower bounds for learning low degree polynomials over $\mathbb{F}_p$, where $p$ is a prime, which is a learning problem with longer output.

The learning problem with longer outputs is defined as follows: A learner tries to learn $x$ (uniformly chosen from $X$) from a stream of samples, $(a_1, b_1), (a_2, b_2), \ldots$, where for every $i$, $a_i$ is uniformly distributed over $A$ and $b_i \in B$ is a function $f$ of $a_i$ and $x$, where $|B| = 2^p$. Note that in the standard model, defined in Section 2.1, $p = 1$.

The matrix $M : (A \times B) \times X \to [0,1]$ corresponding to the above learning problem is defined as
$$M((a,b), x) = \begin{cases} 0 & \text{if } f(a,x) \neq b \\ \frac{1}{|A|} & \text{if } f(a,x) = b \end{cases}$$
Recall that the associated joint distribution is defined as
$$p_{(A,B),X}((a,b), x) = 2^{-n} \cdot M((a,b), x),$$
and the normalized matrix $\overline{M} : (A \times B) \times X \to [-1,1]$ is defined as
$$\overline{M}((a,b), x) = \begin{cases} -\frac{1}{|A|} \cdot \Pr_{x' \in_R X}[f(a,x') = b] & \text{if } f(a,x) \neq b \\ \frac{1}{|A|} - \frac{1}{|A|} \cdot \Pr_{x' \in_R X}[f(a,x') = b] & \text{if } f(a,x) = b \end{cases}$$

Next, we give an example of such a bound. We can prove that learning from linear equations over a field $\mathbb{F}$ (that is, given a secret $x \in \mathbb{F}^n$, learning from samples $(a_1, b_1), \ldots \in \mathbb{F}^{n+1}$ such that $a_i$ is uniformly distributed over $\mathbb{F}^n$ and $\langle a_i, x\rangle = b_i$) requires either $\Omega(n^2 \log |\mathbb{F}|)$ memory size or $2^{\Omega(n \log |\mathbb{F}|)}$ samples. Note that we can learn by storing the first $\approx n$ samples, implying a tight $n^2 \log |\mathbb{F}|$ upper bound on the memory. The lower bound follows from the corresponding learning matrix $M$ being an $(\Omega(n \log |\mathbb{F}|), \Omega(n \log |\mathbb{F}|), \log |\mathbb{F}|)$-$L_2$-Extractor. In fact, this follows from the associated matrix $M'((a,b), x) = \frac{M((a,b),x)}{p_{(A,B)}((a,b))} - 1$ having low spectral norm ($\le |\mathbb{F}| \cdot 2^{\frac{n \log |\mathbb{F}|}{2}}$). To see it, note that the matrix $M'$ satisfies $(M')^t \cdot (M') = c \cdot I$, where $c = |\mathbb{F}|^n \cdot |\mathbb{F}| \cdot (|\mathbb{F}| - 1)$ and $I$ is the identity matrix on $X$.
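The trivial upper bound mentioned above can be made concrete with a short sketch (an illustration only; the field is taken to be $\mathbb{F}_2$ and all names are assumptions made for the example): store samples until the $a_i$ span $\mathbb{F}_2^n$ (about $n$ of them, i.e., roughly $n^2$ bits of memory, matching $n^2\log|\mathbb{F}|$ with $\log|\mathbb{F}| = 1$) and solve the linear system by Gaussian elimination.

```python
# Illustrative sketch of the memory-n^2 learner for linear equations over F_2:
# keep the first ~n samples and solve <a_i, x> = b_i.
import random

def solve_f2(samples):
    """Gauss-Jordan elimination over F_2; returns x, or None if rank is not yet full."""
    rows = [list(a) + [b] for a, b in samples]
    n = len(rows[0]) - 1
    used = set()
    for col in range(n):
        piv = next((r for r in rows if r[col] == 1 and id(r) not in used), None)
        if piv is None:
            return None                       # need more (independent) equations
        used.add(id(piv))
        for r in rows:
            if r is not piv and r[col] == 1:
                for j in range(n + 1):
                    r[j] ^= piv[j]
    x = [0] * n
    for r in rows:
        lead = next((j for j in range(n) if r[j] == 1), None)
        if lead is not None:
            x[lead] = r[n]
    return x

rng = random.Random(0)
n = 8
secret = [rng.randrange(2) for _ in range(n)]
samples, guess = [], None
while guess is None:
    a = [rng.randrange(2) for _ in range(n)]
    b = sum(ai & xi for ai, xi in zip(a, secret)) % 2
    samples.append((a, b))
    guess = solve_f2(samples)
print(guess == secret, len(samples))
```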


Have no fear of perfection; you’ll never

reach it.

Marie Curie

3 Two-Pass Space-Bounded Learning Algorithms

The results in this chapter are based on joint work with Ran Raz and Avishay Tal [GRT19]. In

this chapter, we generalize the memory-sample tradeoffs to when the learner is allowed a second

pass over the stream of samples.

As discussed before, a large number of recent works studied the problem of proving memory-

samples lower bounds for learning [Sha14, SVW16, Raz16, VV16, KRT17, MM17, Raz17,

MM18, MT17, BGY18, GRT18, DS18]. The motivation for studying this question comes

from learning theory, computational complexity and cryptography (see for example the dis-


cussion and references in [Sha14, SVW16, Raz16, VV16, KRT17, MT17]). [SVW16] con-

jectured that any algorithm for learning parities of size n requires either a memory of size

Ω(n2) or an exponential number of samples. This conjecture was proven in [Raz16], fol-

lowed by a line of works that showed that for a large number of learning problems, any learn-

ing algorithm requires either super-linear memory size or a super-polynomial number of sam-

ples [KRT17, Raz17, MM18, BGY18, GRT18].

All previous memory-samples lower bounds (in the regime where the lower bound on the

memory size is super-linear and the lower bound on the number of samples is super-polynomial)

modeled the learning algorithm by a one-pass branching program, allowing only one pass over the

stream of samples. In this work [GRT19], we prove the first such results when two passes over

the stream of samples are allowed. (We remark that we leave open the question of handling more

than two passes. While some parts of the current proof naturally extend tomore than two passes,

others are more delicate.)

3.0.1 Main Results of this Chapter

As in [Raz17, BGY18, GRT18], we represent a learning problem by a matrix. Let X, A be two

finite sets of size larger than 1 (whereX represents the concept-class that we are trying to learn and

A represents the set of possible samples). LetM : A× X → {−1, 1} be a matrix. The matrixM

represents the following learning problem: An unknown element x ∈ X was chosen uniformly

at random. A learner tries to learn x from a stream of samples, (a1, b1), (a2, b2) . . ., where for

every i, ai ∈ A is chosen uniformly at random and bi = M(ai, x).

Wemodel the learner for the learning problem that corresponds to thematrixM, by a two-pass

ordered branching program (Definition 3.1.2). Such a program reads the entire stream of samples

twice, in the exact same order. Roughly speaking, the model allows a learner with infinite com-

putational power, and bounds only the memory size of the learner and the number of samples

used.

As in [GRT18], the result in this chapter is stated in terms of the properties of thematrixM as


a two-source extractor. Two-source extractors, first studied by Santha and Vazirani [SV84] and

Chor and Goldreich [CG88], are central objects in the study of randomness and derandomiza-

tion. As in [GRT18], the results hold whenever the matrix M has (even relatively weak) two-

source extractor properties.

Roughly speaking, the main result can be stated as follows: Assume that $k, \ell, r$ are such that any submatrix of $M$ of at least $2^{-k} \cdot |A|$ rows and at least $2^{-\ell} \cdot |X|$ columns has a bias of at most $2^{-r}$. Then, any two-pass learning algorithm for the learning problem corresponding to $M$ requires either a memory of size at least $\Omega\!\left(k \cdot \min\{k, \sqrt{\ell}\}\right)$, or at least $2^{\Omega(\min\{k, \sqrt{\ell}, r\})}$ samples.

Formally, the result is stated in Theorem 3.2.1 in terms of the properties ofM as an L2-Extractor

(Definition 3.1.1), a notion that was defined in [GRT18] and (as formally proved in [GRT18])

is closely related to the notion of two-source extractor. (The two notions are equivalent up to

small changes in the parameters.)

As in Chapter 2 and [GRT18], the main result can be used to prove (two-pass) memory-samples lower bounds for many of the problems that were previously studied in this context: for example, learning parities, sparse parities, DNFs, decision trees, random matrices, error correcting codes, etc. In particular, the main result implies that any two-pass algorithm for learning parities of size $n$ requires either a memory of size $\Omega(n^{1.5})$ or at least $2^{\Omega(\sqrt{n})}$ samples.

RelatedWork To the best of our knowledge, the only previous work that provedmemory-

samples lower bounds for more than one pass over the stream of samples, is the intriguing recent

work of Dagan and Shamir [DS18]. We note however that their results apply for a very different

setting and regime of parameters, where the obtained lower bound on the number of samples is

at most polynomial in the dimension of the problem. (Their result is proved in a very different

setting, where the samplesmaybe noisy, and the lower boundobtained on the number of samples

is at most the product of the length of one sample times one over the information given by each

sample).


Motivation and Discussion Many previous works studied the resources needed for

learning, under certain information, communication or memory constraints (see in particu-

lar [Sha14, SVW16, Raz16, VV16, KRT17, MM17, Raz17, MM18, MT17, BGY18, GRT18,

DS18] and the many references given there). A main message of some of these works is that for

some learning problems, access to a relatively large memory is crucial. In other words, in some

cases, learning is infeasible, due to memory constraints. Most of these works apply to bounded-

memory learning algorithms that consider the samples one by one, with only one pass over the

samples. In many practical situations, however, more than one pass over the samples is used, so

it’s desirable to extend these results to more than one pass over the samples.

From the point of view of computational complexity, the problem of extending these works

to more than one pass over the samples is fascinating and challenging. It’s a common practice

in streaming-complexity to consider more than one pass over the inputs, and in computational

complexity read-k-times branching programs have attracted a lot of attention.

We note that by Barrington’s celebrated result, any function in NC can be computed by a

polynomial-length branching program of width 5 [Bar89]. Hence, proving super-polynomial

lower bounds on the time needed for computing a function, by a branching program of width 5,

with polynomially many passes over the input, would imply super-polynomial lower bounds for

formula size, and is hence a very challenging problem.

Finally, let us mention that technically, allowing more than one pass over the samples is very

challenging, as all previous techniques are heavily based on the fact that in the one-pass case all the

samples are independent and hence at each time step, the learning algorithm has no information

about the next sample that it is going to see.

Techniques Theproof builds on theworks of [Raz17,GRT18] that gave a general technique

for provingmemory-samples lower bounds for learning problems. However, these works (as well

as all other previous works that prove memory-samples lower bounds in this regime of parame-

ters) are heavily based on the fact that in the one-pass case all the samples are independent and


hence at each time step, the learning algorithm has no information about the next sample that

it is going to see. Roughly speaking, the proofs of [Raz17, GRT18] bound the L2-norm of the

distribution of x, conditioned on reaching a given vertex v of the branching program, but they

rely on the fact that the next sample is independent of x. Once one allows more than one pass

over the stream of samples, the assumption that the next sample is independent of x doesn’t hold,

as in the second pass the vertex may remember a lot of information about the joint distribution

of x and a1, . . . , am.

Roughly speaking, [Raz17, GRT18] considered the computation-path of the branching pro-

gram and defined “stopping-rules”. Intuitively, the computation stops if certain “bad” events

occur. The proofs show that each stopping rule is only applied with negligible probability and

that conditioned on the event that the computation didn’t stop, the L2-norm of the distribution

of x, conditioned on reaching a vertex v of the branching program, is small (which implies that

the program didn’t learn x).

When more than one pass over the samples is allowed, there is a serious problem with this

approach. After one pass, a vertex of the branching program has joint information on x and

a1, . . . , am. If we only keep track of the distribution of x conditioned on that vertex, it could be

the case that the next sample completely reveals x. One conceptual problem seems to be that the

second part of the program (that is, the part that is doing the second pass) is not aware of what

the first part did. An idea that turned out to be very important in the proof is to take the second

part to be the product of the first and second part, so that, in some sense, the second part of the

computation runs its own copy of the first part. In addition, we have each vertex in the second

part remembering the vertex reached at the end of the first part.

As in [Raz17, GRT18], we define stopping rules for the computation-path and we prove that

the probability that the computation stops is small. We then analyze each part separately, as a read

once program. For each part separately, we prove that conditioned on the event that the program

didn’t stop, the L2-norm of the distribution of x, conditioned on reaching a vertex v, is small. It

turns out that since the second part of the program runs its own copy of the first part, the analysis


of each part separately is sufficient.

We note, however, that the entire proof is completely different than [Raz17, GRT18]. The

stopping rules are different and are defined differently for each part. The proof that the compu-

tation stops with low probability is much more delicate and complicated. The main challenge is

that when analyzing the probability to stop on the second part, we cannot ignore the first part

and we need to prove that we stop with low probability on the second part, when starting from

the start vertex of the first part (that is, the start vertex of the entire program). This turns out to be

very challenging and, in particular, requires a use of the results for one-pass branching programs.

A proof outline is given in Section 3.3.

3.1 Preliminaries

First, we recall certain preliminaries from Section 1.1. For a randomvariableZ and an eventE, we

denote by PZ the distribution of the random variablesZ, and we denote by PZ|E the distribution

of the random variable Z conditioned on the event E. We will sometimes take probabilities and

expectations, conditioned on events E that may be empty. We think of these probabilities and

expectations as 0, when the event E is empty.

3.1.1 Learning Problem

We represent a learning problem by amatrix as in Section 1.1.1. LetX,A be two finite sets of size

larger than 1 (where X represents the concept-class that we are trying to learn and A represents

the set of possible samples). LetM : A × X → {−1, 1} be a matrix. The matrixM represents

the following learning problem: An unknown element x ∈ Xwas chosen uniformly at random.

A learner tries to learn x from a stream of samples, (a1, b1), (a2, b2) . . ., where for every i, ai ∈ A

is chosen uniformly at random and bi = M(ai, x). The learner is allowed a second pass over the

samples. Let n = log |X| and n′ = log |A|.


3.1.2 Norms and Inner Products

Let $p \ge 1$. For a function $f : X \to \mathbb{R}$, denote by $\|f\|_p$ the $L_p$ norm of $f$ with respect to the uniform distribution over $X$, that is:
$$\|f\|_p = \left(\mathop{\mathbb{E}}_{x \in_R X}\left[|f(x)|^p\right]\right)^{1/p}.$$
For two functions $f, g : X \to \mathbb{R}$, define their inner product with respect to the uniform distribution over $X$ as
$$\langle f, g\rangle = \mathop{\mathbb{E}}_{x \in_R X}[f(x) \cdot g(x)].$$
For a matrix $M : A \times X \to \mathbb{R}$ and a row $a \in A$, we denote by $M_a : X \to \mathbb{R}$ the function corresponding to the $a$-th row of $M$. Note that for a function $f : X \to \mathbb{R}$, we have
$$\langle M_a, f\rangle = \frac{(M \cdot f)_a}{|X|}.$$

3.1.3 L2-Extractors

Definition 3.1.1. $L_2$-Extractor: Let $X$, $A$ be two finite sets. A matrix $M : A \times X \to \{-1,1\}$ is a $(k,\ell)$-$L_2$-Extractor with error $2^{-r}$ if for every non-negative $f : X \to \mathbb{R}$ with $\frac{\|f\|_2}{\|f\|_1} \le 2^{\ell}$ there are at most $2^{-k} \cdot |A|$ rows $a$ in $A$ with
$$\frac{|\langle M_a, f\rangle|}{\|f\|_1} \ge 2^{-r}.$$

3.1.4 Computational Model

In the following definition, we model the learner for the learning problem that corresponds to

thematrixM, by a branching program. We consider a q-pass ordered branching program. Such a

program reads the entire input q times, in the exact same order. That is, the program has q parts

(that are sequential in time). Each part reads the same stream in the exact same order. The main

result is proved for two-pass ordered branching programs, that is, for the case q = 2.


Definition 3.1.2. q-Pass Branching Program for a Learning Problem: A q-pass (ordered)

branching program of length q · m and width d, for learning, is a directed (multi) graph with

vertices arranged in qm + 1 layers containing at most d vertices each. In the first layer, that we

think of as layer 0, there is only one vertex, called the start vertex. A vertex of outdegree 0 is called

a leaf. All vertices in the last layer are leaves (but there may be additional leaves). Every non-leaf

vertex in the program has 2|A| outgoing edges, labeled by elements (a, b) ∈ A × {−1, 1}, with

exactly one edge labeled by each such (a, b), and all these edges going into vertices in the next layer.

Each leaf v in the program is labeled by an element x(v) ∈ X, that we think of as the output of the

program on that leaf.

Computation-Path: The samples (a1, b1), . . . , (am, bm) ∈ A×{−1, 1} that are given as input

define a computation-path in the branching program, by starting from the start vertex and following

at step (j− 1) ·m+ i the edge labeled by (ai, bi) (where j ∈ [q] and i ∈ [m]), until reaching a leaf.

The program outputs the label $\tilde{x}(v)$ of the leaf $v$ reached by the computation-path.

Success Probability: The success probability of the program is the probability that $\tilde{x} = x$, where $\tilde{x}$ is the element that the program outputs, and the probability is over $x, a_1, \ldots, a_m$ (where $x$ is uniformly distributed over $X$ and $a_1, \ldots, a_m$ are uniformly distributed over $A$, and for every $i$, $b_i = M(a_i, x)$).

Remark: We will sometimes consider branching programs in which the leaves are not labeled,

and hence the program doesn’t return any value. It will be convenient to refer to such objects also as

branching programs. In particular, we will view a part of the branching program (e.g., the first few

layers of a program) also as a branching program.

We think of the program as composed of q parts, where for every j ∈ [q], part-j contains layers

{(j− 1) ·m+ i}i∈[m].

For convenience, we think of each vertexu of the branching program as having a smallmemory

Su that contains some information about the path that led to the vertex, that the vertex “remem-

bers” (or “records”). Formally, this means that in the actual branching program the vertex u is

split into distinct vertices u1, . . . , ud(u), according to the content of the memory Su. Adding in-

90

Page 101: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

formation to Su means that the vertex u is further split into distinct vertices, according to the

content of the information that was added. Thus, when we refer to a vertex u of a program, we

mean, a vertex u plus content of the memory Su.

For this result, we will have the property that whenever we add some information to themem-

ory of a vertex u, that information is never removed/forgotten. That is, information that was

added to the memory of u, remains in the memory of all the vertices that can be reached from u.

As mentioned above, we focus on the case q = 2. We denote by v0 the start vertex of the

program and by v1 the vertex reached at the end of the first part, that is, layer-m. Note that v1 is a

random variable that depends on x, a1, . . . , am.

3.1.5 Product of Programs

Intuitively, the product of two branching programs is a branching program that runs both pro-

grams in parallel.

Definition 3.1.3. Product of One-Pass Branching Programs: Let B,B′ be two one-pass branch-

ing programs for learning, of length m and widths d, d′, respectively. The product B× B′ is a (one-

pass) branching program of length m and width d · d′, as follows: For every i ∈ {0, . . . ,m} and

vertices v in layer-i of B and v′ in layer-i of B′, we have a vertex (v, v′) in layer-i of B×B′. For every

two edges: (u, v) from layer-(i − 1) to layer-i of B and (u′, v′) from layer-(i − 1) to layer-i of B′,

both labeled by the same (a, b), we have in B× B′ an edge ((u, u′), (v, v′)) labeled by (a, b).

The label of a leaf (v, v′) is the label given by the second program B′. The content of the memory

S(v,v′) of a vertex (v, v′) is the concatenation of the content of Sv and the content of Sv′ .

Remark: We will use this definition also in cases where the leaves of B and/or B′ are not labeled

(that is, where B and/or B′ do not output any value; see a remark in Definition 3.1.2).

3.2 Main Result

Fix k, ℓ, r ∈ N, such that rk ,

rℓare smaller than a sufficiently small constant and k < n′, ℓ < n.

Let ε > 0 be a sufficiently small constant. In particular, we assume that ε is sufficiently smaller

91

Page 102: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

than all other constants that we discuss, say, ε < 11010 . We assume that n, n′ are sufficiently large.

Let

ℓ = min{k,√ℓ}.

Let

r = min{

r100 ,

ℓ100

}. (3.1)

We assume that

r > 100 ·max {log n, log n′} . (3.2)

We assume thatM is a (10k, 10ℓ)-L2-extractor with error 2−10r.

Theorem 3.2.1. Let X, A be two finite sets. Let n = log2 |X| and n′ = log2 |A|. Fix k, ℓ, r ∈ N,

such that, rk ,

rℓ< 1

100 , and k < n′, ℓ < n. Let ε > 0 be a sufficiently small constant, say, ε < 11010 .

Assume that n, n′ are sufficiently large. Let

ℓ = min{k,√ℓ}.

Let

r = min{

r100 ,

ℓ100

}.

Assume that

r > 100 ·max {log n, log n′} .

Let M : A × X → {−1, 1} be a matrix which is a (10k, 10ℓ)-L2-extractor with error 2−10r.

Let B be a two-pass ordered branching program of length 2 ·m, where m is at most 2εr, and width

at most d = 2εkℓ/10, for the learning problem that corresponds to the matrix M. Then, the success

probability of B is at most 1100 + o(1).

92

Page 103: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

3.3 Overview of the Proof

One-Pass Learners

We will start with giving a short outline of the proof of [Raz17, GRT18] for one-pass learners.

Assume thatM is a (10k, 10ℓ)-L2-extractor with error 2−10r, where r < k, ℓ. Let B be a one-pass

branching program for the learning problem that corresponds to the matrix M. Assume for a

contradiction that B is of lengthm = 2εr and width d = 2εkℓ, where ε is a small constant.

We define the truncated-path, T , to be the same as the computation-path of B, except that it

sometimes stops before reaching a leaf. Roughly speaking,T stops before reaching a leaf if certain

“bad” events occur. Nevertheless, we show that the probability thatT stops before reaching a leaf

is negligible, so we can think of T as almost identical to the computation-path.

For a vertex vofB, we denote byEv the event thatT reaches the vertex v. Wedenote byPr(v) =

Pr(Ev) the probability for Ev (where the probability is over x, a1, . . . , am), and we denote by

Px|v = Px|Ev the distribution of the random variable x conditioned on the event Ev. Similarly, for

an edge e of the branching program B, let Ee be the event that T traverses the edge e. Denote,

Pr(e) = Pr(Ee), and Px|e = Px|Ee .

A vertex v of B is called significant if

∥Px|v∥2 > 2ℓ · 2−n.

Roughly speaking, this means that conditioning on the event that T reaches the vertex v, a non-

negligible amount of information is known about x. In order to guess x with a non-negligible

success probability,T must reach a significant vertex. We show that theprobability thatT reaches

any significant vertex is negligible, and thus the main result follows.

To prove this, we show that for every fixed significant vertex s, the probability that T reaches

s is at most 2−Ω(kℓ) (which is smaller than one over the number of vertices in B). Hence, we can

use a union bound to prove the bound.

93

Page 104: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

The proof that the probability that T reaches s is extremely small is themain part of the proof.

To that end, we use the following functions to measure the progress made by the branching pro-

gram towards reaching s.

Let Li be the set of vertices v in layer-i of B, such that Pr(v) > 0. Let Γi be the set of edges e

from layer-(i− 1) of B to layer-i of B, such that Pr(e) > 0. Let

Zi =∑v∈Li

Pr(v) · ⟨Px|v,Px|s⟩k,

Z ′i =∑e∈Γi

Pr(e) · ⟨Px|e,Px|s⟩k.

We think ofZi,Z ′i as measuring the progress made by the branching program, towards reaching

a state with distribution similar to Px|s.

We show that eachZimayonly benegligibly larger thanZi−1. Hence, since it’s easy to calculate

thatZ0 = 2−2nk, it follows thatZi is close to 2−2nk, for every i. On the other hand, if s is in layer-i

thenZi is at leastPr(s) · ⟨Px|s,Px|s⟩k. Thus,Pr(s) · ⟨Px|s,Px|s⟩k cannot bemuch larger than 2−2nk.

Since s is significant, ⟨Px|s,Px|s⟩k > 2ℓk · 2−2nk and hencePr(s) is at most 2−Ω(kℓ).

The proof thatZi may only be negligibly larger thanZi−1 is done in two steps. We show by a

simple convexity argument thatZi ≤ Z ′i . Thehardpart is to prove thatZ ′i mayonly benegligibly

larger thanZi−1.

For this proof, we define for every vertex v, the set of edges Γout(v) that are going out of v, such

that Pr(e) > 0 and show that for every vertex v,

∑e∈Γout(v)

Pr(e) · ⟨Px|e,Px|s⟩k

may only be negligibly higher than

Pr(v) · ⟨Px|v,Px|s⟩k.

94

Page 105: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

For this proof, we consider the function Px|v · Px|s. We first show how to bound ∥Px|v · Px|s∥2.

We then consider two cases: If ∥Px|v ·Px|s∥1 is negligible, then ⟨Px|v,Px|s⟩k is negligible and doesn’t

contributemuch, andwe show that for every e ∈ Γout(v), ⟨Px|e,Px|s⟩k is also negligible anddoesn’t

contribute much. If ∥Px|v · Px|s∥1 is non-negligible, we use the bound on ∥Px|v · Px|s∥2 and the

assumption thatM is a (10k, 10ℓ)-L2-extractor to show that for almost all edges e ∈ Γout(v), we

have that ⟨Px|e,Px|s⟩k is very close to ⟨Px|v,Px|s⟩k. Only an exponentially small (2−k) fraction of

edges are “bad” and give a significantly larger ⟨Px|e,Px|s⟩k.

The reason that in the definitions of Zi and Z ′i we raised ⟨Px|v,Px|s⟩ and ⟨Px|e,Px|s⟩ to the

power of k is that this is the largest power for which the contribution of the “bad” edges is still

small (as their fraction is 2−k).

This outline oversimplifies many details. Let us briefly mention two of them. First, it is not so

easy to bound ∥Px|v · Px|s∥2. We do that by bounding ∥Px|s∥2 and ∥Px|v∥∞. In order to bound

∥Px|s∥2, we forceT to stopwhenever it reaches a significant vertex (and thuswe are able to bound

∥Px|v∥2 for every vertex reached by T ). In order to bound ∥Px|v∥∞, we force T to stop whenever

Px|v(x) is large, which allows us to consider only the “bounded” part of Px|v. (This is related

to the technique of flattening a distribution that was used in [KR13]). Second, some edges are

so “bad” that their contribution to Z ′i is huge so they cannot be ignored. We force T to stop

before traversing any such edge. (This is related to an idea that was used in [KRT17] of analyzing

separately paths that traverse “bad” edges). We show that the total probability thatT stops before

reaching a leaf is negligible.

Thus, in [Raz17, GRT18] there are three stopping rules: We stop if we reach a significant

vertex. We stop if we have a bad edge and we stop if x is a significant-value of Px|v, that is, if

Px|v(x) is too large.

Two-Pass Learners

Let us now give a short outline of the additional ideas in the proof for two-pass learners. Let B

be a two-pass branching program for the learning problem that corresponds to the matrix M.

95

Page 106: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

We denote by v0 the starting vertex of the program and by v1 the vertex reached at the end of the

first part. We assume without loss of generality that the answers are given in the last layer of the

program.

We update the second part so that every vertex v in the second part “remembers” v1. This

information is stored in thememory Sv. Formally, this means that starting from every possible v1,

we have a separate copy of the entire second part of the program. We then change the second part

so that it is now theproduct (see definition 3.1.2) of the first part and the secondpart. Intuitively,

this means that the second part runs a copy of the first part of the computation, in parallel to its

own computation.

As in [Raz17, GRT18], we define the truncated-path, T , to be the same as the computation-

path of the new branching program, except that it sometimes stops before reaching a leaf.

Roughly speaking, T stops before reaching a leaf if certain “bad” events occur. Nevertheless,

we show that the probability that T stops before reaching a leaf is small, so we can think of T as

essentially identical to the computation-path. The decision of whether or not T stops on a given

vertex v in layer-i of part-jwill depend on v, Sv, x, ai+1. For that reason, we are able to consider the

path T , starting from any vertex v (without knowing the history of the path that led to v, except

for the information stored in Sv).

Let v be a vertex in the second part of the program (where an answer should be given). The

vertex v remembers (in Sv) the vertex v1. We denote by v1 → v the event that the path T that

starts from v1 reaches v (where v1 is the vertex at the end of the first part of the program that v

remembers, and the event is over x, a1, . . . , am). We denote by v0 → v the event that the path

T that starts from the start vertex v0 reaches v. More generally, for two vertices w1,w2 in the

program, we denote by w1 → w2 the event (over x, a1, . . . , am) that the path T that starts from

w1 reaches w2.

Let vbe a vertex in the last layer of the program, such thatPr(v0 → v) > 0. Since v remembers

v1, the event v0 → v is equivalent to v0 → v1 → v (where v1 is the vertex remembered by v). Since

the second part of the program runs a copy of the first part and since v is in the last layer, the event

96

Page 107: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

v1 → v implies the event v0 → v1. Thus, the event v0 → v is equivalent to v1 → v.

Moreover, this is true when conditioning on x, and hence,

Px|v0→v = Px|v1→v

and

Pr[v0 → v] = Pr[v1 → v].

This is a crucial point as it means that

∥Px|v0→v∥2 = ∥Px|v1→v∥2,

that is, if we bound ∥Px|v1→v∥2 we also get a bound on ∥Px|v0→v∥2.

The bound on ∥Px|v0→v∥2 is what we really need because if this is small then the program can-

not answer correctly. On the other hand, the bound on ∥Px|v1→v∥2 is easier to obtain as it is a

bound for a one-pass branching program. Thus, all we need is a bound on ∥Px|v1→v∥2, which is

a bound for a one-pass branching program, and we already know how to obtain bounds on the

conditional distribution for one-pass programs.

Things, however, are not so simple, as we need to prove that T stops with small probability,

when starting from v0, rather than v1. The main problemwith using the previous stopping rules

(in the second part of the program) is that it’s impossible to prove that we stop on a bad edgewith

negligible probability (as demonstrated next). Roughly speaking, we say that an edge e = (u, v)

is “bad” if the equation on it splits the distribution Px|u in a biased way. That is, a good edge is

one where roughly half the probability mass of Px|u satisfies the equation on the edge e. If the

program stores in memory the i-th sample from the first pass, then in the i-th step of the second

pass, an edge e = (u, v)will definitely be bad, since it will not split the distribution Px|u evenly.

For that reason, we change the bad-edges stopping rule. We say that an edge (v, u), labelled

by (a, b), is of high probability if the probability to sample a, conditioning on reaching v from v0

97

Page 108: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

(that is, reaching v from the starting vertex of the entire program) is large. The third stopping rule

is changed so that T doesn’t stop on a bad edge if it is of high probability. Instead, if T traverses

such an edge, we “remember” the time step inwhichT traversed that edge, in all the future. That

is, we enter the index i to Su (and remember it in all the future, until the end of the program).

In addition, we add a stopping rule that stops if the edge is “very-bad” and a stopping rule that

stops if the number of indices in Sv is too large, that is, if the number of high-probability edges

that were already traversed is too large (intuitively, Sv won’t be too large because of the bounded

memory size).

We analyze separately the probability to stop because of each stopping rule. The main chal-

lenge is that we need to analyze these probabilities when starting from v0, that is, when running

a two-pass program. These proofs are technically hard, but the main reason that we manage to

analyze these probabilities is the following:

Recall that the second part of the program runs a copy of the first part of the program. Thus,

a vertex v in layer-i of the second part has a corresponding vertex v′ in layer-i of the first part, such

that, if the pathT reached v it previously reached v′. Recall also that v remembers v1, so if the path

T reached v it previously reached v1. Thus, the event v0 → v is equivalent to v0 → v′ → v1 → v,

that is, the event

(v0 → v′) ∧ (v′ → v1) ∧ (v1 → v).

Since the second part of the program runs a copy of the first part, the event v1 → v implies the

event v0 → v′. Hence, the event v0 → v is equivalent to the event

(v′ → v1) ∧ (v1 → v).

Note that v′ is in layer-i of the first part and v is in layer-i of the second part, and from layer-i of

the first part to layer-i of the second part, the program is a one-pass program and is hence easier

to analyze.

98

Page 109: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

3.4 Proof ofMain Theorem

In this section, we prove Theorem 3.2.1. Assume that we have a two-pass ordered branching

program, B, for the learning problem that corresponds to thematrixM. We assume without loss

of generality that the output is given in the last layer. Assume that the length of the program is

2 ·m, wherem is at most 2εr and the width of the program is at most d = 2εkℓ/10. We will show

that the success probability of B is at most 1100 + o(1).

Let

ℓ1 =ℓ100

and

ℓ2 = ℓ.

3.4.1 The Truncated Path

Below, wewill make some changes in the branching programB. Wewill denote by B the resulting

branching program. Let v0 be the start vertex of B. We will denote by v1 the vertex reached at the

end of the first part of B. Note that v1 is a random variable, that depends on x, a1, . . . , am.

In the resulting branchingprogram B, wewill have the property that the vertex v1 reached at the

end of the first part of the program is remembered by every future vertex v. That is, every vertex

v, in the second part of the program, remembers which vertex the path that led to v reached, at

the end of the first part of the program. Formally, this information is stored in Sv.

Below, we will define the truncated-path, T , to be the same as the computation-path of the

new branching program, except that it sometimes stops before reaching a leaf. Roughly speak-

ing, T stops before reaching a leaf if certain “bad” events occur. Nevertheless, we show that the

probability that T stops before reaching a leaf is small, so we can think of T as essentially iden-

tical to the computation-path. The decision of whether or not T stops on a given vertex v in

layer-i of part-j will depend on v, Sv, x, ai+1. For that reason, we are able to consider the path T ,

starting from any vertex v (without knowing the history of the path that led to v, except for the

99

Page 110: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

information stored in Sv).

Let vbe a vertex in the secondpart of the program. The vertex v remembers (in Sv) the vertex v1.

We denote by v1 → v the event that the path T that starts from v1 reaches v (where v1 is the vertex

at the end of the first part of the program that v remembers, and the event is over x, a1, . . . , am).

More generally, for two vertices w1,w2 in the program, we denote by w1 → w2 the event (over

x, a1, . . . , am) that the path T that starts from w1 reaches w2. In particular, v0 → v is the event

that the path T that starts from the start vertex v0 reaches v.

We change the original branching program B as follows:

First Part:

We define stopping rules for the first part as defined in [Raz17, GRT18] for one-pass programs,

as if the first part were the entire program. Next, we describe these rules formally.

Significant Vertices

We say that a vertex v in layer-i of the first part of the program is significant if

∥Px|v0→v∥2 > 2ℓ1 · 2−n.

Significant Values

Even if v is not significant, Px|v0→v may have relatively large values. For a vertex v in layer-i of the

first part of the program, denote by Sig(v) the set of all x′ ∈ X, such that,

Px|v0→v(x′) > 24ℓ · 2−n.

100

Page 111: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Bad Edges

For a vertex v in layer-i of the first part of the program, denote by Bad(v) the set of all a ∈ A,

such that, ∣∣(M · Px|v0→v)(a)∣∣ ≥ 2−2r.

The Truncated-Path T on the First Part

We define T on the first part, by induction on the layers. Assume that we already defined T

until it reaches a vertex v in layer-i of the first part. The path T stops on v if (at least) one of the

following occurs:

1. v is significant.

2. x ∈ Sig(v).

3. ai+1 ∈ Bad(v).

Otherwise, (unless i = m) T proceeds by following the edge labeled by (ai+1, bi+1) (same as the

computational-path).

Second Part:

We denote by v1 the vertex in layer-m (that is, the last layer of the first part of the program) that is

reached by T . Note that v1 is a random variable that depends on x, a1, . . . , am. We denote by d1

the number of vertices in layer-m. We assume without loss of generality that each vertex in layer-

m is reached with probability of at least 2−10r ·d−11 , as vertices reached with negligible probability

can be ignored. Formally, if we reach a vertex in layer-m, such that, the probability to reach that

vertex is smaller than 2−10r · d−11 , the path T stops.

We update the second part so that every vertex v in the second part “remembers” v1. This

information is stored in thememory Sv. Formally, this means that starting from every possible v1,

we have a separate copy of the entire second part of the program.

101

Page 112: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

We then change the second part so that it is now the product (see definition 3.1.2) of the first

part (after it was changed as described above) and the second part. Intuitively, this means that the

second part runs a copy of the first part of the computation, in parallel to its own computation.

Next, we define stopping rules for the secondpart, by induction over the layers, and at the same

time (by the same induction), we also define for each vertex v, a listLv of indices i1, . . . , id(v) ∈ [m]

that the vertex v remembers (that is, the list Lv is stored in the memory Sv). Once an index was

added to Lv, it is remembered in all the future, that is, for every vertex u reached from v (in the

second part of the program), we have Lv ⊆ Lu. Note that the stopping rules are defined for the

updated second part (as described above).

The stopping rules for the second part extend the stopping rules in the case of one-pass pro-

grams, as defined in [Raz17, GRT18], as if the second part were the entire program, with start-

ing vertex v1. However, the third stopping rule (bad edges) is now different. We say that an

edge (v, u), labelled by (a, b), is of high probability if the probability to sample a, condition-

ing on reaching v from v0 (that is, reaching v from the starting vertex of the entire program) is

larger than 2k · 2−n′ . That is, if v is in layer-i of the second part, (v, u) is of high probability if

Pr[ai+1 = a | v0 → v] ≥ 2k · 2−n′ . The third stopping rule is changed so that T doesn’t stop

on a bad edge if it is of high probability. Instead, if T traverses such an edge, we “remember” the

time step in which T traversed that edge, in all the future. That is, we enter the index i toLu (and

remember it in all the future, until the end of the program). In addition, we add a stopping rule

that stops if the edge is “very-bad” and a stopping rule that stops if the number of indices in Lv is

too large, that is, if the number of high-probability edges that were already traversed is too large.

Next, we describe these rules formally. We initiate Lv1 = ∅.

Significant Vertices

We say that a vertex v in layer-i of the second part of the program is significant if

∥Px|v1→v∥2 > 2ℓ2 · 2−n.

102

Page 113: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Significant Values

For a vertex v in layer-i of the second part of the program, denote by Sig(v) the set of all x′ ∈ X,

such that,

Px|v1→v(x′) > 24ℓ · 2−n.

Bad Edges

For a vertex v in layer-i of the second part of the program, denote by Bad(v) the set of all a ∈ A,

such that, ∣∣(M · Px|v1→v)(a)∣∣ ≥ 2−2r.

Very-Bad Edges

For a vertex v in layer-i of the second part of the program, denote by VeryBad(v) the set of all

(a, b) ∈ A× {−1, 1}, such that,

Prx[M(a, x) = b | v1 → v] ≤ 2−4ℓ.

High-Probability Edges

For a vertex v in layer-i of the second part of the program, denote byHigh(v) the set of all a ∈ A,

such that,

Pr[ai+1 = a | v0 → v] ≥ 2k · 2−n′ .

The Truncated-Path T on the Second Part

We define T on the second part, by induction on the layers. Assume that we already defined T

until it reaches a vertex v in layer-i of the second part (and we already defined Lv). The path T

stops on v if (at least) one of the following occurs:

103

Page 114: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

1. v is significant.

2. x ∈ Sig(v).

3. ai+1 ∈ Bad(v) \High(v).

4. (ai+1, bi+1) ∈ VeryBad(v).

5. |Lv| ≥ 200εℓ.

6. Recall that we changed the second part of the program so that it is the product of the first

part and the (original) second part. This means that the second part of the program runs

its own copy of the first part of the program. If the path T , that was defined for the first

part, stops on the copy of the first part that the second part runs, the path T stops on the

vertex v too.

Remark 3.4.1. We note that if T stopped on the first part, it couldn’t have reached v1 in

the first place. Thus, conditioned on the event v0 → v1, the path T didn’t stop on the first

part. Therefore, conditioned on the event v0 → v1, the path T never stops because of stopping

rule 6. Thus, this stopping rule is not necessary. Nevertheless, we add this stopping rule for

completeness, so that it would be possible to consider the path T starting from any vertex (even

in the middle of the program), without conditioning on the event of reaching that vertex.

Otherwise, unless T already reached the end of the second part, T proceeds by following the

edge labeled by (ai+1, bi+1) (same as the computational-path). Let (v, u) be the edge labeled

by (ai+1, bi+1). It remains to define the list Lu.

Updating Lu

Let (v, u) be the traversed edge, labeled by (ai+1, bi+1). If the traversed edge (v, u) is not a high-

probability edge, that is, if ai+1 ∈ High(v), we define Lu = Lv.

104

Page 115: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

If the traversed edge (v, u) is a high-probability edge, that is, if ai+1 ∈ High(v), we define

Lu = Lv ∪ {i + 1}, and hence, by induction, Lu is the list of all indices corresponding to the

high-probability edges that T traversed, in the second part of the program, until reaching u.

3.4.2 Bounding theWidth of the Branching Program B

From now on, we will only consider the final branching program, B.

The final branching program, B, has a larger width than the original one. The main contri-

butions to the larger width is that we changed the second part to be the product of the first and

second parts of the original program and that each vertex in the second part of B remembers the

vertex v1 (reached at the end of the first part). This multiplies the memory needed (that is, the

logarithm of the width of the program) by a factor of at most 3. In addition, each vertex v has to

remember Lv, but by Equation (3.1) and since T stops when |Lv| ≥ 200εℓ, this adds memory of

at most εℓr100 . Thus, the final width of B is at most 2εkℓ/2.

3.4.3 The Probability that T Stops is Small

We will now prove that the probability that T stops before reaching a leaf is at most 1100 + o(1).

Lemma 3.4.2. The probability that T stops before reaching a leaf is at most 1100 + o(1).

Proof. First, recall that if T reaches a vertex in layer-m, such that, the probability to reach that

vertex is smaller than 2−10r ·d−11 , then T stops. By the union bound, the probability that T stops

because of this rule is at most 2−10r = o(1).

We will now bound the probability that T stops because of each of the other stopping rules.

Recall that for two vertices w1,w2 in the program, we denote by w1 → w2 the event (over

x, a1, . . . , am) that the path T that starts from w1 reaches w2.

Stopping Rule 1: Significant-Vertices

Lemma 3.4.3. The probability that T reaches a significant vertex is at most o(1).

105

Page 116: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Lemma 3.4.3 is proved in Section 3.5.

Next, we will bound the probability that T stops because of each of the other stopping rules.

ByLemma3.4.3, it’s sufficient to bound these probabilities, under the assumption thatT doesn’t

reach any significant vertex (as otherwise, T would have stopped because of stopping rule 1).

Stopping Rule 2: Significant-Values

We will now bound the probability that T stops because of stopping rule 2. We will first prove

the following claim.

Claim 3.4.1. If v is a non-significant vertex in layer-i of part-j (where j ∈ {1, 2}), then

Prx[x ∈ Sig(v) | vj−1 → v] ≤ 2−2ℓ.

Proof. Since v is not significant,

Ex′∼Px|vj−1→v

[Px|vj−1→v(x′)

]=∑x′∈X

[Px|vj−1→v(x′)2

]= 2n · E

x′∈RX

[Px|vj−1→v(x′)2

]≤ 22ℓj · 2−n.

Hence, byMarkov’s inequality,

Prx′∼Px|vj−1→v

[Px|vj−1→v(x′) > 24ℓ · 2−n

]≤ 22ℓj−4ℓ ≤ 2−2ℓ.

Since conditioned on the event vj−1 → v, the distribution of x is Px|vj−1→v, we obtain

Prx

[x ∈ Sig(v)

∣∣ vj−1 → v]= Pr

x

[(Px|vj−1→v(x) > 24ℓ · 2−n

) ∣∣ vj−1 → v]≤ 2−2ℓ.

By Claim 3.4.1, if v is a non-significant vertex in layer-i of part-j then

Prx[x ∈ Sig(v) | vj−1 → v] ≤ 2−2ℓ ≤ 2−4ℓ. (3.3)

106

Page 117: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

We need to bound from above

Ev

[Prx[x ∈ Sig(v) | v0 → v]

], (3.4)

where the expectation is over the non-significant vertices v in layer-i of part-j, reached by the path

T . (If T stops before reaching layer-i of part-j, or if it reaches a significant vertex, we think of v

as undefined and think of the inner probability as 0). If j = 1, we are done by Claim 3.4.1. We

will proceed with the case j = 2. Recall that ℓ2 = ℓ.

We will use the following lemma, whose proof is deferred to the next subsection. We shall

instantiate the lemma by setting Sv = Sig(v).

Lemma 3.4.4. Assume that for every non-significant vertex v in layer-i of part-2, we have some

subset of valuesSv ⊆ Xthat depends only on v. Assume that for every such v (with positive probability

for the event v1 → v, where v1 is the vertex recorded by v), we have

Prx[x ∈ Sv | v1 → v] ≤ 2−4ℓ.

Then,

Ev[Prx[x ∈ Sv | v0 → v]] < 2−Ω(ℓ) (3.5)

where the expectation is over the non-significant vertices v in layer-i of part-2, reached by the path

T . (If T stops before reaching layer-i of part-2, or if it reaches a significant vertex, we think of v as

undefined and think of the inner probability as 0).

By Expression (3.3), the assumption of the lemma is satisfied by the choiceSv = Sig(v). Thus,

the conclusion of the lemma implies that

Ev

[Prx[x ∈ Sig(v) | v0 → v]

]≤ 2−Ω(ℓ).

Thus, the probability that T stops because of stopping rule 2 is at most 2−Ω(ℓ), in each step,

107

Page 118: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

and taking a union bound over the length of the program, the probability that T stops because

of stopping rule 2 is at most 2−Ω(ℓ).

Proof of Lemma 3.4.4

Proof. We could also write Ev[Prx[x ∈ Sv | v0 → v]] as

∑v∈Li,2

Pr[v0 → v] · Prx[x ∈ Sv | v0 → v] =

∑v∈Li,2

Pr[(x ∈ Sv) ∧ (v0 → v)]

where Li,2 denotes the non-significant vertices v in layer-i of part-2, that are reachable (with

probability larger than 0) from the start vertex.

Later on, we will define for every v ∈ Li,2, an event Gv that will occur with high probability.

We will denote by Gv, the complement ofGv. We will bound

∑v∈Li,2

Pr[(x ∈ Sv) ∧ (v0 → v)],

by bounding separately

∑v∈Li,2

Pr[Gv ∧ (x ∈ Sv) ∧ (v0 → v)] (3.6)

and ∑v∈Li,2

Pr[Gv ∧ (x ∈ Sv) ∧ (v0 → v)] (3.7)

The second expression will be bounded by

∑v∈Li,2

Pr[Gv ∧ (v0 → v)], (3.8)

that will be at most 2−Ω(ℓ) (see Claim 3.4.2). Thus, we will focus first on bounding Expres-

108

Page 119: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

sion (3.6), which is equal to

∑v∈Li,2

∑x′∈Sv

Pr[Gv ∧ (x = x′) ∧ (v0 → v)] (3.9)

=∑v∈Li,2

∑x′∈Sv

Pr[Gv ∧ (v0 → v) | (x = x′)] · Pr[x = x′]. (3.10)

Recall that for two vertices w1,w2 in the program, we denote by w1 → w2 the event (over

x, a1, . . . , am) that the path T that starts from w1 reaches w2.

Recall that by the construction of the branching-program B, part-2 runs a copy of part-1 of

the computation. Thus, the vertex v has a corresponding vertex v′ in layer-i of part-1, such that,

if the path T reached v it previously reached v′. Recall also that v remembers v1, so if the path T

reached v it previously reached v1.

Thus, the event v0 → v is equivalent to v0 → v′ → v1 → v, that is, the event

(v0 → v′) ∧ (v′ → v1) ∧ (v1 → v).

Since the second part of the program runs a copy of the first part, the event v1 → v implies the

event v0 → v′. Hence, the event v0 → v is equivalent to the event

(v′ → v1) ∧ (v1 → v).

Note also that if we fix x, that is, if we condition on x = x′, and we fix v (which also fixes v′, v1)

the events (v′ → v1) and (v1 → v) are independent (as the first one depends only on ai+1, . . . , am

and the second depends only on a1, . . . , ai). We will also have the property that the eventGv is a

function of v′ rather than v, and hence will also be denoted byGv′ = Gv (recall that v determines

v′). Moreover, if we fix x and v′, we will have the property that the event Gv′ depends only on

ai+1, . . . , am, and hence the eventsGv′ and (v′ → v1) are independent of (v1 → v).

109

Page 120: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Thus, for a fixed v (which also fixes v′, v1) and any x′ ∈ X,

Pr[Gv ∧ (v0 → v) | x = x′]

= Pr[Gv′ ∧ (v′ → v1) ∧ (v1 → v) | x = x′]

= Pr[Gv′ ∧ (v′ → v1) | x = x′] · Pr[v1 → v | x = x′].

We introduce the event (v′ → v1) to indicate that the computational path from v′ reached v1

(as opposed to the usual notation that denotes the truncated path). Since (v′ → v1) implies

(v′ → v1)we have

Pr[Gv′ ∧ (v′ → v1) | x = x′] · Pr[v1 → v | x = x′]

≤ Pr[Gv′ ∧ (v′ → v1) | x = x′] · Pr[v1 → v | x = x′].

By Bayes’ rule, the last expression is at most

Pr[x = x′ | Gv′ ∧ (v′ → v1)] · Pr[x = x′ | v1 → v] · Pr[v′ → v1] · Pr[v1 → v]Pr[x = x′]2

= Px|Gv′∧(v′ → v1)(x′) · Px|v1→v(x′) ·Pr[v′ → v1] · Pr[v1 → v]

Pr[x = x′]2.

Thus, Expression (3.10) is at most

∑v∈Li,2

(Pr[v′ → v1] · Pr[v1 → v] ·

∑x′∈Sv

Px|Gv′∧(v′ → v1)(x′)Pr[x = x′]

· Px|v1→v(x′)

). (3.11)

Note that from layer-iofpart-1 to layer-mofpart-1, thebranchingprogram is one-pass. Denote

by Rv′ the one-pass branching program, from layer-i of part-1 to layer-m of part-1, with starting

vertex v′. Thus, we can use what we already know about one-pass branching programs. We will

apply a slightmodificationof themain theoremof [GRT18] (PropositionA.0.1 fromAppendix),

for one-pass branching programs, with parameters k′ = k, ℓ′ = ℓ, r′ = r/4.

110

Page 121: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Asm ≤ 2εr and Rv′ has width at most 2εkℓ/2 ≤ 2k′·ℓ′/100 (ε is small enough), by Proposition

A.0.1, we know that for anyfixed v′, there exists an eventGv′ that depends only on x, ai+1, . . . , am,

such that, Pr(Gv′) ≥ 1 − 2−ℓ/8 (ℓ ≤ k), and for every x′ ∈ X, and every v1 such that Pr[Gv′ ∧

(v′ → v1)] > 0 it holds that

Px|Gv′∧(v′ → v1)(x′) ≤ 22ℓ · 2−n.

Namely, the eventGv′ is the eventG fromPropositionA.0.1 corresponding to the branching pro-

gramRv′ (that is, the eventGv′ is the event that the truncated-path as defined for one-pass branch-

ing programs in [GRT18] with slight modification, didn’t stop because of one of the stopping

rules, until the last layer, and didn’t violate the significant vertices and significant values stopping

rules in the last layer, that is, layer-m of part-1).

Substituting this in Expression (3.11), we get that the expression is at most

22ℓ ·∑v∈Li,2

(Pr[v′ → v1] · Pr[v1 → v] ·

∑x′∈Sv

Px|v1→v(x′)

). (3.12)

By the assumption of the lemma, for any v ∈ Li,2 we have∑

x′∈Sv Px|v1→v(x′) ≤ 2−4ℓ, thus

Expression (3.12) is at most

2−2ℓ ·∑v∈Li,2

Pr[v′ → v1] · Pr[v1 → v].

Recall thatLi,2 denotes only the vertices v in layer-i of part-2, that are reachable (with probability

larger than 0) from the start vertex, v0. Recall that the event (v1 → v) is equivalent to the event

(v0 → v′) ∧ (v1 → v).

Thus,

∑v∈Li,2

Pr[v′ → v1] · Pr[v1 → v] ≤∑v′,v1,v

Pr [v′ → v1] · Pr [(v0 → v′) ∧ (v1 → v)]

111

Page 122: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

=∑v′,v1

Pr [v′ → v1] ·(∑

v

Pr [(v0 → v′) ∧ (v1 → v)])

≤∑v′,v1

Pr [v′ → v1] · Pr [v0 → v′]

=∑v′

Pr [v0 → v′] ·(∑

v1

Pr [v′ → v1])

≤∑v′

Pr [v0 → v′] ≤ 1

(where the possible inequality in the first line is because the first sum is on all the paths v0 →

v′ → v1 → v, obtained with positive probabilities, whereas the second sum is on all possible

vertices v0, v′, v1, v in the corresponding layers of the branching program).

Thus, we conclude that Expression (3.6) is atmost 2−2ℓ. It remains to boundExpression (3.8).

Claim 3.4.2. ∑v∈Li,2

Pr[Gv ∧ (v0 → v)] ≤ 2−Ω(ℓ).

Proof.

∑v∈Li,2

Pr[Gv ∧ (v0 → v)] =∑v∈Li,2

Pr[Gv ∧ (v0 → v′) ∧ (v′ → v1) ∧ (v1 → v)]

≤∑v′,v1,v

Pr[Gv ∧ (v0 → v′) ∧ (v′ → v1) ∧ (v1 → v)]

=∑v′,v1,v

Pr[Gv′ ∧ (v0 → v′) ∧ (v′ → v1) ∧ (v1 → v)]

=∑v′∈Li,1

Pr[Gv′ ∧ (v0 → v′)] ·∑v1,v

Pr[(v′ → v1) ∧ (v1 → v)|Gv′ ∧ (v0 → v′)]

≤∑v′∈Li,1

Pr[Gv′ ∧ (v0 → v′)]. (3.13)

112

Page 123: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

For every non-significant v′ ∈ Li,1, denote by

Xv′ = {x′ : Px|v0→v′(x′) ≥ 2ℓ/16 · 2−n},

and split the expressionPr[Gv′ ∧ (v0 → v′)] according to whether or not (x ∈ Xv′).

Pr[Gv′ ∧ (v0 → v′)]

≤ Pr[(v0 → v′) ∧ (x ∈ Xv′)] + Pr[Gv′ ∧ (v0 → v′) ∧ (x /∈ Xv′)] (3.14)

We begin by bounding the first summand in Expression (3.14):

Pr[(v0 → v′) ∧ (x ∈ Xv′)] = Pr(v0 → v′) · Pr[(x ∈ Xv′)|v0 → v′]

We bound Pr[x ∈ Xv′ |v0 → v′] very similarly to the proof of Claim 3.4.1, but with a different

threshold. Since v′ is not significant,

Ex′∼Px|v0→v′

[Px|v0→v′(x′)

]=∑x′∈X

[Px|v0→v′(x′)2

]= 2n · E

x′∈RX

[Px|v0→v′(x′)2

]≤ 22ℓ1 · 2−n.

Hence, byMarkov’s inequality,

Pr[x ∈ Xv′ |v0 → v′] = Prx′∼Px|v0→v′

[Px|v0→v′(x′) ≥ 2ℓ/16 · 2−n

]≤ 22ℓ1−ℓ/16 ≤ 2−ℓ/32

(recall that ℓ1 = ℓ/100). Overall, we bounded the first summand in Expression (3.14) by

Pr(v0 → v′) · 2−ℓ/32.

Next, we bound the second summand in Expression (3.14).

Pr[Gv′ ∧ (v0 → v′) ∧ (x /∈ Xv′)

]

113

Page 124: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

=∑

x′∈X\Xv′

Pr(x = x′) · Pr[Gv′ ∧ (v0 → v′) | x = x′

].

Since if we fix x and v′, the event Gv′ depends only on ai+1, . . . , am and hence is independent of

(v0 → v′), we have

∑x′∈X\Xv′

Pr(x = x′) · Pr[Gv′ ∧ (v0 → v′) | x = x′

]

=∑

x′∈X\Xv′

Pr(x = x′) · Pr[Gv′ | x = x′

]· Pr

[v0 → v′ | x = x′

]= Pr

(v0 → v′

)·∑

x′∈X\Xv′

Pr[Gv′ | x = x′

]· Pr

[x = x′ | v0 → v′

](by Bayes’ rule)

= Pr(v0 → v′

)·∑

x′∈X\Xv′

Pr[Gv′ | x = x′

]· Px|v0→v′(x′)

≤ Pr(v0 → v′

)·∑

x′∈X\Xv′

Pr[Gv′ | x = x′

]· 2ℓ/16 · 2−n

(by the definition ofXv′)

≤ Pr(v0 → v′

)· Pr

(Gv′)· 2ℓ/16

≤ Pr(v0 → v′

)· 2−ℓ/8 · 2ℓ/16

≤ Pr(v0 → v′

)· 2−ℓ/16.

Substituting in Expression (3.14), we have

Pr[Gv′ ∧ (v0 → v′)] ≤ Pr(v0 → v′

)· 2−ℓ/32 + Pr

(v0 → v′

)· 2−ℓ/16.

114

Page 125: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Substituting in Expression (3.13), we have

∑v∈Li,2

Pr[Gv ∧ (v0 → v)] ≤ (2−ℓ/32 + 2−ℓ/16) ·∑v′∈Li,1

Pr(v0 → v′

)≤ 2 · 2−ℓ/32.

This finishes the proof of Lemma 3.4.4.

Stopping Rule 3: Bad-Edges

We will now bound the probability that T stops because of stopping rule 3. We will first prove

the following claim.

Claim 3.4.3. If v is a non-significant vertex in layer-i of part-j (where j ∈ {1, 2}), then

Prai+1

[ai+1 ∈ Bad(v)] ≤ 2−4k.

Proof. Since v is not significant, ∥Px|vj−1→v∥2 ≤ 2ℓj · 2−n ≤ 2ℓ · 2−n. Since Px|vj−1→v is a distri-

bution, ∥Px|vj−1→v∥1 = 2−n. Thus,

∥Px|vj−1→v∥2∥Px|vj−1→v∥1

≤ 2ℓ.

SinceM is a (10k, 10ℓ)-L2-extractor with error 2−10r, there are at most 2−10k · |A| elements a ∈ A

with ∣∣⟨Ma,Px|vj−1→v⟩∣∣ ≥ 2−10r · ∥Px|vj−1→v∥1 = 2−10r · 2−n

The claim follows since ai+1 is uniformly distributed over A.

By Claim 3.4.3, if v is a non-significant vertex then

Prai+1

[ai+1 ∈ Bad(v)] ≤ 2−4k.

115

Page 126: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

We need to bound

Prai+1

[ai+1 ∈ Bad(v) \High(v) | v0 → v].

We bound

Prai+1

[ai+1 ∈ Bad(v) \High(v) | v0 → v] =∑

a∈Bad(v)\High(v)

Pr[ai+1 = a | v0 → v]

≤∑

a∈Bad(v)\High(v)

2k · 2−n′ ≤ 2k ·∑

a∈Bad(v)

Pr[ai+1 = a]

= 2k · Prai+1

[ai+1 ∈ Bad(v)] ≤ 2k · 2−4k = 2−3k.

Thus, the probability that T stops because of stopping rule 3 is at most 2−3k, in each step, and

taking a union bound over the length of the program, the probability that T stops because of

stopping rule 3 is at most 2−2k.

Stopping Rule 4: Very-Bad Edges

We will now bound the probability that T stops because of stopping rule 4.

Recall that for a vertex v in layer-i of part-2 of the program, VeryBad(v) is the set of all (a, b) ∈

A× {−1, 1}, such that,

Prx[M(a, x) = b | v1 → v] ≤ 2−4ℓ.

Note that for every a ∈ A, there is at most one b ∈ {−1, 1}, denoted bv(a), such that

Prx[M(a, x) = b | v1 → v] ≤ 2−4ℓ.

If such a b doesn’t exist we let bv(a) = ∗, and think of it as undefined. Thus, for every v, and

every (a, b) ∈ A× {−1, 1},

((a, b) ∈ VeryBad(v)

)⇐⇒

(b = bv(a)

), (3.15)

116

Page 127: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

and

Prx[M(a, x) = bv(a) | v1 → v] ≤ 2−4ℓ. (3.16)

Let av ∈ A be an a ∈ A, such that Prx[M(a, x) = bv(a) | v0 → v] is maximal and let

bv = bv(av). We need to bound from above

Ev

[Pr[(ai+1, bi+1) ∈ VeryBad(v) | v0 → v]

], (3.17)

where the expectation is over the vertex v in layer-i of part-2, reached by the path T . (If T stops

before reaching layer-i of part-2, we think of v as undefined and think of the inner probability

as 0). That is, we could also write Expression (3.17) as

∑v∈Li,2

Pr[v0 → v] · Pr[(ai+1, bi+1) ∈ VeryBad(v) | v0 → v],

whereLi,2 denotes the vertices v in layer-i of part-2, that are reachable (with probability larger

than 0) from the start vertex. By Equation (3.15), Expression (3.17) is equal to

Ev

[Pr[bi+1 = bv(ai+1) | v0 → v]

],

which, by the definition of bi+1, is equal to

Ev

[Pr[M(ai+1, x) = bv(ai+1) | v0 → v]

],

which, by the definitions of av, bv, is at most

Ev

[Pr[M(av, x) = bv | v0 → v]

]. (3.18)

In what follows, we assume for simplicity and without loss of generality that for every v, bv ∈

{−1, 1} is defined (as otherwise Pr[M(av, x) = bv | v0 → v] = 0 and can be omitted from the

117

Page 128: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

expectation).

For any fixed v, denote by Sv = {x : M(av, x) = bv}. We can apply Lemma 3.4.4, since from

Expression (3.16) for any non-significant v

Pr[x ∈ Sv | v1 → v] ≤ 2−4ℓ.

Thus, we get

Ev[Pr[x ∈ Sv | v0 → v]] ≤ 2−Ω(ℓ) ,

and since (x ∈ Sv) ⇐⇒ (M(av, x) = bv), we have

Ev[Pr[M(av, x) = bv | v0 → v]] ≤ 2−Ω(ℓ) .

Finally, by the definitions of av and bv we have

Ev[Pr[(ai+1, bi+1) ∈ VeryBad(v) | v0 → v]] ≤ E

v[Pr[M(av, x) = bv | v0 → v]] ≤ 2−Ω(ℓ) .

Thus, the probability that T stops because of stopping rule 4 is at most 2−Ω(ℓ), in each step,

and taking a union bound over the length of the program, the probability that T stops because

of stopping rule 4 is at most 2−Ω(ℓ).

Stopping Rule 5: Large Lv

Recall that for two vertices w1,w2 in the program, we denote by w1 → w2 the event (over

x, a1, . . . , am) that the path T that starts from w1 reaches w2.

Recall that v1 is the vertex reached by the path at the end of part-1. Fix v1 and denote by E the

event v0 → v1. Let u0, u1, . . . , um be the vertices reached by the path in part-2, where u0 = v1.

(If the path stops before reaching layer-i of part-2, we define ui to be a special stop vertex in that

layer). Note that conditioned on the event E, the random variable ui is a function of x, a1, . . . , ai

118

Page 129: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

and for i ≥ 1 it can also be viewed as a function of x, ui−1, ai.

Denote by T the number of high-probability edges that the path traverses in part-2. For every

i ∈ [m], letTi ∈ {0, 1} be an indicator random variable that indicates whether the path traverses

a high-probability edge at step-i of part-2. Thus,

T =m∑i=1

Ti.

For every i ∈ [m], we have that Ti = 1 only if ai ∈ High(ui−1), that is, only if Pr(ai|ui−1,E) ≥

2k · 2−n′ , or equivalentlylog(2n′ · Pr(ai|ui−1,E)

)k

≥ 1.

Claim 3.4.4. Let Z ∈ {0, 1}n′ be any random variable. Let k ≥ 4. Let T(Z) ∈ {0, 1} be an

indicator random variable for the event Pr(Z) ≥ 2k · 2−n′ . Then,

2 · EZ

[log(2n′ · Pr(Z)

)k

]≥ E

Z[T(Z)].

Proof. Let α = PrZ(T(Z) = 1). That is, we havePr(Z) ≥ 2k · 2−n′ with probability α. Thus,

EZ

[log(2n′ · Pr(Z)

)k

]=

α · EZ

[log(2n′ · Pr(Z)

)k

∣∣∣∣∣ T(Z) = 1

]+ (1− α) · E

Z

[log(2n′ · Pr(Z)

)k

∣∣∣∣∣ T(Z) = 0

].

By the monotonicity of the logarithm function, we have,

α · EZ

[log(2n′ · Pr(Z)

)k

∣∣∣∣∣ T(Z) = 1

]≥

α · EZ

[log(2n′ · 2k · 2−n′

)k

∣∣∣∣∣ T(Z) = 1

]= α

By the monotonicity of the logarithm function and the concavity of the entropy function, we

119

Page 130: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

have,

(1− α) · EZ

[log(2n′ · Pr(Z)

)k

∣∣∣∣∣ T(Z) = 0

]≥

(1− α) · EZ

[log(2n′ · (1− α) · 2−n′

)k

∣∣∣∣∣ T(Z) = 0

]=

(1− α) log(1− α)k

(as, by the concavity of the entropy function, the expression is minimized when the random vari-

able Z|(T(Z) = 0) is uniformly distributed).

Thus, the left hand side of the claim is at least

2α +2(1− α) log(1− α)

k≥ 2α − 4α

k≥ 2α − 4α

4= α.

The claim follows Since EZ[T(Z)] = α.

By Claim 3.4.4,

Ex,a1,...,am

[T|E] =m∑i=1

Ex,a1,...,am

[Ti|E] ≤ 2 ·m∑i=1

Ex,a1,...,am

[log(2n′ · Pr(ai|ui−1,E)

)k

]

=2k·

(mn′ −

m∑i=1

H(ai|ui−1,E)

),

whereH denotes the entropy function. Since conditioning may only decrease the entropy, the

last expression is at most

≤ 2k·

(mn′ −

m∑i=1

H(ai|x, ui−1,E)

).

Since, conditioned on E, the random variable ui−1 is a function of x, a1, . . . , ai−1, by the data-

processing inequality,H(ai|x, ui−1,E) ≥ H(ai|x, a1, . . . , ai−1,E), and hence the last expression

120

Page 131: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

is at most

≤ 2k·

(mn′ −

m∑i=1

H(ai|x, a1, . . . , ai−1,E)

).

By the chain rule, the last expression is equal to

=2k·(mn′ −H(a1, . . . , am|x,E)

)=

2k·(mn′ −H(x, a1, . . . , am|E) +H(x|E)

)≤ 2

k·(mn′ + n−H(x, a1, . . . , am|E)

)≤ 2

k· log

(1

Pr(E)

).

Thus,

Ex,a1,...,am

[T|E] ≤ 2k· log

(1

Pr(E)

).

ByMarkov inequality

Prx,a1,...,am

[T ≥ 200

k· log

(1

Pr(E)

) ∣∣∣∣∣E]≤ 1

100.

Since we assumed that Pr(E) ≥ 2−10r · d−11 and since the width of B is at most 2εkℓ/2 and

since by Equation (3.1) and Equation (3.2), r is negligible compared to εkℓ/2, we have that

log(

1Pr(E)

)≤ εkℓ. Hence,

Prx,a1,...,am

[T ≥ 200εℓ

∣∣∣∣ E] ≤ 1100

.

Thus, the probability to stop on part-2 because of stopping rule 5 is at most 1100 .

121

Page 132: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Stopping Rule 6: Consistency-Stop

We will now show that the probability that T stops on a vertex v, in layer-i of part-2, because of

stopping rule 6, conditioned on the event v0 → v, is 0.

Recall that by the construction of the branching-program B, part-2 runs a copy of part-1 of

the computation. Thus, the vertex v has a corresponding vertex v′ in layer-i of part-1, such that,

if the path T reached v it previously reached v′.

If T needs to stop on v, because of stopping rule 6, because T stopped on the vertex v′, it

couldn’t have reached v in the first place (as it would have stopped on v′). Thus, conditioned on

the event v0 → v, the path T didn’t stop on v′ and doesn’t need to stop on v because of stopping

rule 6.

Thus, the probability that T stops because of stopping rule 6 is 0.

This completes the proof of Lemma 3.4.2.

3.4.4 The Final Success Probability is Small

Let v be a vertex in the last layer of the program. Assume that the probability for the event v0 → v

is larger than 0. Since v is in the last layer, the event v0 → v is equivalent to v1 → v (since the

second part of the program runs a copy of the first part). Hence,

Px|v0→v = Px|v1→v

and

Pr[v0 → v] = Pr[v1 → v].

In particular, if v is not significant, Px|v0→v has small L2-norm.

Ex′∈RX

[Px|v0→v(x′)2

]≤ 22ℓ · 2−2n.

122

Page 133: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Hence, for every x′ ∈ X,

Pr[x = x′ | v0 → v] = Px|v0→v(x′) ≤ 2ℓ · 2−n/2 ≤ 2−n/4

In particular,

Pr[x(v) = x | v0 → v] ≤ 2−n/4.

Thus, either the computation path stops before reaching vwhich happens with probability at

most 1100 + o(1) or it reaches a non-significant vertex where the probability of guessing correctly

is o(1). Thus, the final success probability is bounded by 1100 + o(1). This completes the proof of

Theorem 3.2.1.

3.5 Probability of Stopping at Significant Vertices

In this section, we prove Lemma 3.4.3.

ProofOverview. Let s be a significant vertex in part-j (that remembers the vertices visited at

the end of parts 1, . . . , j − 1, denoted by s1, . . . , sj−1). Assume that the probability for the event

v0 → s is larger than 0. We need to bound from above the probability for the event v0 → s. Since

the event v0 → s is equivalent to (v0 → sj−1) ∧ (sj−1 → s), it suffices to bound from above the

probability for (sj−1 → s). Note that to analyze this probability we can ignore all parts of the

program, except for part-j, which is a one-pass branching program.

We would like to reprove Lemma 4.1 of [GRT18], with the updated stopping rules. In the

definition of the progress functionZi, we will take the sum only on vertices u ∈ Li,j, such that s

can be reached from u (and in the same way for edges in the definition ofZ ′i ). In particular, this

implies that every index in Lu is contained in Ls (as otherwise s cannot be reached from u).

The progress function is still small at the beginning and large at the end, so as before the main

thing to do is to prove that it grows slowly. This was done in Claim 4.10 of [GRT18].

The main difference here is that the progress function doesn’t grow slowly for every edge, as

123

Page 134: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

some edges are now bad, and we have to take the bad edges into account. We separate to time

steps that are in Ls and time steps that are not in Ls. For time steps that are not in Ls, we don’t

need to count the bad edges at all, as they are not recorded byLs and hence s is not reachable from

these edges.

As for steps in Ls, we know that the edges are not very-bad, and we show that the progress

function may increase by a factor of at most 25ℓk. Since |Ls| ≤ 200εℓ (as otherwise T would

have stopped by stopping rule 5), the total effect of the bad edges on the progress function is a

factor of at most 25ℓk·200εℓ ≤ 21000εkℓ,which we can afford.

3.5.1 Proof of Lemma 3.4.3

Proof. We need to prove that the probability that T reaches any significant vertex is o(1). Let

s be a significant vertex in part-j. Assume that the probability that T reaches s is larger than 0.

We will bound from above the probability that T reaches s, and then use a union bound over all

significant vertices of B. Since the event v0 → s is equivalent to (v0 → sj−1)∧(sj−1 → s), it suffices

to bound from above the probability for (sj−1 → s). Note that to analyze this probability we can

ignore all parts other than j of the program, which leaves us with a one-pass branching program.

Furthermore, since s determines sj−1, we can only consider the subprogram that starts at sj−1 and

analyze the probability that the restriction of T to this subprogram reaches s. We denote by B′

the subprogram of B restricted to the j-part with sj−1 as the starting node.

The Distributions Px|v and Px|e

For a vertex v in B′, we denote by Ev the event that T starting from sj−1 reaches the vertex v.

For simplicity, we denote by Pr(v) = Pr(Ev) the probability for Ev (where the probability is

over x, a1, . . . , am), and we denote by Px|v = Px|Ev the distribution of the random variable x

conditioned on the event Ev.

Similarly, for an edge e of the branching program B′, let Ee be the event that T starting from

sj−1 traverses the edge e. Denote, Pr(e) = Pr(Ee) (where the probability is over x, a1, . . . , am),

124

Page 135: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

and Px|e = Px|Ee .

Notation: B′ inherits the definitions of significant vertices, Sig(v), Bad(v), VeryBad(v) and

High(v) from B. Note that significant vertices, Sig(v), Bad(v) and VeryBad(v) are defined con-

ditioned on the event vj−1 → v, which is equivalent to the event Ev. Recall that the walk T does

not stop on an edge (v, u)marked (a, b) if a ∈ High(v), as long as (a, b) /∈ VeryBad(v). We will

use the following fact on T : if i /∈ Ls and T takes a bad-edge (v, u) on the i-th step, then Lu ⊆ Ls

and s is not reachable from u.

For i ∈ {0, . . . ,m}, let L′i be the set of vertices v in layer-i of B′ such that Pr(v) > 0 and it

is possible to reach s from v (in particular, the set of high-probability equations stored in v is

also stored in s). For i ∈ {1, . . . ,m}, let Γi be the set of edges e from L′i−1 to L′i of B′, such that

Pr(e) > 0.

Recall that by the construction of the branching-program B, part-j runs a copy of all previous

parts of the computation. Thus, a vertex v in B′ or equivalently a vertex v in part-j of B has corre-

sponding vertices v′1, . . . , v′j−1 in layer-i of parts 1, . . . , j− 1, respectively, such that, if the path T

reached v it previously reached v′1, . . . , v′j−1. We denote by v′j = v. We denote by

Sig(v) ≜j⋃

j′=1

Sig(v′j′).

Recall that by stopping rules 2 and 6, the path T stops if x ∈ Sig(v).

The next claim bounds the probability of stopping on a vertex v in part-2 due to stopping

rule 2 of part-1 on the vertex v′ that v remembers.

Claim 3.5.1. If v is a non-significant vertex in layer-i of part-2 that remembers v′, and v′ is a

non-significant vertex in layer-i of part-1, then

Prx[x ∈ Sig(v′) | v1 → v] ≤ 2−2ℓ.

125

Page 136: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Proof. Since v is not significant,

Ex′∼Px|v1→v

[Px|v0→v′(x′)

]=∑x′∈X

[Px|v0→v′(x′) · Px|v1→v(x′)

](using Cauchy-Schwarz)

≤√∑

x′∈X

Px|v0→v′(x′)2 ·∑x′∈X

Px|v1→v(x′)2

= 2n ·√

Ex′∈RX

[Px|v0→v′(x′)2

]E

x′∈RX

[Px|v1→v(x′)2

](since both v′ and v are non-significant)

≤ 2ℓ1+ℓ2 · 2−n.

Hence, byMarkov’s inequality,

Prx′∼Px|v1→v

[Px|v0→v′(x′) > 24ℓ · 2−n

]≤ 2ℓ1+ℓ2−4ℓ ≤ 2−2ℓ.

Since conditioned on the event v1 → v, the distribution of x is Px|v1→v, we obtain

Prx

[x ∈ Sig(v′)

∣∣ v1 → v]= Pr

x

[(Px|v0→v(x) > 24ℓ · 2−n

) ∣∣ v1 → v]≤ 2−2ℓ.

Claim 3.5.2. Let i ∈ {1, . . . ,m}. For any edge e = (v, u) ∈ Γi, labeled by (a, b), such that

Pr(e) > 0, for any x′ ∈ X,

Px|e(x′) =

0 if x′ ∈ Sig(v) or M(a, x′) = b

Px|v(x′) · c−1e if x′ ∈ Sig(v) and M(a, x′) = b

where ce is a normalization factor that satisfies

126

Page 137: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

• ce ≥ 12 − 2 · 2−2r, if i /∈ Ls.

• ce ≥ 2−4ℓ − 2 · 2−2ℓ ≥ 2−5ℓ, if i ∈ Ls.

Proof. Let v′1, . . . , v′j be the vertices in the branching program B that v remembers. Let e = (v, u)

be an edge of B′, labeled by (a, b), and such that Pr(e) > 0. Since Pr(e) > 0, the vertices

v′1, . . . , v′j are not significant (as otherwise T always stops on v and hence Pr(e) = 0). Also,

since Pr(e) > 0, we know that (a, b) is not very-bad (as otherwise T never traverses e and hence

Pr(e) = 0).

If T reaches v, it traverses the edge e if and only if: x ∈ Sig(v) (as otherwise T stops on v) and

M(a, x) = b and ai+1 = a. Therefore, for any x′ ∈ X,

Px|e(x′) =

0 if x′ ∈ Sig(v) or M(a, x′) = b

Px|v(x′) · c−1e if x′ ∈ Sig(v) and M(a, x′) = b

where ce is a normalization factor, given by

ce =∑

{x′ : x′ ∈Sig(v) ∧M(a,x′)=b}Px|v(x′) = Pr

x[(x ∈ Sig(v)) ∧ (M(a, x) = b) | Ev].

Since v′1, . . . , v′j are not significant, by Claim 3.4.1 and Claim 3.5.1:

Prx[x ∈ Sig(v) | Ev] ≤

j∑j′=1

2−2ℓ ≤ 2 · 2−2ℓ ≤ 2−2r.

If i /∈ Ls, then a /∈ Bad(v), as otherwise Lu ⊆ Ls and s is not reachable from u. Thus

∣∣∣Prx[M(a, x) = 1 | Ev]− Pr

x[M(a, x) = −1 | Ev]

∣∣∣ = ∣∣(M · Px|v)(a)∣∣ ≤ 2−2r,

and hence

Prx[M(a, x) = b | Ev] ≤ 1

2 + 2−2r.

127

Page 138: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Hence, by the union bound,

ce = Prx[(x ∈ Sig(v)) ∧ (M(a, x) = b) | Ev] ≥ 1

2 − 2 · 2−2r.

If i ∈ Ls, then (a, b) /∈ VeryBad(v), and we havePrx[M(a, x) = b | Ev] ≥ 2−4ℓ. Thus,

ce = Prx[(x ∈ Sig(v)) ∧ (M(a, x) = b) | Ev] ≥ 2−4ℓ − 2 · 2−2ℓ .

Bounding the Norm of Px|s

Wewill show that ∥Px|s∥2 cannot be too large. Towards this, wewill first prove that for every edge

e of B′ that is traversed by T starting from sj−1 with probability larger than zero, ∥Px|e∥2 cannot

be too large.

Claim 3.5.3. For any edge e of B′, such thatPr(e) > 0,

∥Px|e∥2 ≤ 25ℓ · 2ℓj · 2−n.

Proof. Let e = (v, u) be an edge of B′, labeled by (a, b), and such thatPr(e) > 0. SincePr(e) >

0, the vertex v is not significant (as otherwise T always stops on v and hence Pr(e) = 0). Thus,

∥Px|v∥2 ≤ 2ℓj · 2−n.

By Claim 3.5.2, for any x′ ∈ X,

Px|e(x′) =

0 if x′ ∈ Sig(v) or M(a, x′) = b

Px|v(x′) · c−1e if x′ ∈ Sig(v) and M(a, x′) = b

where ce satisfies ce ≥ 2−5ℓ. Thus,

∥Px|e∥2 ≤ c−1e · ∥Px|v∥2 ≤ 25ℓ · 2ℓj · 2−n

128

Page 139: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Claim 3.5.4.

∥Px|s∥2 ≤ 25ℓ · 2ℓj · 2−n.

Proof. Let Γin(s) be the set of all edges e of B′, that are going into s, such that Pr(e) > 0. Note

that ∑e∈Γin(s)

Pr(e) = Pr(s).

By the law of total probability, for every x′ ∈ X,

Px|s(x′) =∑

e∈Γin(s)

Pr(e)Pr(s) · Px|e(x′),

and hence by Jensen’s inequality,

Px|s(x′)2 ≤∑

e∈Γin(s)

Pr(e)Pr(s) · Px|e(x′)2.

Summing over x′ ∈ X, we obtain,

∥Px|s∥22 ≤∑

e∈Γin(s)

Pr(e)Pr(s) · ∥Px|e∥22.

By Claim 3.5.3, for any e ∈ Γin(s),

∥Px|e∥22 ≤(25ℓ · 2ℓj · 2−n

)2.

Hence,

∥Px|s∥22 ≤(25ℓ · 2ℓj · 2−n

)2.

129

Page 140: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Similarity to a Target Distribution

Recall that for two functions f, g : X → R+, we defined

⟨f, g⟩ = Ez∈RX

[f(z) · g(z)].

We think of ⟨f, g⟩ as a measure for the similarity between a function f and a target function g.

Typically f, gwill be distributions.

Claim 3.5.5.

⟨Px|s,Px|s⟩ > 22ℓj · 2−2n.

Proof. Since s is significant,

⟨Px|s,Px|s⟩ = ∥Px|s∥22 > 22ℓj · 2−2n.

Claim 3.5.6.

⟨UX,Px|s⟩ = 2−2n,

where UX is the uniform distribution over X.

Proof. Since Px|s is a distribution,

⟨UX,Px|s⟩ = 2−2n ·∑z∈X

Px|s(z) = 2−2n.

Measuring the Progress

For i ∈ {0, . . . ,m}, let

Zi =∑v∈L′i

Pr(v) · ⟨Px|v,Px|s⟩k.

130

Page 141: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

For i ∈ {1, . . . ,m}, let

Z ′i =∑e∈Γi

Pr(e) · ⟨Px|e,Px|s⟩k.

We think ofZi,Z ′i as measuring the progress made by the branching program, towards reach-

ing a state with distribution similar to Px|s.

For a vertex v ∈ L′i ofB′, let Γout(v) be the set of all edges e ofB′, that are going out of v toL′i+1,

such that Pr(e) > 0. Note that

∑e∈Γout(v)

Pr(e) ≤ Pr(v).

(We don’t always have an equality here, since sometimes T stops on v, or goes to a vertex from

which s is not reachable).

Recall that Ls stores a (not too long) list of indices to layers on which the pathmight choose to

go over bad edges. The next four claims show that the progress made by the branching program

is slow on every layer i /∈ Ls. On layers i ∈ Ls the progress might be significant but we will still

have meaningful bounds on it.

Claim 3.5.7. For every vertex v ∈ L′i−1, such that Pr(v) > 0,

∑e∈Γout(v)

Pr(e)Pr(v) · ⟨Px|e,Px|s⟩k ≤ ⟨Px|v,Px|s⟩k · cki + 2−2nk+k · cki ,

where ci is defined as

• ci = 1+ 2−r, if i /∈ Ls.

• ci = 25ℓ, if i ∈ Ls.

Proof. If v is significant or v is a leaf, thenT always stops on v and hence Γout(v) is empty and thus

the left hand side is equal to zero and the right hand side is positive, so the claim follows trivially.

Thus, we can assume that v is not significant and is not a leaf.

131

Page 142: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Define P : X → R+ as follows. For any x′ ∈ X,

P(x′) =

0 if x′ ∈ Sig(v)

Px|v(x′) if x′ ∈ Sig(v)

Note that by the definition of Sig(v) and since Sig(v) ⊆ Sig(v), for any x′ ∈ X,

P(x′) ≤ 24ℓ · 2−n. (3.19)

Define f : X → R+ as follows. For any x′ ∈ X,

f(x′) = P(x′) · Px|s(x′).

By Claim 3.5.4 and Equation (3.19),

∥f∥2 ≤ 24ℓ · 2−n · ∥Px|s∥2 ≤ 24ℓ · 2−n · 25ℓ · 2ℓj · 2−n ≤ 210ℓ · 2−2n. (3.20)

By Claim 3.5.2, for any edge e ∈ Γout(v), labeled by (a, b), for any x′ ∈ X,

Px|e(x′) =

0 if M(a, x′) = b

P(x′) · c−1e if M(a, x′) = b

where ce satisfies ce ≥ 12 − 2 · 2−2r if i /∈ Ls and ce ≥ 2−5ℓ if i ∈ Ls. Denote by cv the minimal

value that ce can get for e ∈ Γout(v). By the above, cv ≥ 2−5ℓ and cv ≥ 12 − 2 · 2−2r if i /∈ Ls. Note

that c−1v ≤ 2ci in both cases (recall that ci = 25ℓ for i ∈ Ls and ci = 1+ 2−r if i /∈ Ls). Therefore,

for any edge e ∈ Γout(v), labeled by (a, b), for any x′ ∈ X,

Px|e(x′) · Px|s(x′) =

0 if M(a, x′) = b

f(x′) · c−1e if M(a, x′) = b

132

Page 143: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

and hence, we have

⟨Px|e,Px|s⟩ = Ex′∈RX

[Px|e(x′) · Px|s(x′)] = Ex′∈RX

[f(x′) · c−1e · 1{x′∈X : M(a,x′)=b}]

= Ex′∈RX

[f(x′) · c−1e · (1+b·M(a,x′))

2

]≤ (∥f∥1 + b · ⟨Ma, f⟩) · (2cv)−1. (3.21)

We will now consider two cases:

Case I: ∥f∥1 < 2−2n

In this case, we bound |⟨Ma, f⟩| ≤ ∥f∥1 (since f is non-negative and the entries of M are

in {−1, 1}) and obtain for any edge e ∈ Γout(v),

⟨Px|e,Px|s⟩ < c−1v · 2−2n ≤ 2ci · 2−2n.

Since∑

e∈Γout(v)Pr(e)Pr(v) ≤ 1, Claim 3.5.7 follows, as the left hand side of the claim is smaller than

the second term on the right hand side.

Case II: ∥f∥1 ≥ 2−2n

For every a ∈ A, define

t(a) =|⟨Ma, f⟩|∥f∥1

.

By Equation (3.21),

⟨Px|e,Px|s⟩k < ∥f∥k1 · (1+ t(a))k · (2cv)−k (3.22)

Note that by the definitions of P and f,

∥f∥1 = Ex′∈RX

[f(x′)] = ⟨P,Px|s⟩ ≤ ⟨Px|v,Px|s⟩.

Note also that for every a ∈ A, there is at most one edge e(a,1) ∈ Γout(v), labeled by (a, 1), and at

133

Page 144: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

most one edge e(a,−1) ∈ Γout(v), labeled by (a,−1), and we have

Pr(e(a,1))Pr(v) +

Pr(e(a,−1))

Pr(v) ≤ 1|A| ,

since 1|A| is the probability that the next sample read by the program is a. Thus, summing over all

e ∈ Γout(v), by Equation (3.22),

∑e∈Γout(v)

Pr(e)Pr(v) · ⟨Px|e,Px|s⟩k < ⟨Px|v,Px|s⟩k · E

a∈RA

[(1+ t(a))k

]· (2cv)−k. (3.23)

It remains to bound

Ea∈RA

[(1+ t(a))k

], (3.24)

using the properties of the matrixM and the bounds on the L2 versus L1 norms of f.

By Equation (3.20) and the assumption that ∥f∥1 ≥ 2−2n we get

∥f∥2∥f∥1

≤ 210ℓ .

SinceM is a (10k, 10ℓ)-L2-extractor with error 2−10r, there are at most 2−10k · |A| rows a ∈ A

with t(a) =|⟨Ma,f⟩|∥f∥1 ≥ 2−10r. We bound the expectation in Equation (3.24), by splitting the

expectation into two sums

Ea∈RA

[(1+ t(a))k

]= 1|A| ·

∑a : t(a)≤2−10r

(1+ t(a))k + 1|A| ·

∑a : t(a)>2−10r

(1+ t(a))k . (3.25)

We bound the first sum in Equation (3.25) by (1 + 2−10r)k. As for the second sum in Equa-

tion (3.25), we know that it is a sum of at most 2−10k · |A| elements, and since for every a ∈ A,

we have t(a) ≤ 1, we have

1|A| ·

∑a : t(a)>2−10r

(1+ t(a))k ≤ 2−10k · 2k ≤ 2−2r

134

Page 145: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

(where in the last inequality we used the fact that r ≤ k). Overall, we get

Ea∈RA

[(1+ t(a))k

]≤ (1+ 2−10r)k + 2−2r ≤ (1+ 2−2r)k+1. (3.26)

Substituting Equation (3.26) into Equation (3.23), we obtain

∑e∈Γout(v)

Pr(e)Pr(v) · ⟨Px|e,Px|s⟩k < ⟨Px|v,Px|s⟩k ·

(1+ 2−2r

)k+1 · (2cv)−k.

If i /∈ Ls, then (2cv)−1 ≤ (1+ 2−2r+3) and thus (1+ 2−2r)k+1 · (2cv)−k ≤ (1+ 2−r)k (where the

inequality uses the assumption that r is sufficiently large).

If i ∈ Ls, then (2cv)−1 ≤ 12 · 2

5ℓ and thus (1+ 2−2r)k+1 · (2cv)−k ≤ 25ℓk. This completes the

proof of Claim 3.5.7.

Claim 3.5.8. Recall the definition of ci from Claim 3.5.7. For every i ∈ {1, . . . ,m},

Z ′i ≤ (Zi−1 + 2−2nk+k) · cki

Proof. By Claim 3.5.7,

Z ′i =∑e∈Γi

Pr(e) · ⟨Px|e,Px|s⟩k =∑v∈L′i−1

Pr(v) ·∑

e∈Γout(v)

Pr(e)Pr(v) · ⟨Px|e,Px|s⟩k

≤∑v∈L′i−1

Pr(v) ·(⟨Px|v,Px|s⟩k + 2−2nk+k) · cki

= cki ·(Zi−1 +

∑v∈L′i−1

Pr(v) · 2−2nk+k)

≤ cki ·(Zi−1 + 2−2nk+k)

Claim 3.5.9. For every i ∈ {1, . . . ,m},

Zi ≤ Z ′i .

135

Page 146: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Proof. For any v ∈ L′i, let Γin(v) be the set of all edges e ∈ Γi, that are going into v. Note that

∑e∈Γin(v)

Pr(e) = Pr(v).

By the law of total probability, for every v ∈ L′i and every x′ ∈ X,

Px|v(x′) =∑

e∈Γin(v)

Pr(e)Pr(v) · Px|e(x′),

and hence

⟨Px|v,Px|s⟩ =∑

e∈Γin(v)

Pr(e)Pr(v) · ⟨Px|e,Px|s⟩.

Thus, by Jensen’s inequality,

⟨Px|v,Px|s⟩k ≤∑

e∈Γin(v)

Pr(e)Pr(v) · ⟨Px|e,Px|s⟩k.

Summing over all v ∈ L′i, we get

Zi =∑v∈L′i

Pr(v) · ⟨Px|v,Px|s⟩k ≤∑v∈L′i

Pr(v) ·∑

e∈Γin(v)

Pr(e)Pr(v) · ⟨Px|e,Px|s⟩k

=∑e∈Γi

Pr(e) · ⟨Px|e,Px|s⟩k = Z ′i .

Claim 3.5.10. For every i ∈ {1, . . . ,m},

Zi ≤ 2r+3k+5ℓk·|Ls| · 2−2k·n.

Proof. By Claim 3.5.8 and Claim 3.5.9, for every i ∈ {1, . . . ,m},

Zi ≤ (Zi−1 + 2−2nk+k) · cki

136

Page 147: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

where ci = (1 + 2−r) if i /∈ Ls and ci = 25ℓ if i ∈ Ls. Thus, we can show by induction on

i ∈ {1, . . . ,m} that

Zi ≤ 2−2nk+k · (i+ 1) ·i∏

i′=1

cki′

Hence, for any i ∈ {1, . . . ,m} it holds that

Zi ≤ 2−2nk+k · (m+ 1) · (1+ 2−r)mk · 25ℓk|Ls|.

Sincem ≤ 2εr ≤ 2r − 1,

Zi ≤ 2−2k·n+k · 2r · ek · 25ℓk|Ls|.

Proof of Lemma 3.4.3

We can now complete the proof of Lemma 3.4.3. Assume that s is in layer-i ofB′. ByClaim 3.5.5,

Zi ≥ Pr(s) · ⟨Px|s,Px|s⟩k > Pr(s) ·(22ℓj · 2−2n

)k= Pr(s) · 22ℓj·k · 2−2k·n.

On the other hand, by Claim 3.5.10,

Zi ≤ 2r+3k+5ℓk·|Ls| · 2−2k·n.

Thus, we get

Pr(s) ≤ 2r+3k+5ℓk·|Ls| · 2−2ℓj·k

We treat differently the case j = 1 and j = 2 as follows. For j = 1, the set Ls is empty, and we have

Pr(s) ≤ 2r+3k · 2−2ℓ1·k ≤ 2−ℓ1·k.

For j = 2 we have

Pr(s) ≤ 2r+3k+5ℓk·|Ls| · 2−2ℓ·k

137

Page 148: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

≤ 2r+3k+1000εℓ2k · 2−2ℓ·k (|Ls| ≤ 200εℓ)

≤ 24k+1000εℓk · 2−2ℓ·k (r ≤ k, ℓ ≤√ℓ)

≤ 2−ℓk ≤ 2−ℓ1·k . (ε < 1/1010)

Thus, in both cases we showedPr(s) ≤ 2−ℓ1k.

Recall that we showed that the width of B is at most 2εkℓ/2, and note that the length of B is at

most 2·2εr. Taking a union bound over atmost 2εkℓ/2 ·2·2εr ≤ 2kℓ1/2 significant vertices of B, we

conclude that the probability that T reaches any significant vertex is at most 2−kℓ1/2 = o(1).

138

Page 149: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

If you have an idea, you have to believe

in yourself or no one else will.

SarahMichelle Gellar

4Security of Goldreich’s PRG against

Space-Bounded Adversaries

The results in this chapter are based on joint work with Pravesh K. Kothari and Ran

Raz [GKR20].

This chapter is motivated by the following basic question: suppose an algorithm is provided

with a stream ofm i.i.d. samples from a random source. What’s the minimummemory required

to decide whether the source is “truly random” or “pseudorandom”?

Algorithmically distinguishing perfect randomness from pseudorandomness naturally arises

in the context of learning theory (and can even be equivalent to learning in certain mod-

139

Page 150: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

els [Dan16, DS16, Vad17, KL18]), pseudorandomness and cryptography. As discussed in previ-

ous chapters, there has been a surge of progress in proving lower bounds for memory-bounded

streaming algorithms beginning with Shamir [Sha14] and Steinhardt-Valiant-Wager [SVW16]

who conjectured a Ω(n2) memory lower bound for learning parity functions with 2o(n) sam-

ples. This conjecture was proven in [Raz16]. In a follow up work, this was generalized to learn-

ing sparse parities in [KRT17] and more general learning problems in [Raz17, GRT18, MM17,

BGY18, DS18, MT17, MM18, SSV19, GRT19].

All of these lower bounds hold for learning (more generally, search) problems that ask to

identify an unknown member of a target function class from samples. In this work [GKR20],

we build on the progress above and develop techniques to show lower bounds for appar-

ently easier task of simply distinguishing uniformly distributed samples from pseudorandom

ones. [DGKR19] studies the related problem of distribution testing under communication and

memory constraints. [DGKR19] gave a one-pass streaming algorithm (and a matching lower

bound for a broad range of parameters) for uniformity testing on [N] that usesmmemory and

O(N log(N)/(mε4)) samples for distinguishing between uniform distribution on [N] and any

distribution that is ε-far from uniform.

As we next discuss, the results in this chapter have consequences of interest in cryptography

(ruling out memory-bounded attacks on Goldreich’s pseudorandom generator [Gol00] in the

streaming model) and average-case complexity (unconditional lower bounds on the number of

samples needed, formemory-bounded algorithms, to refute randomconstraint satisfactionprob-

lems, in the streaming model).

4.0.1 Main Results of this Chapter

Wenowdescribe the results inmoredetail. We showmemory-sample trade-offs for distinguishing

between truly random and pseudorandom sources for the following two settings:

1. Uniform vs k-Subspace Source: The pseudorandom subspace source of dimension k

chooses some arbitrary k-dimensional linear subspace S ⊆ {0, 1}n and draws points uni-

140

Page 151: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

formly from S. The truly random source draws points uniformly from {0, 1}n.

2. Uniform vs Local Pseudorandom Source: The pseudorandom source fixes a k-ary

Boolean predicate P : {0, 1}k → {0, 1}. It chooses a uniformly random x ∈ {0, 1}n

and generates samples (α, b) ∈ [n](k)×{0, 1}where [n](k) represents the set of all ordered

k-tuples with exactly k elements from [n] and α is chosen uniformly at random from [n](k)

and b is the evaluation of P at xα - the k-bit string obtained by projecting x onto the coor-

dinates indicated by α. The truly random source generates samples (α, b)where α ∈ [n](k)

and b ∈ {0, 1} are chosen uniformly and independently.

Wemodel a distinguishing algorithm by a branching program (BP) of width 2b (or memory b)

and lengthm. Such amodel captures any algorithm that takes as input a stream ofm samples and

has a memory of at most b bits. Observe that there’s no restriction on the computation done at

any node of an BP. Roughly speaking, this model gives the algorithm unbounded computational

power and bounds only its memory-size and the number of samples used.

The first main result shows a lower bound on memory-bounded BPs for distinguishing be-

tween uniform and k-subspace sources.

Theorem 4.0.1 (Uniform vs Subspace Sources). Any algorithm that distinguishes between uni-

form and subspace source of dimension k (assuming k > c log n for some large enough constant c)

with probability at least 1/2+ 2−o(k) requires either a memory ofΩ(k2) or at least 2Ω(k) samples.

In particular, distinguishing between the uniform distribution on {0, 1}n and the uniform distri-

bution on an unkown linear subspace of dimension n/2 in {0, 1}n requiresΩ(n2)memory or 2Ω(n)

samples.

Crouch et. al. [CMVW16] recently proved that any algorithm that uses at most n/16 bits of

space requires Ω(2n/16) samples to distinguish between uniform source and a subspace source of

dimension k = n/2. They suggest the question of improving the space bound to Ω(n2) while

noting that their techniques do not suffice. For k = Θ(n), the lower bound shows that any

141

Page 152: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

algorithm with memory at most cn2 for some absolute constant c requires 2Ω(n) samples. This

resolves their question.

Upper bound: In Section 4.3, we exhibit a simple explicit branching program that uses 2O(k)

samples andO(1)memory to succeed in solving the distinguishing problemwith probability 3/4.

We also show a simple algorithm that uses O(k2) memory and O(k) samples, and succeeds in

solving thedistinguishingproblemwithprobability 3/4. Thus, in thebranchingprogrammodel,

the lower bound is tight up to constants in the exponent.

Second main result gives a memory-sample trade-off for the uniform vs local pseudorandom

source problem for all predicates that have a certainwell-studied pseudorandomproperty studied

in cryptography under the name of resilience. A k-ary Boolean functionP is said to be t-resilient if

t is the maximum integer such that (−1)P (taking the range of the boolean function to be {-1,1})

has zero correlation with every parity function of at most t − 1 out of k bits. In particular, the

parity function on k bits is k-resilient.

Theorem 4.0.2 (Uniform vs Local Pseudorandom Sources). Let 0 < ε < 1 − 3 log 24log n and P be

a t-resilient k-ary predicate for k < n(1−ε)/6/3, n/c1. Then, any BP that succeeds with probability

at least 1/2+Ω(( tn

)Ω(t·(1−ε))) at distinguishing between uniform and local pseudorandom source

for predicate P, requires(nt

)Ω(t·(1−ε)) samples or nε memory.

Upper bound: In Section 4.3, we give an algorithm that takes (nε + k)k log n memory and

(n(1−ε)k)(nε+k) samples, anddistinguishes betweenuniformand local pseudorandomsource for

any predicate P, with probability 99/100. Thus, the lower bounds are almost tight up to log n

factors and constant factors in the exponent for certain predicates (t = Ω(k)). The question

of whether there exists a better algorithm that runs in O(n(1−ε)t) samples and O(nε) memory,

and distinguishes between uniform and local pseudorandom source with high probability, for

t-resilient predicates P, remains open.

This result has interesting implications for well-studied algorithmic questions in average-

case complexity and cryptography such as refuting random constraint satisfaction [Fei02,1c is a large enough constant

142

Page 153: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

AOW15, RRS17, KMOW17]) and existence of local pseudorandom generators [CM01,

MST06, BBKK18, LV17b, App13, App16] with super linear stretch where a significant effort

has focused on proving lower bounds on various restricted models such as propositional and

algebraic proof systems, spectral methods, algebraic methods and semidefinite programming hi-

erarchies. While bounded memory attacks are well-explored in cryptography [Mau92, CM97,

AR99, ADR02, Vad03, DM04, Raz16, VV16, GZ19], to the best of our knowledge, memory

has not been studied as explicit resource in this context. We discuss these applications further in

the chapter.

For the special case when P(x) =∑k

i xi mod 2, the parity function on k bits, we can prove

stronger results for a wider range of parameters.

Theorem 4.0.3 (Uniform vs Local Pseudorandom Sources with Parity Predicate). Let 0 < ε <

1 − 3 log 24log n and P be the parity predicate on k bits for 0 < k < n/c (c is a large enough constant).

Suppose there’s a BP that distinguishes between uniform and local pseudorandom source for the par-

ity predicate, with probability at least 1/2 + s and uses< nε memory. If s > Ω(( k

n

)Ω((1−ε)·k)),

then, the BP requires(nk

)(Ω((1−ε)·k) samples.

4.0.2 Applications to Security of Goldreich’s PseudorandomGenerator

A fundamental goal in cryptography is to produce secure constructions of cryptographic prim-

itives that are highly efficient. In line with this goal, Goldreich [Gol00] proposed a candidate

one-way function given by the following pseudorandom mapping that takes n-bit input x and

outputsm bits: fix a predicate P : {0, 1}k → {0, 1}, pick a1, a2, . . . , am uniformly at random2

from [n](k) and output P(xa1),P(xa2), . . . ,P(xam). Here, a1, . . . , am and P are public and the

seed x is secret. Later works (starting with [MST03]) suggested using this candidate as pseudo-

random generator.

The main question of interest is the precise trade-off between the locality k and the stretch m

for a suitable choice of the predicate P. In several applications, we need that the generator has a2More generally, Goldreich proposed that a1, a2, . . . , am could be chosen in a pseudorandomway so as to ensure

a certain “expansion” property. We omit a detailed discussion here.

143

Page 154: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

super-linear stretch (i.e. m = n1+δ for some δ > 0) with constant locality (i.e. k = O(1)).

The simplicity and efficiency of such a candidate is of obvious appeal. This simplicity has

been exploited to yield a host of applications including public-key cryptography from combi-

natorial assumptions [ABW10], highly efficient secure multiparty computation [IKO+11] and

most recently, basing indistinguishability obfuscation on milder assumptions [Lin16a, AJS15,

LV16, Lin16b, LT17].

Evidence for the security of Goldreich’s candidate has been based on analyzing natural classes

of attacks based on propositional proof systems [ABSRW04], spectral methods and semidef-

inite programming hierarchies [OW14, AOW15, BCK15, KMOW17, LV17a, BBKK18] and

algebraic methods [ABR16, AL16]. In particular, previous works [KMOW17, AL16] identi-

fied t-resiliency of the predicate P as a necessary condition for the security of the candidate for

m = nΩ(t) stretch.

The uniform vs local pseudorandom source problem considered in this work is easily seen as

the algorithmic question of distinguishing the output stream generated byGoldreich’s candidate

generator from a uniformly random sequence of bits. In particular, the results imply security of

Goldreich’s candidate against boundedmemory algorithms for super-linear stretch when instan-

tiated with any t-resilient predicate for large enough constant t (but in the streaming model).

Goldreich’s candidate generator would fix the sets a1, a2, . . . , am (which are public) and output

P(xa1),P(xa2), . . . ,P(xam) for n sized input x (m > n). We prove the security of Goldreich’s

generator in the model where a1, a2, . . . , am, still public, are chosen uniformly at random from

[n](k) and streamedwith the generated bits. We note that the lower bounds continue to hold even

when the locality k grows polynomially with the seed length n.

Corollary 4.0.1 (Corollary ofTheorem4.0.2). Let0 < ε < 1−3 log 24log n andPbe a t-resilient k-ary

predicate for k = O(n(1−ε)/6). Then, Goldreich’s PRG, when instantiated with any t-resilient k-

ary predicate P such that k ≥ t > 36 and stretchm = (n/t)O(t)(1−ε), is secure against all branching

programs with memory-size bounded from above by nε, in the streaming model.

144

Page 155: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

4.0.3 Applications to Refuting RandomCSPs

Theorems 4.0.2 and 4.0.3 can also be intepreted as lower bounds for the problem of refuting

random constraint satisfaction problems.

A random CSP with predicate P : {0, 1}k → {0, 1} is a constraint satisfaction problem on

n variables x ∈ {0, 1}n. More relevant to us is the variant where the constraints are randomly

generated as follows: choose an ordered k-tuple of variables a from [n] at random, a bit b ∈

{0, 1} at random and impose a constraint P(xa) = b. When the number of constraints m ≫

n, the resulting instance is unsatisfiable with high probability for any non-constant predicate P.

The natural algorithmic problem in this regime is that of refutation - efficiently finding a short

witness that certifies that the given instance is far from satisfiable. It is then easy to note that the

uniform vs local pseudorandom source problem is the task of distinguishing between constraints

in a random CSP (with predicate P) and one with a satisfying assignment. Note that refutation

is formally harder than the task of distinguishing between a random CSP and one that has a

satisfying assignment.

Starting with investigating in proof complexity, randomCSPs have been intensively studied in

the past three decades. When P is t-resilient for t ≥ 3, all known efficient algorithms [AOW15]

require m ≫ n1.5 samples for the refutation problem. This issue was brought to the forefront

in [Fei02] where Feige made the famous “Feige’s Hypothesis” conjecturing the impossibility of

refuting random 3SAT in polynomial timewith Θ(n) samples. Variants of Feige’s hypothesis for

other predicates have been used to derive hardness results in both supervised [DLS13, DLS14a,

DLS14b] and unsupervised machine learning [BM16].

In [OW14], t-resilience was noted as a necessary condition for the refutation problem to be

hard. Theorems 4.0.2 and 4.0.3 confirm this as a sufficient condition for showing lower-bounds

for the refutation (in fact, even for the easier “distinguishing” variant) of random CSPs, with

t-resilient predicates, in the streaming model with bounded memory.

145

Page 156: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

4.1 Preliminaries

We use Ber(p) to denote the Bernoulli distribution with parameter p (probability of being 1). We

use [n] to denote the set {1, 2, ..., n}. For a random variable Z and an event E, we denote by PZ

the distribution of the randomvariablesZ, andwe denote byPZ|E the distribution of the random

variable Z conditioned on the event E.

Given an n−bit vector y ∈ {0, 1}n, we use yi to denote the ith coordinate of y, that is, y =

(y1, y2, ..., yn). We use y−i ∈ {0, 1}n−1 to denote y but with the ith coordinate deleted. Given

two n−bit vectors y, y′, we use ⟨y, y′⟩ to denote the inner product of y and y′ modulo 2, that is,

⟨y, y′⟩ =∑n

i=1 yiy′i mod 2. We use |y| to denote the number of ones in the vector y.

Given a set S, we use y ∈R S to denote the random process of picking y uniformly at random

from the set S. Given a probability distributionD, we use y ∼ D to denote the random process

of sampling y according to the distributionD.

Next, we restate the results from previous papers [Raz16, Raz17, KRT17, GRT18] used in

this chapter. To do so, we refer to the preliminaries and notations from Chapter 1 (Section 1.1).

Let X, A be two finite sets of size larger than 1. LetM : A × X → {−1, 1} be a matrix. The

matrixM corresponds to a learning problem as defined in Section 1.1.1. These papers model the

learner for the learning problem corresponding to the matrixM using a branching program as

shown in Section 1.1.4.

Theorem 4.1.1. [Raz16, Raz17, GRT18] Any branching program that learns x ∈ {0, 1}n, from

random linear equations over F2 with success probability 2−cn requires either a width of 2Ω(n2) or a

length of 2Ω(n) (where c is a small enough constant).

Wewould use the following lemma from [KRT17].

Lemma 4.1.1. [KRT17] Let Tl be the set of n-bit vectors with sparsity exactly-l for l ∈ N, that

is, Tl = {x ∈ {0, 1}n |∑n

i=1 xi = l}. Let δ ∈ (0, 1]. Let BTl(δ) = {α ∈ {0, 1}n |

146

Page 157: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

∣∣Ex∈Tl(−1)⟨α,x⟩∣∣ > δ}. Then, for δ ≥ ( 8ln )

l2 ,

|BTl(δ)| ≤ 2e−δ2/l·n/8 · 2n

4.1.1 Branching Program for a Distinguishing Problem

Let X, A be two finite sets of size larger than 1. Let D0 be a distribution over the sample space

|A|. Let {D1(x)}x∈X be a set of distributions over the sample space |A|. Consider the following

distinguishing problem: An unknown b ∈ {0, 1} is chosen uniformly at random. If b = 0,

the distinguisher is given independent samples from D0. If b = 1, an unknown x ∈ X is cho-

sen uniformly at random, and the distinguisher is given independent samples from D1(x). The

distinguisher tries to learn b from the samples drawn according to the respective distributions.

Formally, we model the distinguisher by a branching program as follows.

Definition 4.1.1. Branching Program for a Distinguishing Problem: A branching program

of length m and width d, for distinguishing, is a directed (multi) graph with vertices arranged in

m+ 1 layers containing at most d vertices each. In the first layer, that we think of as layer 0, there is

only one vertex, called the start vertex. A vertex of outdegree 0 is called a leaf. All vertices in the last

layer are leaves (but there may be additional leaves). Every non-leaf vertex in the program has |A|

outgoing edges, labeled by elements a ∈ A, with exactly one edge labeled by each such a, and all these

edges going into vertices in the next layer. Each leaf v in the program is labeled by a b(v) ∈ {0, 1},

that we think of as the output of the program on that leaf.

Computation-Path: The samples a1, . . . , am ∈ A that are given as input, define a

computation-path in the branching program, by starting from the start vertex and following at step t

the edge labeled by at, until reaching a leaf. The program outputs the label b(v) of the leaf v reached

by the computation-path.

Success Probability: The success probability of the program is the probability that b = b, where

b is the element that the programoutputs, and the probability is overb, x, a1, . . . , am (whereb is uni-

formly distributed over {0, 1}, x is uniformly distributed over X and a1, . . . , am are independently

147

Page 158: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

drawn fromD0 if b = 0 and D1(x) if b = 1).

4.2 Overview of the Proofs

We prove the theorems using two different techniques. We prove Theorem 4.0.1 through re-

duction to thememory-sample lower bounds for the corresponding learning problem in Section

4.3. Informally, for Theorem 4.0.1, we construct a branching program that learns the unknown

vector x from random linear equations inF2 by guessing each bit one by one sequentially and us-

ing the distinguisher, for distinguishing subspaces from uniform, to check if it guessed correctly.

Then, we are able to lift the previously-known memory-sample lower bounds for the learning

problem (Theorem 4.1.1) to the distinguishing problem.

We prove Theorems 4.0.2 and 4.0.3 in Section 4.4. Recall, a pseudorandom source fixes a k-

ary Boolean predicate P : {0, 1}k → {0, 1}. It chooses a uniformly random x ∈ {0, 1}n and

generates samples (α, b) ∈ [n](k) × {0, 1} where α is a uniformly random (ordered) k-tuple of

indices in [n] and b is the evaluation of P at xα - the k-bit string obtained by projecting x onto the

coordinates indicated by α. The truly random source samples (α, b) where α ∈ [n](k) and b ∈

{0, 1} are chosen uniformly and independently. The problem for a distinguisher is to correctly

guess whether the m samples are generated by a pseudorandom or a uniform source, when the

samples arrive in a stream. We first show through a hybrid argument that a distinguisher A that

distinguishes between the uniform and pseudorandom source, with an advantage of s over 1/2,

can also distinguish (with advantage of at least s/m) when only the jth (for some j) sample is

drawn from the “unknown source”, the first j − 1 samples are drawn from the pseudorandom

source and the lastm− j samples are drawn from the uniform source.

Let v be the memory state of A after seeing the first j− 1 samples, which were generated from

a pseudorandom source with a seed x picked uniformly at random from {0, 1}n. Let Px|v be the

probability distribution of the random variable x conditioned on reaching v. If the jth sample

is generated using the same pseudorandom source, then ∀α ∈ [n](k), the bit b is 0 with prob-

ability∑

x′:P(x′α)=0 Px|v(x′) and 1 with probability 1 −∑

x′:P(x′α)=0 Px|v(x′). If the jth sample is

148

Page 159: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

generated using the uniform source, then ∀α ∈ [n](k), the bit b is 0 with probability 1/2 and

1 with probability 1/2. Thus, for any α, A can identify the “unknown source” with an at most∣∣∣∑x′:P(x′α)=0 Px|v(x′)− 1/2∣∣∣ advantage.

We show thatwhenAhas lowmemory (< nε for some 0 < ε < 1), thenwith high probability,

it reaches a state v such thatPx|v has highmin-entropy (informally, it’s hard to determine the seed

for the pseudorandom source). We then use t-resiliency of P to show that when Px|v has high

min-entropy, thenwith high probability over α ∈ [n](k), bbehaves almost like in a uniform source

(Lemma 4.4.1), that is, |∑

x′:P(x′α)=0 Px|v(x′) − 1/2| is small. Hence, with high probability, it’s

hard for A to judge with ’good’ advantage whether b was generated from a pseudorandom or a

uniform source3. Note that the lastm− j samples generated by a uniform source can’t better this

advantage.

4.3 Time-Space Tradeoff through Reduction to Learning

In this section, we will prove time-space tradeoffs for the following distinguishing problem using

black-box reduction from the corresponding learning problem.

Distinguishing Subspaces from Uniform Informally, we study the problem of distin-

guishing between the cases when the samples are drawn from a uniform distribution over {0, 1}n

and when the samples are drawn randomly from a subspace of rank k over F2.

Let L(k, n) be the set of all linear subspaces of dimension k (⊆ {0, 1}n), that is, L(k, n) con-

tains all subspacesV such thatV = {v ∈ {0, 1}n | ⟨wi, v⟩ = 0 ∀i ∈ [n− k]} for some linearly

independent vectorsw1,w2, ...,wn−k. Formally, we consider distinguishers for distinguishing be-

tween the following distributions:

1. D0: Uniform distribution over {0, 1}n.

2. D1(S), S ∈ L(k, n): Uniform distribution over S.3Such arguments have been previously used to prove that a source fools bounded space, such as [NZ96]

149

Page 160: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Note: If the subspace S is revealed, it’s easy for a branching program of constant width to

distinguish w.h.p. by checking the inner product of the samples with a vector in the orthogonal

complement of S.

A distinguisher can distinguish subspaces if for an unknown random linear subspace S ∈

L(k, n), it can distinguish between D0 and D1(S). Formally, a distinguisher L, after seeing m

samples, has a success probability of p if

Pru1,...,um∼D0 [L(u1, ..., um) = 0] + PrS∈RL(k,n);u1,...,um∼D1(S)[L(u1, ..., um) = 1]2

= p (4.1)

Theorem 4.3.1. For k > c2 log n (where c2 is a large enough constant), any algorithm that can

distinguish k-dimensional subspaces overFn2 fromFn

2 ({0, 1}n), when samples are drawn uniformly

at random from the subspace or Fn2 respectively, with success probability at least 1

2 + 2−o(k) requires

either a memory of sizeΩ(k2) or 2Ω(k) samples.

We prove the theorem in Subsection 4.3.1. Briefly, we prove that using a distinguisher for dis-

tinguishing subspaces, we can construct a branching program that learns an unknown bit vector

x from random linear equations over F2. Then, we are able to lift the time-space tradeoffs of

Theorem 4.1.1.

Tightness of the Lower Bound We note two easy upper bounds that show that the re-

sults in Theorem 4.3.1 are tight (up to constants in the exponent). Firstly, we observe an algo-

rithm B1 that distinguishes subspaces of dimension k from uniform, using O(k2) memory and

O(k) samples, with probability at least 3/4 (0 < k ≤ n − 1). B1 stores the first min(8k, n)

bits of the first 8k samples (in O(k2)memory); outputs 1 if the samples (projected onto the first

min(8k, n) coordinates) belong to a ≤ k-dimensional subspace (of {0, 1}min(8k,n)), and 0 oth-

erwise (can be checked using gaussian elimination). When the samples are drawn from D1(S)

for some k-dimensional subspace S, then B1 always outputs the correct answer. When the sam-

ples are drawn from a uniform distribution on {0, 1}n, the probability that 8k samples form a

150

Page 161: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

k-dimensional subspace is at most

(8kk

)· 127k

≤ (8e)k2−7k < 2−2k ≤ 1/4

(because, if the 8k samples form a k-dimensional subspace, then at least 7k of them are linearly

dependent on the previously stored samples and that happens with at most 1/2 probability for

each sample). Hence, B1 errs with at most 1/4 probability.

Secondly, we observe that there exists a branching program that distinguishes subspaces of

dimension k from uniform using constant width and O(k · 2k) length with probability at least

3/4. Before, we show a randomized algorithm P that distinguishes between D0 and D1(S) for

every S ∈ L(k, n)with high probability. P is described as follows:

1. Repeat steps 2 to 3 sequentially for t = 10 · 2k iterations.

2. Pick a non-zero vector v uniformly at random from {0, 1}n. For the next 2k samples (of

the form a ∈ {0, 1}n), check if ⟨a, v⟩ = 0.

3. If all the 2k samples are orthogonal to v, exit the loop and output 1.

4. Output 0 (None of the randomly chosen vectors were orthogonal to all the samples seen

in its corresponding iteration).

Thenumber of samples seenbyP is 20k·2k. Now,weprove that for every subspaceSofdimension

k, that is, S ∈ L(k, n), P distinguishes betweenD0 andD1(S)with probability at least4

1− 12(e−5 +

102k) ≥ 3/4.

When the samples are drawn fromD0, the probability thatP outputs 1 is equal to the probabil-

ity that in at least one of the t iterations, the randomly chosen non-zero vector vwas orthogonal

to the 2k samples drawn uniformly from {0, 1}n. Here, the probability is over v and the samples.4k ≥ 5.

151

Page 162: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

By union bound, we can bound the probability of outputting 1 (error) by

10 · 2k ·(12

)2k

=102k.

For a fixed subspace S ∈ L(k, n), the probability that we pick a non-zero vector v ∈ {0, 1}n

that is orthogonal to S is at least 2n−k−12n−1 > 2−(k+1). Therefore, when the samples are drawn from

D1(S), the probability that P outputs 0 (error) is upper bounded by(1− 1

2k+1

)10·2k ≤ e−5.Here,

the probability is over the vectors v and the samples. Now to construct a constant width but

20k · 2k length branching program that distinguishes with probability at least 3/4, we consider a

bunch of branching programs each indexed by t vectors that are used in step 2 of the algorithm

P. It’s easy to see that for a fixed set of t vectors, P can be implemented by a constant width

branching program. As, when the t vectors are uniformly distributed over {0, 1}n (non-zero), P

can distinguish with probability at least 3/4 for every subspace S ∈ L(k, n), there exists a fixing

to the t vectors such that the corresponding branching program distinguishes between D0 and

D1(S) (when S is chosen uniformly at random from L(k, n)) with probability at least 3/4.

4.3.1 Proof of Theorem 4.3.1

Proof. We will prove through reduction to Theorem 4.1.1. Let B be the branching program

that distinguishes subspaces of dimension k, with width d, length m and success probability12 + s. Using B, we show that there exists a branching program for parity learning over {0, 1}k′

(where k < k′ ≤ n and would be defined concretely below), with width d2k′( 8n2 log ns2 )2,

length mk′( 8n2 log ns2 ) and success probability 1 − 1

n . Hence, Theorem 4.1.1 implies that either

d2k′( 8n2 log ns2 )2 = 2Ω(k′2) or mk′( 8n

2 log ns2 ) = 2Ω(k′). Assuming s ≥ 2−c1k′ (where c1 is a small

enough constant), k ≥ c2 log n, c3where c2, c3 are large enough constants, we get that d = 2Ω(k′2)

orm = 2Ω(k′). As k′ > k: we have shown that if B has success probability at least 12 + 2−c1k (for

small enough constant c1) at distinguishing k-dimensional subspaces, B has width at least 2Ω(k2)

or length 2Ω(k).

152

Page 163: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Firstly, using a simple argument, we show that B can distinguish between subspaces of dimen-

sion k′ − 1 and k′ for some k + 1 ≤ k′ ≤ n with success probability ≥ 12 + s

n . Writing the

expression for success probability from Equation 4.1,

Pru1,...,um∼D0 [B(u1, ..., um) = 0] + PrS∈RL(k,n);u1,...,um∼D1(S)[B(u1, ..., um) = 1]2

=12+ s

=⇒ Pru1,...,um∼D0

[B(u1, ..., um) = 0] + 1− PrS∈RL(k,n);u1,...,um∼D1(S)

[B(u1, ..., um) = 0] = 1+ 2s

=⇒ Pru1,...,um∼D0

[B(u1, ..., um) = 0]− PrS∈RL(k,n);u1,...,um∼D1(S)

[B(u1, ..., um) = 0] = 2s

The last expression on the left hand side can be written as

k+1∑k′=n

(Pr

S∈RL(k′,n);u1,...,um∼D1(S)[B(u1, ..., um) = 0]− Pr

S∈RL(k′−1,n);u1,...,um∼D1(S)[B(u1, ..., um) = 0]

)

(D1(S) = D0 for S ∈ L(n, n) as L(n, n) = {{0, 1}n})

Therefore, there exists k+ 1 ≤ k′ ≤ n such that

PrS∈RL(k′,n);u1,...,um∼D1(S)

[B(u1, ..., um) = 0]− PrS∈RL(k′−1,n);u1,...,um∼D1(S)

[B(u1, ..., um) = 0] ≥ 2sn

(4.2)

We have shown that B can solve the following distinguishing problem, that is, learn b with

success probability at least 12 + s

n : If b = 0, then the distinguisher is given samples from a k′-

dimensional subspace of {0, 1}n, otherwise (whenb = 1), the distinguisher is given samples from

a (k′ − 1)-dimensional subspace of {0, 1}n. Here, the probability is over b, the k′ and (k′ − 1)-

dimensional subspaces and the samples seen by B.

Next, using B, we construct a randomized learning algorithm P for parity learning. The parity

learning problem is as follows: a secret x ∈ {0, 1}k′ is chosen uniformly at random, the learner

wants to learn x from random linear equations over F2, that is, (a, b) where a ∈R {0, 1}k′ and

b = ⟨a, x⟩ (⟨a, x⟩ is the inner product of a and xmodulo 2). P uses B to guess each bit of x one

153

Page 164: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

by one as follows:

1. For i ∈ {1, 2, ..., k′}, do Steps 2 to 6.

2. Initiate count0 = 0, count1 = 0. These keep counts for the number of times the following

algorithm outputs 0, 1 respectively as the guess for xi.

3. Pick g to be 0 with probability 12 and 1 with probability

12 . This is a guess for x

i.

4. Let M be the set of all rank-n linear maps M : {0, 1}n → {0, 1}n over F2, that is, the

rows {Mr}r∈[n] are linearly independent. PickM ∈ M uniformly at random.

Let fM : {0, 1}k′ ×{0, 1} → {0, 1}n be defined as fM(a, b) = M · (a−i, b+gai, 0, 0, ...0)

(where · represents matrix-vector product, and (a−i, b + gai) is appended with n − k′

zeroes).

For the next m samples (a1, b1), (a2, b2), ..., (am, bm), P runs B with fM(aj, bj) as B’s jth

sample.

5. If B(fM(a1, b1), ..., fM(am, bm)) outputs 0, then increase count1−g by 1, otherwise, increase

countg by 1. In the discussions below, we will see that we increase the count for xi with

probability at least ( 12 +sn).

6. Repeat steps 3 to 5 for t = 8n2 log ns2 times. If count0 > count1, set x′i = 0 and store, else set

x′i = 1 and store. As we will see below, x′i = xi with probability at least (1− 1n2 ).

7. Output x′ as the guess for x.

Claim 4.3.1. For each x ∈ {0, 1}k′ , if x is the chosen secret, P outputs x′ = x with probability at

least (1− 1n).

Here, the probability is over the samples, all the random guesses g in Step 3 and the linearmaps

M in Step 4.

154

Page 165: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Proof. The probability that a single iteration of Steps 3 to 5 increases the counter for xi is the

probability that B(fM(a1, b1), ..., fM(am, bm)) outputs 1 when xi = g and 0 when xi = 1− g.

Consider the subspace Vg = {(a−i, b + gai) ∈ {0, 1}k′ | a = (a1, ..., ak′) ∈ {0, 1}k′ , b =

⟨a, x⟩}. Here, the additions are modulo 2 and a−i ∈ {0, 1}k′−1 is a with the ith coordinate

deleted. When xi = g, Vg forms a (k′ − 1)-dimensional subspace as (x−i, 1) is orthogonal to all

the vectors inVg. AsM is full rank, the range of fM(a, b) forms a (k′ − 1)-dimensional subspace

too and underM being picked uniformly at random fromM, we get a uniform distribution on

the (k′−1)-dimensional subspaces (L(k′−1, n)). When xi = g, it’s easy to see thatVg = {0, 1}k′

and thus, Range(fM) underM ∈R M is a uniform distribution on L(k′, n). Therefore,

Prg∈R{0,1};M∈RM;

a1,a2,...,am∈R{0,1}k′;∀j∈[m],bj=⟨aj,x⟩

[B(fM(a1, b1), ..., fM(am, bm)) = 1 ∧ xi = g]

+ Prg∈R{0,1};M∈RM;

a1,a2,...,am∈R{0,1}k′;∀j∈[m],bj=⟨aj,x⟩

[B(fM(a1, b1), ..., fM(am, bm)) = 0 ∧ xi = 1− g]

=12

PrS∈RL(k′−1,n);u1,u2,...,um∈RS

[B(u1, ..., um) = 1] + PrS∈RL(k′,n);

u1,u2,...,um∈RS

[B(u1, ..., um) = 0]

=

12

1+ PrS∈RL(k′,n);

u1,u2,...,um∈RS

[B(u1, ..., um) = 0]− PrS∈RL(k′−1,n);u1,u2,...,um∈RS

[B(u1, ..., um) = 0]

≥ 1

2+

sn

The last inequality follows from Equation 4.2.

Next, we prove that x′i = xi with probability at least (1 − 1n2 ) using Chernoff Bound. For

o = 1 to t, let Xo = 1 if we increase countxi in the oth iteration of Steps 3 to 5 for calculating xi,

and 0 otherwise. From the previous argument, we know that E(Xo) ≥ 12 +

sn . As, {Xo}o∈[t] are

independent random variables.

Pr[x′i = xi] = Pr

[t∑

o=1

Xo ≤t2

]≤ Pr

[t∑

o=1

Xo − E(t∑

o=1

Xo) ≤ − tsn

]

155

Page 166: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

≤ e−t4 (

sn )

2 ≤ 1n2

Claim 4.3.1 just follows from union bound, that is,

Pr[x′ = x] ≤k′∑i=1

Pr[x′i = xi] ≤ k′(

1n2

)≤ 1

n

Using P, we construct a set of branching programs one for each possible set of guesses g and

linear maps M. Let P[{gio}i∈[k′],o∈[t], {Mi

o}i∈[k′],o∈[t]]represent a branching program that exe-

cutes P with gio as the guess for xi andMio as the linear map in the oth iteration of Step 3 to 5 for

calculating xi.

P[{gio}i∈[k′],o∈[t], {Mi

o}i∈[k′],o∈[t]]can run B on modified samples in Step 4 using the same

width as B, as after fixing the linear map, each modified sample depends only on a single sam-

ple seen by P. And because a branching program is a non-uniform model of computation,

P[{gio}i∈[k′],o∈[t], {Mi

o}i∈[k′],o∈[t]]doesn’t need to store the guesses and maps. It does need to

store x′, count0, count1 in addition to the width of B, where the space for counts is reused for

each i. Therefore, the width (d′) of the branching programs, based on P, is ≤ d2k′(2log t)2 =

d2k′( 8n2 log ns2 )2.

It is easy to see that the length (m′) ofP[{gio}i∈[k′],o∈[t], {Mi

o}i∈[k′],o∈[t]]ismk′t = mk′( 8n

2 log ns2 ).

Through Claim 4.3.1, we know that for each x,

Prgio∈R{0,1};Mi

o∈RM;

a1,a2,...,am′∈R{0,1}k′;∀j∈[m′],bj=⟨aj,x⟩

[P((a1, b1), ..., (am′ , bm′)) = x] ≥ 1− 1n

Therefore,

Prx∈{0,1}k′ ;gio∈R{0,1};Mi

o∈RM;

a1,a2,...,am′∈R{0,1}k′;∀j∈[m′],bj=⟨aj,x⟩

[P((a1, b1), ..., (am′ , bm′)) = x] ≥ 1− 1n

156

Page 167: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

The above expression can be rewritten as follows:

Egio∈R{0,1};Mi

o∈RM

Prx∈{0,1}k′ ;a1,a2,...,am′∈R{0,1}k

′;

∀j∈[m′],bj=⟨aj,x⟩

[P((a1, b1), ..., (am′ , bm′)) = x]

≥ 1− 1n

Therefore, there exist guesses {gio}i∈[k′],o∈[t] and linear maps {Mio}i∈[k′],o∈[t] such that

Prx∈{0,1}k′ ;a1,a2,...,am′∈R{0,1}k

′;

∀j∈[m′],bj=⟨aj,x⟩

[P[{gio}i∈[k′],o∈[t], {Mi

o}i∈[k′],o∈[t]]((a1, b1), ..., (am′ , bm′)) = x

]

is at least 1− 1n . This gives us a branching program of width d′ and lengthm′ for parity learning

with success probability at least 1− 1n .

4.4 Sample-Memory Tradeoffs for Resilient Local PRGs

In this section, we prove the lower bound againstmemory bounded algorithms for distinguishing

between streaming outputs of Goldreich’s pseudorandom generator and perfectly random bits.

Before stating and proving the result in detail, we set up some notation and definitions that

will be convenient for us in this section.

4.4.1 Formal Setup

A k-ary predicate P is a Boolean functionP : {0, 1}k → {0, 1}. Let∑

α⊆[k] P(α)χα be the Fourier

polynomial for (−1)P ((−1)P(x) = (−1)P(x)). P is said to be t-resilient if t is the maximum

positive integer such that P(α) = 0 whenever |α| < t. In particular, the parity function ⟨α, x⟩ is

|α|-resilient. Here, χα : {0, 1}k → {−1, 1} is such that χα(x) = (−1)⟨α,x⟩.

Let [n](k) denote the set of all ordered k-tuples of exactly k elements of [n]. That is, no element

of [n] occurs more than once in any tuple of [n](k). For any a ∈ [n](k), let ai ∈ [n] denote

the element of [n] appearing in the ith position in a. Given x ∈ {0, 1}n and a ∈ [n](k), let

xa ∈ {0, 1}k be defined so that (xa)i = xai for every 1 ≤ i ≤ k.

157

Page 168: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

For any k-ary predicate P, consider the problem of distinguishing between the following two

distributions on (a, b) ∈ [n](k) × {0, 1}where (a, b) are sampled as follows:

1. Dnull: 1) Choose a uniformly at random from [n](k), and 2) choose b uniformly at random

and independently from {0, 1}.

2. Dplanted(x), x ∈ {0, 1}n: 1) Choose a uniformly at random from [n](k), and 2) set b =

P(xa).

Note that a is chosen uniformly at random from [n](k) in both distributions. However, while

the bit b is independent of a inDnull, it may be correlated with a inDplanted.

A distinguisher for the above problem gets access to m i.i.d. samples ut = (at, bt), t ∈ [m]

from one of Dnull and Dplanted(x) for a uniformly randomly chosen x ∈ {0, 1}n and outputs

either “planted” or “null”. We say that the distinguisher succeeds with probability p if:

Pru1,...,um∼Dnull [L(u1, ..., um) = “null”] + Pr x∈R{0,1}n;u1,...,um∼Dplanted(x)

[L(u1, ..., um) = “planted”]

2≥ p

Note: In the language used in the previous sections, think of “null” as being equivalent to 0

and “planted” being equivalent to 1, that is,Dnull ≡ D0 andDplanted(x) ≡ D1(x). Therefore, the

success probability of the distinguisher L can be written as

Pru1,...,um∼D0 [L(u1, ..., um) = 0] + Prx∈R{0,1}n;u1,...,um∼D1(x)[L(u1, ..., um) = 1]2

≥ p (4.3)

In particular, if x ∈ {0, 1}n is “revealed” to a distinguishing algorithm, then it is easy to use

Θ(log(1/ε)) samples and constant width branching program to distinguish correctly with prob-

ability at least 1− ε betweenDnull andDplanted.

4.4.2 Main Result

The main result of this section is the following sample-memory trade-off for any distinguisher:

158

Page 169: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Theorem 4.4.1. Let P be a t-resilient k-ary predicate. Let 0 < ε < 1 − 3 log 24log n and k < n/c.

Suppose there’s an algorithm that distinguishes between Dnull and Dplanted with probability at least

1/2+ s and uses< nε memory. Then, whenever 0 < t ≤ k < n(1−ε)

6 /3 and s > c1(nt )−( 1−ε

36 )t, the

algorithm requires (nt )( 1−ε

36 )t samples. Here, c and c1 are large enough constants.

Note that when k is a constant, this theorem gives a sample-memory tradeoff even for Ω(n)

memory.

The argument yields a slightly better quantitative lower bound for the special case when P is

the parity function, that is, P(x) = (∑k

i=1 xi) mod 2. We will represent this function by Xor.

Theorem 4.4.2. Let 0 < ε < 1− 3 log 24log n and P be the parity predicate Xor on k = t bits. Suppose

there’s an algorithm that distinguishes between Dnull and Dplanted with probability at least 1/2+ s

and uses< nε memory. Then for k ≤ n/c5, if s > 3(nk )−( 1−ε

18 )k, the algorithm requires (nk )( 1−ε

18 )k

samples.

Weprove bothTheorem4.4.1 and4.4.2 (in Section 4.4.4) via the same sequence of steps except

for a certain quantitative bound presented in Lemma 4.4.1. In the next subsection, we give an

algorithm that takes O(nε+k)kmemory and O(n(1−ε)k) samples, anddistinguishes betweenDnull

and Dplanted for any predicate P, with probability 99/100. Thus, the lower bounds are almost

tight up to constant factors in the exponent for the parity predicate. The question of whether

there exists an algorithm that runs in O(n(1−ε)t) samples and O(nε) memory, and distinguishes

betweenDnull andDplanted with high probability, for t-resilient predicates P, remains open.

4.4.3 Tightness of the Lower Bound

In this section, we observe that there exists an algorithmA that takesO((nε+k)·k log n)memory

andO(n(1−ε)k · (nε + k)) samples, and distinguishes betweenDnull andDplanted for any predicate

P, with probability 99/100 (for nε > 10).

A runs over 4n(1−ε)k · (nε + k) samples and stores the first 2(nε + k) samples (a, b) ∈

[n](k) × {0, 1} such that ai ≤ nε + k,∀i ∈ [k], that is, the bit b depends only on the first5c is a large enough constant

159

Page 170: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

nε+ k bits of x under the distributionDplanted(x). IfA encounters less than 2(nε+ k) samples of

the above mentioned form, A outputs 1 (“planted”). Otherwise, A goes over all the possibilities

of the first nε + k bits of x (2nε+k possibilities in total) and checks if it could have generated the

stored samples. If there exists a y ∈ {0, 1}nε+k that generated the stored samples, A outputs 1

(“planted”), otherwise A outputs 0.

It’s easy to see that A uses m = 4n(1−ε)k · (nε + k) samples and at most 2(nε + k) · k log n

memory (as it takes only k log n memory to store a sample). Next, we calculate the probability

of success. Let Zj be a random variable as follows: Zj = 1 if the jth sample (aj, bj) is such that

aij ≤ nε + k,∀i ∈ [k] and 0 otherwise.

Pr[Zj = 1] =|[nε + k](k)|

|[n](k)|≥ n−(1−ε)k

And E[∑m

j=1 Zj] = 4(nε + k). By Chernoff bound, Pr[∑

j Zj < 2(nε + k)] ≤ e−4(nε+k)

8 ≤ 1100 .

Therefore, the probability that A stores 2(nε + k) samples is at least 99/100. It’s easy to see that

A always outputs 1 when the samples are generated fromDplanted(x) for any x.

The probability thatA outputs 1, given that it stored 2(nε+ k) samples, when the samples are

generate fromDnull is equal to the probability that there exists a y ∈ {0, 1}nε+k that could have

generated the stored samples. Let (a′1, b′1), ..., (a′2(nε+k), b′2(nε+k)) be the stored samples. There

are at most 2nε+k sequences of b′1, ..., b′2(nε+k) generated by some y given {a′j}j∈[2(nε+k)]. As, un-

der Dnull, b is chosen uniformly at random from {0, 1}, the probability that there exists y ∈

{0, 1}nε+k that could have generated the stored samples is at most 2nε+k

22(nε+k) = 2−(nε+k) ≤ 1/100.

Hence, the probability of success is at least 99/100.

4.4.4 Proof of Theorems 4.4.1 and 4.4.2

Fix a t-resilient k-ary predicate P. Let B be a branching program of width d and length m that

has a success probability of p = 1/2 + s for distinguishing between Dnull and Dplanted(x) (x is

uniformly distributed over {0, 1}n) for the predicate P.

160

Page 171: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

We first use hybrid argument to obtain that the branching program must have a non-trivial

probability of distinguishing with a single sample. Towards this, defineHj(x) to be the distribu-

tion overm samples where the first j samples are drawn fromDplanted(x) and the remainingm− j

samples are fromDnull.

From Equation (4.3) for B, we obtain:

12

Pru1,...,um∼Dnull

[B(u1, ..., um) = 0] + 1− Prx∈R{0,1}n;

u1,...,um∼Dplanted(x)

[B(u1, ..., um) = 0]

≥ 12+ s

=⇒ Prx∈R{0,1}n;

u1,...,um∼Dnull

[B(u1, ..., um) = 0]− Prx∈R{0,1}n;

u1,...,um∼Dplanted(x)

[B(u1, ..., um) = 0] ≥ 2s

The above expression canbewritten as a telescopic sumof the distinguishing probabilities over

them+ 1 hybrids,Hj(x), j ∈ {0, ...,m}.

Prx∈R{0,1}n;

(u1,...,um)∼H0(x)

[B(u1, ..., um) = 0]− Prx∈R{0,1}n;

(u1,...,um)∼Hm(x)

[B(u1, ..., um) = 0]

=m∑j=1

Prx∈R{0,1}n;

(u1,...,um)∼Hj−1(x)

[B(u1, ..., um) = 0]− Prx∈R{0,1}n;

(u1,...,um)∼Hj(x)

[B(u1, ..., um) = 0]

Thus, there is a j′ ∈ {1, ...,m} such that

Prx∈R{0,1}n;

(u1,...,um)∼Hj′−1(x)

[B(u1, ..., um) = 0]− Prx∈R{0,1}n;

(u1,...,um)∼Hj′ (x)

[B(u1, ..., um) = 0] ≥ 2sm

(4.4)

Next, we will show that for 0 < ε < 1 − 3 log(24)log n , d ≤ 2nε and 0 < t ≤ k <

n(1−ε)/6/3, n/c6, B distinguishes between the hybridsHj′−1(x) andHj′(x) with probability of at

most pt ≤ c1(nt )−( 1−ε

18 )t (where c1 is a large enough constant). Therefore, 2sm ≤ c1(nt )−( 1−ε

18 )t.

When P = Xor (t = k), we will achieve better bounds. We show that for 0 < k < n/c, B dis-

tinguishes between the hybridsHj′−1(x) andHj′(x)with probability of atmost p′k ≤ 5(nk )−( 1−ε

9 )k.

6c is a large enough constant.

161

Page 172: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Therefore, 2sm ≤ 5(nk )−( 1−ε

9 )k.

Theorems 4.4.1 and 4.4.2 follows through following observations:

1. For 0 < ε < 1 − 3 log(24)log n and 0 < t ≤ k < {n(1−ε)/6/3, n/c}, as 2s

m ≤ c1(nt )−( 1−ε

18 )t, if

m ≤ (nt )( 1−ε

36 )t, then s ≤ c12 · (nt )

−( 1−ε36 )t.

2. When P = Xor, t = k, for 0 < ε < 1− 3 log(24)log n and 0 < k < n/c, as 2s

m ≤ 5(nk )−( 1−ε

9 )k, if

m ≤ (nk )( 1−ε

18 )k, then s ≤ 52 · (

nk )−( 1−ε

18 )k.

Now, we are ready to prove the upper bound on the capabilities ofB in distinguishing between

the hybridsHj′−1(x) andHj′(x). For 0 < t ≤ k < n/c, let dt = (nt )t.

Let Li be the set of vertices in the layer-i of the branching program B. Let Ej(v) represent the

event and Pj(v) be the probability of reaching the vertex v of the branching program B when x

is picked uniformly at random from {0, 1}n and them samples are drawn fromHj(x). Let Ej(0)

represents the event of B outputting 0 when x is picked uniformly at random from {0, 1}n and

them samples are drawn fromHj(x). Let v1 and v2 be two vertices in the branching program such

that v1 is in an earlier layer than v2. Let Pj(v2 | v1) be the probability of reaching the vertex v2 of

the branching program B given that the computational path also reached the vertex v1, when x is

picked uniformly at random from {0, 1}n and the m samples are drawn from Hj(x). Let Qj(v)

be the probability of the branching program outputting 0 given that it reached vertex v, when x

is picked uniformly at random from {0, 1}n and them samples are drawn fromHj(x). Note that

if v is a vertex in layer-i of the branching program such that i ≥ j,Qj(v) does not change with the

choice of x as all the samples after the jth layer are independently drawn fromD0 (Dnull).

Then, we can rewrite the expression on left hand side of Equation 4.4 as

∑v1∈Lj′−1,v2∈Lj′

Pr[Ej′−1(0) ∧ Ej′−1(v2) ∧ Ej′−1(v1)]−

∑v1∈Lj′−1,v2∈Lj′

Pr[Ej′(0) ∧ Ej′(v2) ∧ Ej′(v1)]

162

Page 173: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

For a vertex v2 in the j′th layer, conditioned on the event Ej′(v2), event Ej′(0) is independent of

the event Ej′(v1). Similarly, conditioned on Ej′−1(v2), event Ej′−1(0) is independent of the event

Ej′−1(v1). And as the last m − j′ samples are drawn from the same distribution D0, Qj′(v2) =

Qj′−1(v2).

For a vertex v1 in the (j′ − 1)th layer, both Pj′(v1) and Pj′−1(v1) are equal to the probability of

reaching the vertex v1, when x is picked uniformly at random from {0, 1}n and the first j′ − 1

samples are drawn fromD1(x).

Hence, the expression can be rewritten as

∑v1∈Lj′−1,v2∈Lj′

Pj′−1(v1) · Pj′−1(v2 | v1) · Qj′−1(v2)−

∑v1∈Lj′−1,v2∈Lj′

Pj′(v1) · Pj′(v2 | v1) · Qj′(v2)

=∑

v1∈Lj′−1,v2∈Lj′

Pj′−1(v1) · Qj′(v2) ·(Pj′−1(v2 | v1)− Pj′(v2 | v1)

)

=∑v2∈Lj′

Qj′(v2)

∑v1∈Lj′−1

Pj′−1(v1) ·(Pj′−1(v2 | v1)− Pj′(v2 | v1)

)Let L be the set of vertices in the layer-(j′ − 1) of the branching program B such that ∀v1 ∈

L, Pj′−1(v1) ≥ d−1d−1t . Then, the above expression, can be rewritten as

=∑v2∈Lj′

Qj′(v2)

(∑v1∈L

Pj′−1(v1) ·(Pj′−1(v2 | v1)− Pj′(v2 | v1)

))

+∑v2∈Lj′

Qj′(v2)

(∑v1 ∈L

Pj′−1(v1) ·(Pj′−1(v2 | v1)− Pj′(v2 | v1)

))

≤∑v2∈Lj′

Qj′(v2)

(∑v1∈L

Pj′−1(v1) ·(Pj′−1(v2 | v1)− Pj′(v2 | v1)

))

+∑v1 ∈L

Pj′−1(v1)

∑v2∈Lj′

Qj′(v2) · Pj′−1(v2 | v1)

163

Page 174: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

≤∑v2∈Lj′

Qj′(v2)

(∑v1∈L

Pj′−1(v1) ·(Pj′−1(v2 | v1)− Pj′(v2 | v1)

))+

1dt

(4.5)

The last inequality follows from the fact that the width of the branching program is d, for a

vertex v1 not in L, Pj′−1(v1) is at most 1d·dt and that the summation of the expression over v2 can

be at most 1.

Let Px|Ej′ (v) be the probability distribution of the random variable x conditioned on the event

Ej′(v). For notational easiness, we will also denote this distribution by Px|v. We claim that for

all v1 ∈ L, the distribution Px|v1 has min-entropy of at least (n − log(d) − log(dt)), that is,

∀x′ ∈ {0, 1}n, Px|v1(x′) ≤ d · dt · 2−n.

The proof is as follows: as x is chosen uniformly at random from {0, 1}n, for all x′,

∑v1∈L

Pr[x = x′ ∧ Ej′(v1)] ≤ Pr[x = x′] ≤ 2−n

This implies,

∑v1∈L

Pr[x = x′ | Ej′(v1)] · Pj′(v1) ≤ 2−n

=⇒ Pr[x = x′ | Ej′(v1)] ≤ 2−n · 1Pj′(v1)

= 2−n · 1Pj′−1(v1)

≤ 2−n · 1d−1d−1t

=⇒ Px|v1(x′) ≤ d · dt · 2−n (4.6)

Let S(v1,v2) be the set of all the labels (a, b) ∈ [n](k) × {0, 1} such that the edge labeled (a, b)

at vertex v1, goes into vertex v2 in the next layer. Let P(aj′ ,bj′ )|v1 represent the distribution of the j′

sample conditioned on the eventEj′(v1), when x is chosen uniformly at random from {0, 1}n and

them samples are chosen from Hj′(x). As the j′th sample is drawn from the distribution D1(x),

for every a ∈ [n](k),

P(aj′ ,bj′ )|v1(a, b) =∑

x′∈{0,1}nPr[x = x′ | Ej′(v1)] · Pr[(aj′ , bj′) = (a, b) | x = x′]

164

Page 175: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

=1

|[n](k)|·

∑x′:P(x′a)=b

Px|v1(x′)

(4.7)

This is because a is chosen uniformly from [n](k) and conditioned on x, the j′th sample is inde-

pendent of v1.

When the samples are drawn fromHj′−1(x), the j′th sample is drawn fromD0 and is indepen-

dent of the event Ej′−1(v1). Therefore, the probability of j′th sample being (a, b) in this hybrid is1

2|[n](k)| for all a ∈ [n](k), b ∈ {0, 1}.

For every v1 ∈ L, we can rewrite the expression(Pj′−1(v2 | v1)− Pj′(v2 | v1)

)as follows:

Pj′−1(v2 | v1)− Pj′(v2 | v1) =∑

(a,b)∈S(v1,v2)

(P(aj′ ,bj′ )|Ej′−1(v1)(a, b)− P(aj′ ,bj′ )|v1(a, b)

)=

∑(a,b)∈S(v1,v2)

(1

2|[n](k)|− P(aj′ ,bj′ )|v1(a, b)

)

=∑

(a,b)∈S(v1,v2)

12|[n](k)|

− 1|[n](k)|

·

∑x′:P(x′a)=b

Px|v1(x′)

=

1|[n](k)|

·

∑(a,b)∈S(v1,v2)

12−

∑x′:P(x′a)=b

Px|v1(x′)

(4.8)

Next, we will show that ∀v1 ∈ L, the above expression∣∣∣ 12 −∑x′:P(x′a)=b Px|v1(x′)

∣∣∣ is small for

most samples (a, b) (Lemma 4.4.1). Define Tl = {a ∈ {0, 1}n :∑n

i=1 ai = l} for l ∈ N.

Lemma 4.4.1. For all v1 ∈ L, 0 < ε < 1− 3 log 24log n , 0 < t ≤ k < {n

c , n(1−ε)

6 /3},

∣∣∣∣∣∑x′

Px|v1(x′) · (−1)P(x′a)∣∣∣∣∣ ≤ c2n−(

1−ε18 )t

for all but at most c2n−(1−ε18 )t fraction of a ∈ [n](k) (recall that P is a t-resilient k-ary predicate).

165

Page 176: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

For all v1 ∈ L, 0 < ε < 1− 3 log 24log n , 0 < k < n

c ,

∣∣∣∣∣∑x′

Px|v1(x′) · (−1)Xor(x′a)∣∣∣∣∣ ≤ 2

(nk

)−( 1−ε9 )k

for all but at most 2(nk )−( 1−ε

9 )k fraction of a ∈ [n](k).

Here, c and c2 are large enough constants.

Before, we prove the Lemma, we prove the following claim:

Claim 4.4.1. For all v1 ∈ L, 0 < ε < 1− 3 log 24log n , 0 < t ≤ l < n

c ,

Ea∈Tl

(∑x′

Px|v1(x′) · (−1)⟨a,x′⟩)2

≤ 2(nl

)−( 1−ε3 )l

Proof. As v1 ∈ L, using Equation 4.6, we know that for all values of x′, Px|v1(x′) ≤ 2nε−n · dt.

Ea∈Tl

(∑x′

Px|v1(x′) · (−1)⟨a,x′⟩)2

= Ea∈Tl

(∑x′,x′′

Px|v1(x′) · Px|v1(x′′) · (−1)⟨a,x′+x′′⟩

)

=∑x′,x′′

Px|v1(x′) · Px|v1(x′′) · Ea∈Tl

(−1)⟨a,x′+x′′⟩

=∑x′,z

Px|v1(x′) · Px|v1(z+ x′) · Ea∈Tl

(−1)⟨a,z⟩

Let BTl(δ) = {γ ∈ {0, 1}n | |Ea∈Tl (−1)⟨a,γ⟩| > δ}. Using Lemma 4.1.1 [KRT17], we know

that for 1 ≥ δ ≥ ( 8ln )l2 ,

|BTl(δ)| ≤ 2e−δ2/l·n/8 · 2n

We can rewrite the expression (Ea∈Tl

(∑x′ Px|v1(x′) · (−1)⟨a,x′⟩

)2) as follows:∑

z∈BTl (δ)

∑x′

Px|v1(x′) · Px|v1(z+ x′) · Ea∈Tl

(−1)⟨a,z⟩

+∑

z ∈BTl (δ)

∑x′

Px|v1(x′) · Px|v1(z+ x′) · Ea∈Tl

(−1)⟨a,z⟩

166

Page 177: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

≤∑

z∈BTl (δ)

∑x′

Px|v1(x′) · Px|v1(z+ x′) · | Ea∈Tl

(−1)⟨a,z⟩|

+∑

z ∈BTl (δ)

∑x′

Px|v1(x′) · Px|v1(z+ x′) · | Ea∈Tl

(−1)⟨a,z⟩|

≤∑

z∈BTl (δ)

∑x′

Px|v1(x′) · Px|v1(z+ x′) +∑

z ∈BTl (δ)

∑x′

Px|v1(x′) · Px|v1(z+ x′) · δ

≤∑

z∈BTl (δ)

∑x′

Px|v1(x′) · Px|v1(z+ x′) + δ

≤ 2nε−n · dt · |BTl(δ)|+ δ

≤ 2e−δ2/l·n/8 · 2nε · dt + δ

The second inequality follows from the definition of BTl(δ) and the fact that

|Ea∈Tl (−1)⟨a,z⟩| ≤ 1 always. The fourth inequality follows from Equation 4.6 as

Px|v1(z + x′) ≤ 2nε−n · dt,∀x′, z. And, the last inequality follows from the bound on

|BTl(δ)|.

Therefore, ∀δ, ( 8ln )l2 ≤ δ ≤ 1, Ea∈Tl

(∑x′ Px|v1(x′) · (−1)⟨a,x′⟩

)2 ≤ 2e−δ2/l·n/8 · 2nε · dt + δ.

For 0 < t ≤ l < n/c, dt = (nt )t ≤ (nl )

l. Let δ = (nl )−( 1−ε

3 )l. As l < n/c where c is large

enough constant, ( 8ln )l2 ≤ δ. Therefore,

Ea∈Tl

(∑x′

Px|v1(x′) · (−1)⟨a,x′⟩)2

≤ 2e−(nl )

−(1−ε) 23 ·n/8 · 2nε ·(nl

)l+(nl

)−( 1−ε3 )l

= 2e−(nl )

13 (1−ε)+ε·l/8 · 2nε ·

(nl

)l+(nl

)−( 1−ε3 )l

≤ 2−(nl )

13 (1−ε)+ε·l/8+nε+l log(n/l)+1 +

(nl

)−( 1−ε3 )l

≤ 2−l log(n/l) +(nl

)−( 1−ε3 )l

≤ 2(nl

)−( 1−ε3 )l

Here, the inequalities follow by assuming that l < n/c, for large enough c such that (n/l)1/3 >

36 log(n/l), and 0 < ε < 1− 3 log 24log n such that n 1

3 (1−ε) ≥ 24. The second last inequality follows

167

Page 178: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

from the following calculations.

(nl

) 13 (1−ε)+ε

· l/8 = 124

n13 (1−ε)nεl

23 (1−ε) +

336

l(nl

) 13(nl

) 23 ε

> nε + 3l log(n/l)

This proves the claim.

Proof. of Lemma 4.4.1. We first prove the statement for the general t-resilient k-ary predicate P

for 0 < ε < 1− 3 log 24log n such that 0 < t ≤ k < {n

c , n(1−ε)

6 /3}. Using Claim 4.4.1, we know that

for all 0 < t ≤ l < nc ,

Ea∈Tl

(∑x′

Px|v1(x′) · (−1)⟨a,x′⟩)2

≤ 2(nl

)−( 1−ε3 )l

We consider the following expression,

Ea∈[n](k)

(∑x′

Px|v1(x′) · (−1)P(x′a))2

Substituting (−1)P with its Fourier expansion, we get

Ea∈[n](k)

(∑x′

Px|v1(x′) · (−1)P(x′a))2

= Ea∈[n](k)

∑x′

Px|v1(x′) ·∑α⊆[k]

P(α)χα(x′a)

2

= Ea∈[n](k)

∑α⊆[k],|α|≥t

P(α)∑x′

Px|v1(x′) · χα(x′a)

2

(4.9)

≤ Ea∈[n](k)

∑α⊆[k],|α|≥t

P(α)2 ∑

α⊆[k],|α|≥t

(∑x′

Px|v1(x′) · χα(x′a)

)2 (4.10)

= Ea∈[n](k)

∑α⊆[k],|α|≥t

(∑x′

Px|v1(x′) · χα(x′a)

)2

(4.11)

168

Page 179: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

=k∑j=t

∑α⊆[k],|α|=j

Ea∈[n](k)

(∑x′

Px|v1(x′) · χα(x′a)

)2

Equality 4.9 follows from the fact that P is t−resilient. Inequality 4.10 follows from Cauchy-

Schwarz. Equality 4.11 follows from Parseval’s identity. Let v(a, α) be a n-bit vector defined as

follows: ∀i ∈ [k], set v(a, α)ai = 1 if only if αi = 1 (and 0 otherwise). It’s easy to see that

χα(x′a) = (−1)⟨x′,v(a,α)⟩ and |v(a, α)| = |α|. And, for a fixed α, when a is uniformly distributed

over [n](k), v(a, α) is uniformly distributed over T|α|. Therefore, the above expression can be

rewritten as,

Ea∈[n](k)

(∑x′

Px|v1(x′) · (−1)P(x′a))2

=k∑j=t

∑α⊆[k],|α|=j

Ev∈Tj

(∑x′

Px|v1(x′) · (−1)⟨v,x′⟩)2

≤ 2 ·k∑j=t

∑α⊆[k],|α|=j

(nj

)−( 1−ε3 )j (4.12)

≤ 2 ·k∑j=t

((kj

)·(nj

)−( 1−ε3 )j)

≤ c′ ·k∑j=t

((ekj

)j

· j(1−ε3 )j

(n1−ε3 )j

)(4.13)

≤ c′ ·k∑j=t

n−(1−ε)

6 j ≤ 2c′ · (n−(1−ε)

6 t) (4.14)

Inequality 4.12 follows fromClaim 4.4.1. For large enough c′, Inequality 4.13 follows from Ster-

ling’s bound on factorials. As k ≤ n 1−ε6 /3 and assuming n 1−ε

6 ≥ 2, Inequality 4.14 follows.

Therefore,

Ea∈[n](k)

(∑x′

Px|v1(x′) · (−1)P(x′a))2

≤ c′′ · n−(1−ε)

6 t

for large enough constant c′′. From this expression, it’s easy to see that

∣∣∣∣∣∑x′

Px|v1(x′) · (−1)P(x′a)∣∣∣∣∣ ≤ c2 · n−

(1−ε)18 t

169

Page 180: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

for all but at most c2 · n−(1−ε)

18 t fraction of a ∈ [n](k) (c2 = c′′1/3).

Now, we prove the lemma for the special case of P = Xor. As parity function is symmetric,

Ea∈[n](k)

(∑x′

Px|v1(x′) · (−1)Xor(x′a))2

= Ea∈Tk

(∑x′

Px|v1(x′) · (−1)⟨a,x′⟩)2

Therefore, using Claim 4.4.1, we have shown that

Ea∈[n](k)

(∑x′

Px|v1(x′) · (−1)Xor(x′a))2

≤ δk = 2(nk

)−( 1−ε3 )k

for 0 < k < n/c.

From these expressions, it’s easy to see that |∑

x′ Px|v1(x′) · (−1)Xor(x′a)| ≥ (δk)1/3 for at most

(δk)1/3 fraction of the values of a ∈ [n](k) which completes the proof of the lemma.

Now, we come back to bounding the capabilities of B to distinguishing between the j′− 1 and

j′ hybrids (pt, p′t). Substituting the expression for Pj′−1(v2 | v1) − Pj′(v2 | v1) obtained from

Equation 4.8 in Equation 4.5, we get that

Prx∈R{0,1}n;

(u1,...,um)∼Hj′−1(x)

[B(u1, ..., um) = 0]− Prx∈R{0,1}n;

(u1,...,um)∼Hj′ (x)

[B(u1, ..., um) = 0]− 1dt

≤∑v2∈Lj′

Qj′(v2)

∑v1∈L

Pj′−1(v1) ·

1|[n](k)|

·

∑(a,b)∈S(v1,v2)

12−

∑x′:P(x′a)=b

Px|v1(x′)

=∑v1∈L

Pj′−1(v1)

∑v2∈Lj′

Qj′(v2) ·

1|[n](k)|

·

∑(a,b)∈S(v1,v2)

12−

∑x′:P(x′a)=b

Px|v1(x′)

≤∑v1∈L

Pj′−1(v1)

∑v2∈Lj′

Qj′(v2) ·

1|[n](k)|

·

∑(a,b)∈S(v1,v2)

∣∣∣∣∣∣ 12 −∑

x′:P(x′a)=b

Px|v1(x′)

∣∣∣∣∣∣

(4.15)

≤∑v1∈L

Pj′−1(v1)

1|[n](k)|

·∑v2∈Lj′

∑(a,b)∈S(v1,v2)

∣∣∣∣∣∣ 12 −∑

x′:P(x′a)=b

Px|v1(x′)

∣∣∣∣∣∣ (4.16)

170

Page 181: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

=∑v1∈L

Pj′−1(v1)

Ea∈R[n](k)

∑b∈{0,1}

∣∣∣∣∣∣ 12 −∑

x′:P(x′a)=b

Px|v1(x′)

∣∣∣∣∣∣ (4.17)

Inequality 4.15 follows just from taking the absolute values. Inequality 4.16 follows from the

fact that Qj′(v2) ≤ 1 for all v2 ∈ Lj′ . Equality 4.17 follows from the fact that each edge labelled

by (a, b) ∈ [n](k) × {0, 1} goes into some vertex in the next layerLj′ .

For the general t− resilient k-ary predicate P, Lemma 4.4.1 showed that, for all but c2n−(1−ε18 )t

fraction of a ∈ [n](k),

∑b∈{0,1}

∣∣∣∣∣∣ 12 −∑

x′:P(x′a)=b

Px|v1(x′)

∣∣∣∣∣∣ ≤ c2 · n−(1−ε18 )t

As the maximum value of this expression is 1, we have shown that

pt ≤∑v1∈L

Pj′−1(v1) · 2c2 · n−(1−ε18 )t +

(nt

)−t≤ c1

(nt

)−( 1−ε18 )t

for large enough constant c1 = 2c2 + 1.

For the special case of P = Xor, using a similar argument, we can show that for 0 < k < n/c,

as dk =(nk

)k,p′k ≤

∑v1∈L

Pj′−1(v1) · 2 · 2(nk

)−( 1−ε9 )k

+(nk

)−k≤ 5 ·

(nk

)−( 1−ε9 )k

.

This completes the proofs of Theorem 4.4.1 andTheorem 4.4.2. We note that the abovemen-

tioned proof is inspired from [NZ96], which proved that if the pseudorandombits are generated

using an extractor, then they fool space-bounded computation.

171

Page 182: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

If you don’t like the road you’re walking,

start paving another one.

Dolly Parton

5Pseudorandom Pseudo-Distributions for

ROBPs

The results in this chapter are based on joint work with Mark Braverman and Gil Co-

hen [BCG18].

In this chapter, we focus on the implications of bounded-space for randomized algorithms.

Space-bounded algorithms are typically studied by considering their non-uniform counter-

parts – read-once branching programs. Pseudorandom generators, hitting sets and pseudoran-

dom pseudo-distributions for read-once branching programs can be used for derandomization

of space-bounded computation. In this chapter, we introduce the notion of pseudorandom

172

Page 183: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

pseudo-distributions for read once branching programs and provide new constructions for the

same, with seed length that has near-optimal dependence on error. From now on, we will fo-

cus on read once branching programs and please refer to Section 0.2 for related work and formal

connection to space-bounded algorithms.

A length n, width w read-once branching program (ROBP) is a directed graph whose nodes,

called states, are partitioned to n layers, each consists of at most w states, as well as an additional

“start” state. The last layer consists of 2 states called “accept” and “reject”. From every state

but for the latter two, there are two outgoing edges, labeled by 0 and 1, to the following layer.

On input x ∈ {0, 1}n, the computation proceeds by following the edges according to the labels

given by the bits of x starting from the start state. The string x is accepted by the program if the

computation ends in the accept state.

We say that a distribution D on n-bit strings is (n,w, ε)-pseudorandom if for every length n,

width w ROBP, a path (string) that is sampled from D has, up to an additive error ε, the same

probability to end in the accept state as a truly random path. A truly random path corresponds

to a path picked uniformly at random from the 2n possible paths. An (n,w, ε)-pseudorandom

generator (PRG) is an algorithm ({0, 1}l → {0, 1}n) that takes in l bits and outputs n bits such

that the output distribution (uniform distribution over the range) is (n,w, ε)-pseudorandom.

The seed length of a PRG is the number of random bits it requires to generate the distribution,

that is, l. Informally, we call the PRG explicit, if each output bit can be computed efficiently

given the input and the index, that is, inO(log n)-space.

One can prove the existence of an (n,w, ε)-PRG with seed length O(log(nw/ε)). The proof

is non-constructive and hence, the PRG isn’t efficient. In his seminal paper, Nisan [Nis92] gave

an explicit construction of a PRGwith seed lengthO(log n · log(nw/ε)).

An (n,w, ε)-hitting set is a set of n-bit strings such that for every length n, width w ROBP,

whenever a truly randompath ends in the accept state with probability at least ε, then there exists

a path in the set that ends at the accept state. Hitting sets can be used to derandomizeRL. Even

for the deceptively simple looking problem of constructing hitting sets for widthw = 3 ROBPs,

173

Page 184: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

no progress wasmade for nearly two decades since [Nis92], until theworks of [ŠŽ11,GMR+12].

In particular, [GMR+12] construct near-optimal hitting sets in that setting. The best known

explicit hitting set (upto polyloglog factors), for general regime of parameters, was in factNisan’s

PRG [Nis92] (with seed length O(log n · log(nw/ε))) until [BCG18]. Subsequently, [HZ18]

gave a much simpler construction for hitting sets that has optimal dependence on error.

There has been much success in constructing PRGs for restricted types of ROBPs (see,

e.g., [INW94, NZ96, RTV06, BPW11, Ste12, BPW12, KNP11, KNP11, De11, IMZ12,

GMR+12, GMRZ13, RSV13, SVW14,GV17] and references therein) such as permutation and,

more generally, regular ROBPs [BRRY14, BV10]. These are programs in which every state but

for start, accept and reject, has in-degree 2. However, unrestricted ROBPs, namely, programs

in which the edges can be placed arbitrarily, proved more challenging and no improvement over

Nisan’s generator has been made in general regime of parameters.

In this work [BCG18], we obtain the first improved constructions of hit-

ting sets for unrestricted ROBPs (for any width w) by constructing hit-

ting sets with near-optimal dependence on ε (precisely, the seed length is

O(log(w/ε) log log(w/ε) + log2(n) · log

(log(1/ε)log n

)+ log n · logw

)). In fact, we intro-

duce and construct a new type of primitive we call a pseudorandom pseudo-distribution 1 that,

informally speaking, lies between hitting sets and pseudorandom distributions. We find this

notion to be of independent interest. After [BCG18], [CL20] simplified the construction for

pseudorandom pseudo-distributions, and achieved optimal dependence on error, improving

polyloglog factors.

Definition 5.0.1 (Pseudorandom pseudo-distributions). Let ρ1, . . . , ρ2s ∈ R and p1, . . . , p2s ∈

{0, 1}n. The sequence D = ((ρ1, p1), . . . , (ρ2s , p2s)) is an (n,w, ε)-pseudorandom pseudo-

distribution if for every length n, width w ROBP, the sum of all ρi’s for which the respective paths

pi end in the accept state is an ε-approximation to the probability of ending at the accept state by1The term “pseudo-distribution” is used in different contexts tomean different things, all under the general idea

that the object at hand shares some desired properties with a “proper” distribution. The closest research field inwhich the term pseudo-distributions is used (with a different meaning than ours) is Sum of Squares. However, wedo not believe this will cause any confusion.

174

Page 185: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

taking a truly random path in the program.

We stress that Definition 5.0.1 allows the ρi’s to take both positive and negative values. These

values are not necessarily bounded by 1 in absolute value, nor by any constant for thatmatter, and

they do not necessarily sum up to 1. Indeed, in the new construction, it is possible that |ρi| =

poly(nw/ε). Nevertheless, the definition requires that the numbers cancel out nicely so that

summing the ρi’s of the respective paths that arrive to the accept state yields an ε-approximation

for the probability of arriving to the accept state by taking a truly randompath (and, in particular,

the sum is a number in [−ε, 1+ ε]). An (n,w, ε)-pseudorandom pseudo-generator (PRPG) is an

algorithm ({0, 1}l → R × {0, 1}n) that takes in l bits and outputs a real number and an n-bit

string such that the output distribution (sequence achieved by iterating over all l-bit inputs) is

a (n,w, ε)-pseudorandom pseudo-distribution. Informally, we call the PRPG explicit, if each

output bit (and the real number) can be computed efficiently given the input and the index, that

is, inO(log n)-space2.

Pseudorandom pseudo-distributions yield hitting sets. Observe that, if one sim-

ply ignores the ρi’s, and considers the set of paths {p1, . . . , p2s} in an (n,w, ε)-pseudorandom

pseudo-distribution, one obtains an (n,w, ε′)-hitting set for any ε′ > ε. Indeed, consider a pro-

gram in which the probability to reach the accept state is at least ε′. Then, the sum of ρi’s which

correspond to paths pi ending in the accept state is at least ε′ − ε > 0. Surely then, at least one

path pi ends in the accept state.

Pseudo-distributionsareasgoodasdistributions forderandomizingBPL. By

the above, a pseudorandom pseudo-distribution suffices to derandomize one-sided error ran-

domized algorithms. In fact, more is true. While D is not a distribution per se, it is as good

as such for the purpose of derandomizing two-sided error randomized algorithms, at least when

using the naïve derandomization method described above. Indeed, the straightforward deran-

domization using a pseudorandom (proper) distribution, which sums the probability mass of2The construction of this chapter is O(log2(n) + log(n) log(w) + log(1/ε))-computable.

175

Page 186: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

the relevant paths, works just as well for pseudo-distributions as one can sum up the ρi’s which,

in some sense, generalize the probability mass. Of course, the space requirement now depends

on∑

i |ρi| (also assuming there’s an explicit PRPG corresponding to the pseudo-distribution).

5.0.1 Main result

Themain contribution of this work is an explicit construction (computable in O(log2 n+log n ·

logw+log(1/ε)) space) of a pseudorandompseudo-distributionwith near-optimal dependence

on ε. This, in particular, yields the first improved construction of hitting sets for unrestricted

ROBPs.

Theorem 5.0.1 (Main result). For all integers n,w ≥ 1 and 0 < ε < 1/n, there exists an explicit

(n,w, ε)-pseudorandom pseudo-distribution with seed length

O (log(n) log(nw) + log(1/ε)) .

In particular, for w = n the seed length is O(log2 n+ log(1/ε)).

See Theorem 5.3.1 for the full statement and a discussion on the explicitness of the construc-

tion. Consider, for simplicity, the setting where w = n. Further, for ease of discussion, ignore

double-logarithmic factors. Recall that Nisan’s generator has seed length O(log n · log(n/ε))

whereas the optimal seed length is O(log(n/ε)). That is, the problem is all about “shaving

off” the redundant log n factor. In Theorem 5.0.1, we are able to shave off this factor from

the log(1/ε) term and obtain near-optimal dependence on ε in the setting of pseudorandom

pseudo-distributions (and, thus, for hitting sets). Whereas the result doesn’t give any better de-

randomization of BPL orRL, it strictly improves upon prior works when log(1/ε) = ω(log n),

a regime of parameters that is well-motivated by the work of Saks and Zhou [SZ99].

At a very high level, the underlying idea behind the construction is to work with a rough ap-

proximation for the probability of acceptance together with a sequence of finer and finer cor-

rection terms, which add up to yield the desired error guarantee. Generating and maintaining

176

Page 187: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

these correction terms require the flexibility of working with negative, unbounded, weights. In

Section 5.1, we give a detailed overview of the proof of Theorem 5.0.1 in which we emphasize

the main ideas and new techniques.

5.1 Proof Overview

Unfortunately, the construction is fairly involved and the analysis requires a significant amount of

work. To guide the reader through the formal proof, in this section we give an informal overview

of the construction and its analysis. This section is not required for the sequel and canbe skipped,

though we believe the informal manner in which it is written and the discussions it contains are

of value.

We start this section by presenting the well-known reduction from the problem of construct-

ing PRGs for ROBPs to the problem of sparsifying matrix product. Then, in Section 5.1.2,

we rederive Nisan’s result via samplers rather than using hash functions as was done origi-

nally [Nis92], expander graphs [INW94], or seeded extractors [RR99]. While not improving

upon previous works, in this section we present the notion of a sampler [BR94], which plays

a key role in the construction, and show how it can be used for constructing PRGs. In Sec-

tion 5.1.3, we introduce andmotivate the idea of working with differences, or delta, of samplers.

This discussion, even being very informal, should be helpful in guiding the reader through the

following sections. In Section 5.1.4 we introduce the notion of amatrix-bundle sequence (MBS)

and its smallness; define multiplication rules for MBSs in Section 5.1.5 and Section 5.1.6, and

proceed from there to describe the construction and its analysis.

5.1.1 The reduction to sparsifying matrix product

It is folklore that the problem of constructing PRGs for ROBPs can be reduced to the problem

of sparsifying matrix product or, more precisely, the product of matrices when represented in

a certain way. To describe this reduction, consider a length n, width w ROBP. The transition

between a pair of consecutive layers Pt, Pt+1 in the program can be represented as the average of

177

Page 188: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

two w × w zero-one matricesMt = (M0t + M1

t)/2, where (M0t )i,j = 1 if and only if the edge

labeled by 0 that is going out of state i in layer t ends in state j of layer t+1. M1t is similarly defined

with respect to edges labeled by 1. Note that for every t, the matrixMt is stochastic. Thus,Mt

represents a single step from layer t to t + 1 when the bit is uniformly distributed over {0, 1}.

Mstart,accept represents the acceptance probability when traversing a truly random path in P. In

these terms, the goal is then to approximate thematrix productM = M1M2 · · ·Mn in bounded

space. More precisely, given indices i, j ∈ [w] as inputs and access to any entry of the matrices,

one would like to compute an ε-approximation toMi,j.

Slightly deviating from previous works, the most suitable measure of approximation for the

new construction is obtained by using the infinity norm. The infinity norm of a w × wmatrix

A, is defined by ∥A∥∞ = maxi∈[w]∑w

j=1 |Ai,j|. We say that two matrices A,B are ε-close, or

that A ε-approximates B, if ∥A − B∥∞ ≤ ε. As with any norm, ∥ · ∥∞ is sub-additive, namely,

∥A + B∥∞ ≤ ∥A∥∞ + ∥B∥∞. We make use of two further properties of the infinity norm.

First, ∥ · ∥∞ is sub-multiplicative, namely, ∥AB∥∞ ≤ ∥A∥∞∥B∥∞. Second, ∥A∥∞ = 1 for any

stochastic matrix A.

Now, clearly, one can expand

M = 2−nn∏t=1

(M0t +M1

t) = Er∼{0,1}n

n∏t=1

Mrtt .

Note thatMrtt (i, j) is 1 if there is an edge from ith vertex of layerPt to jth vertex of layerPt+1 labelled

rt. Therefore, the RHS can be thought of as taking all paths in the ROBP, namely, one for each

choice of r ∈ {0, 1}n. A productive point of view for the construction of PRGs for ROBPs is

that of sparsifying the above product such thatM is ε-approximated by Er∼H∏n

t=1Mrtt , ending

up with a small set of pathsH ⊆ {0, 1}n to average on.

Firstly, we introduce some notation and give intuition for a recursive sparsification process.

Let A = (A1, . . . ,A2s) be a sequence of w × w stochastic matrices. From here on, all matrices

in this section are of order w × w. The matrix that is realized by A is given by ⟨A⟩ = Ei [Ai].

178

Page 189: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Similarly, let B = (B1, . . . ,B2s) and ⟨B⟩ = Ej [Bj]. Assume that ⟨A⟩ εA-approximates some

matrix of interest A and ⟨B⟩ εB-approximates B (such that ∥A∥∞, ∥B∥∞ ≤ 1). We think of s as

the complexity of the “representation” and would like to keep it small. If one wishes to approxi-

mate the product AB, the natural approach would be to consider the product of approximations

⟨A⟩⟨B⟩ = Ei,j∼[2s] AiBj. Indeed, using the properties of ∥ · ∥∞, we have that

∥AB− ⟨A⟩⟨B⟩∥∞ = ∥AB− ⟨A⟩B+ ⟨A⟩B− ⟨A⟩⟨B⟩∥∞

≤ ∥AB− ⟨A⟩B∥∞ + ∥⟨A⟩B− ⟨A⟩⟨B⟩∥∞

≤ ∥A− ⟨A⟩∥∞∥B∥∞ + ∥⟨A⟩∥∞∥B− ⟨B⟩∥∞

≤ ∥A− ⟨A⟩∥∞ · 1+ 1 · ∥B− ⟨B⟩∥∞

≤ εA + εB. (5.1)

Thus, taking the product of the approximations ⟨A⟩, ⟨B⟩ yields a very good approximation guar-

antee. However, taking this product is costly in terms of representation as it doubles the com-

plexity of the representation from s to 2s, that is, the expectation is over 22s terms. To reduce the

number of terms, we want to sparsify the product of the two matrix representations.

This approach was taken by many previous works, either implicitly or explicitly using hash

functions [Nis92, Nis94], expander graphs [INW94, RV05], and seeded extractors [NZ96,

RR99, BRRY14, Arm98]. We are going to describe such derandomization based on samplers.

Besides being a natural perspective, we work with samplers because, for the construction in this

chapter, we require flexibility that we only know how to obtain using samplers (see Section 5.1.5

for details). Interestingly enough, though, the constructions of the samplers we make use of

are based on expander graphs and seeded extractors. In the next section we rederive Nisan’s re-

sult [Nis92] via samplers to prepare the ground for the main result subsequently).

179

Page 190: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

5.1.2 DerivingNisan’s result via samplers

In this section, we show, on a high level, that a product of n w×w stochastic matrices can be ap-

proximated to an error of ε by a set of 2O(log2(n)+log(n) log(w/ε)) many products (with each matrix

being 0-1 stochastic) using the method of sparsification. This also gives Nisan’s PRG through

observations made in the last section. Informally, a sampler is a randomized algorithm that, with

high probability over its randomness, yields a good approximation for the expectation of any

bounded function by querying the latter on a small number of points. A sampler has two param-

eters: the query complexity that determines how many queries are required by the sampler, and

its randomness complexity, which is the number of truly random bits required for the sampling.

An averaging sampler is a special type of sampler where the randomness is only used to select the

points on which to query the function, independently of the function being considered. Only

then the function is queried, and the output is the average of the corresponding values.

In the following definition, and throughout the chapter, we use the graph-theoretic perspec-

tive of averaging samplers and use the term sampler instead of an averaging sampler. More on

samplers can be found in the excellent survey by Goldreich [Gol11] and in Vadhan’s excellent

monograph [Vad11].

Definition 5.1.1 (Samplers [BR94]). A left-regular bipartite graph G = (L,R,E) is an (ε, δ)-

sampler if for every function f : R → [0, 1], for all but δ-fraction of vertices v ∈ L it holds that

∣∣∣ Ei∼Γ(v)

[f(i)]− Ei∼R

[f(i)]∣∣∣ ≤ ε.

Here Γ(v) is the set of neighbors of v in G. The left-degree of G is called the degree of the sampler.

Observe that given a graph G as in Definition 5.1.1, the randomized algorithm that performs

the sampling process simply uses its randomness to select a vertex v ∈ L uniformly at random,

and then outputs the average Ei∼Γ(v) f(i).

Now that samplers have been defined, we show how they can be used to sparsify matrix

product or, more precisely, the product of the representations of the respective matrices. Let

180

Page 191: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

A = (A1, . . . ,A2s), B = (B1, . . . ,B2s) be as before. Recall, ∀i Ai,Bi are w× w stochastic matri-

ces. Given a left-regular bipartite graphG = ([2s], [2s],E)with degree 2d, define the sequence

A ◦G B = C = (Ci,j)i∈[2s],j∈[2d]

as follows: for i ∈ [2s] and j ∈ [2d], Ci,j = AiBΓ(i,j), where Γ(i, j) denotes the j’th neighbor of

vertex i inG. Note that Ci,j are all stochastic. In particular, they are 0-1 if Ai and Bj are 0-1 (that

is, all the entries are either 0 or 1). We now prove

Lemma 5.1.1. LetA and B be sequence of w× w stochastic matrices as defined above. Let G be as

above and 0 < ε, δ < 1. If G is an (ε, δ)-sampler then ∥⟨A ◦G B⟩ − ⟨A⟩⟨B⟩∥∞ ≤ w2(ε+ δ).

Proof. Note that

⟨C⟩ = Ei∼[2s],j∼[2d]

[Ci,j]= E

i∼[2s]

[Ai E

j∼Γ(i)Bj

].

Therefore, for every fixed α, β ∈ [w],

⟨C⟩α,β =w∑

γ=1

Ei∼[2s]

[(Ai)α,γ E

j∼Γ(i)(Bj)γ,β

].

For a fixed γ ∈ [w], consider the function fγ,β : [2s] → [0, 1] that is given by fγ,β(j) = (Bj)γ,β.

Note that the range of fγ,β is indeed [0, 1] as Bj are all stochastic matrices. Define

εγ,β(i) = Ej∼Γ(i)

[fγ,β(j)

]− ⟨B⟩γ,β.

Informally, as ⟨B⟩γ,β = Ej∼[2s][fγ,β(j)

], the quantity εγ,β(i)measures the quality of the approxi-

mation for the function fγ,β from thepoint of viewof vertex i, that is, when the points are sampled

using the neighborhood of i. We have that

⟨C⟩α,β =w∑

γ=1

Ei∼[2s]

[(Ai)α,γ(⟨B⟩γ,β + εγ,β(i))

]

181

Page 192: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

=w∑

γ=1

⟨A⟩α,γ⟨B⟩γ,β +w∑

γ=1

Ei∼[2s]

[(Ai)α,γεγ,β(i)

]= (⟨A⟩⟨B⟩)α,β +

w∑γ=1

Ei∼[2s]

[(Ai)α,γεγ,β(i)

].

As Ai are all stochastic, for every i ∈ [2s]we have that

∣∣⟨C⟩α,β − (⟨A⟩⟨B⟩)α,β∣∣ ≤ w∑

γ=1

Ei∼[2s]

[∣∣(Ai)α,γ∣∣ ∣∣εγ,β(i)∣∣]

≤w∑

γ=1

Ei∼[2s]

∣∣εγ,β(i)∣∣.AsG is an (ε, δ)-sampler, for all γ, for all but δ-fraction of i ∈ [2s], it holds that |εγ,βs(i)| ≤ ε and

so ∣∣⟨C⟩α,β − (⟨A⟩⟨B⟩)α,β∣∣ ≤ w(ε+ δ).

The lemma follows as the above bound holds for every α, β.

Equation (5.1) and Lemma 5.1.1 readily imply that if ⟨A⟩ is an εA-approximation for some

matrix A of interest and ⟨B⟩ εB-approximates B then

∥⟨A ◦G B⟩ − AB∥∞ ≤ ∥⟨A ◦G B⟩ − ⟨A⟩⟨B⟩∥∞ + ∥⟨A⟩⟨B⟩ − AB∥∞

≤ εA + εB + w2(ε+ δ). (5.2)

Thus, one pays an additional error of w2(ε + δ) in the resulting approximation, compared to

taking the actual product, when using the sparsified product parameterized by the (ε, δ)-sampler

G. The advantage, however, is that now the expectation is over way less that 22s terms as indeed

A ◦G B is a sequence of length 2s+d (rather than 22s), where 2d is the degree of the sampler.

It is now a question of how the degree of a sampler relates to the parameters ε, δ. It turns

out that, based on expander graphs and, in particular, Ramanujan graphs, one can construct an

(ε, δ)-sampler with degreeO(ε−2δ−1) [GW97]. As ε, δ play the same role in the bound that was

182

Page 193: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

derived in Lemma 5.1.1 and the degree has roughly the same dependence on ε, δ (none of which

is the case in the main construction as discussed in Section 5.1.5) we set ε = δ and consider a

sampler of degreeO(δ−3) to illustrate Nisan’s construction.

Going from 2 to nmatrices

To approximate the product of n stochastic matrices (M1,M2, ...,Mn), one can apply recur-

sion. LetM[i,j] represent the product ofmatricesMi,Mi+1, ...Mj. Let A be an approximation of

M[1,2r] andBbe an approximationofM[2r+1,22r], then ⟨A ◦G B⟩ gives an approximationofM[1,22r].

We iterate the process for log(n) levels to get an approximation of product ofnmatrices. Ifwe de-

note the approximation guarantee for multiplying 2r matrices by ε(r) then Equation (5.2) yields

the recursive relation ε(r) = 2ε(r − 1) + 2w2δ, and so ε(r) = O(2rw2δ). Further, if one de-

notes by s(r) the complexity of the representation at level r, that is the expectation is over 2s(r)

terms, one has s(r) = s(r− 1) +O(log(1/δ)), yielding s(r) = O(r log(1/δ)). If ε′ is the approx-

imation guarantee one is aiming for, one must set δ = O(2−rε′/w2) which yields complexity

s(r) = O(r2 + r log(w/ε′)). Plugging r = log n, the depth of the recursion, we rederive Nisan’s

result, namely, the seed length of the respective PRG, is O(log n · log(nw/ε)). The explicitness

of the PRG (computability in at least O(log n · log(nw/ε)) space) follows from the log-space

computability (log in the size of the bipartite graph of the sampler) of the neighbthe newhood

function of the sampler.

We remark that by using the samplers that are constructed via expander graphs, the construc-

tion above is in fact exactly the one introduced in [INW94], though the analysis is conceptually

different. Building on the notations and ideas presented so far, in the following section we sig-

nificantly deviate from existing ideas and start to describe the new construction.

Before proceeding further, we observe that from the way in which one sparsifies matrix prod-

uct, it is possible to obtain a description of the pseudorandom distribution or, equivalently, the

PRG. Thus, throughout the chapter we only consider sparsifying matrix products and do not

explicitly define the induced pseudorandom pseudo-distribution for that matter. We find this

183

Page 194: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

point of view far more suitable for the construction and its analysis. We point out that the con-

struction stated space complexity follows from the space complexity guaranteed by the samplers

that we use. We elaborate more on the corresponding PRPG in Section 5.8.2.

5.1.3 Delta of samplers–a preliminary discussion

By inspecting the construction from the previous section, one can see that the reason the seed

length ended up beingO(log n · log(nw/ε)) is that we had to set δ so low so as to guarantee that

the accumulation of errors from all n products will not exceed ε. The main conceptual novelty

of the new construction is in working with differences, or delta, of samplers. We motivate this

reasoning in the following informal discussion.

Assume, as before, that A = (A1, . . . ,A2s) and B = (B1, . . . ,B2s) are sequences such that

⟨A⟩, ⟨B⟩ are ε-approximations for some matrices of interest A, B, respectively. For an integer d,

letGd = ([2s], [2s],Ed) be an (ε, δ)-sampler set with ε = δ = 2−d. Recall that the degree ofGd is

2O(d). In the previous section, we used an expensive choice of d = O(log(wn/ε)) ≜ k. Instead,

let’s try to “break down” the matrix that is realized by this expensive product by suggestively

writing ⟨A ◦Gk B⟩ as

⟨A ◦Gk B⟩ =⟨A ◦Gg B⟩+

⟨A ◦G2g B⟩ − ⟨A ◦Gg B⟩+

⟨A ◦G4g B⟩ − ⟨A ◦G2g B⟩+...

⟨A ◦Gk B⟩ − ⟨A ◦Gk/2 B⟩, (5.3)

where g ≪ k is some parameter such that k/g is conveniently a power of two. Consider now a

summand in this telescopic sum, say, ⟨A ◦G2g B⟩−⟨A ◦Gg B⟩. We are going to define a newmulti-

plication rule betweenmatrix representations (that doesn’t approximate the product), which for

184

Page 195: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

now we denote by ◦G2g−Gg3, that has the following three properties:

Property 1 (Linearity). First, the product is linear with respect to the samplers by which it is

parameterized, namely,

⟨A ◦G2g−Gg B⟩ = ⟨A ◦G2g B⟩ − ⟨A ◦Gg B⟩.

That is, the matrix that is realized by the new product gives the desired difference.

Property 2 (Smallness is stored). The resulted object, A ◦G2g−Gg B, has “smallness” g and,more

generally, for integers D > d, A ◦GD−Gd B has smallness d such that: if one considers the

product

(A ◦GD−Gd B) ◦GD′−Gd′C

for some matrix representation C, the smallness of the product is d + d′. That is, small-

ness is being stored in the matrix representation and then added back when taking future

products. In fact, the product will also inherit the smallness of the right operand. That is,

(A ◦GD−Gd B) ◦GD′−Gd′

(C ◦GD′′−Gd′′

D)

has smallness d+ d′ + d′′ 4, and so forth.

Property 3 (Smallness implies small norm). If A has smallness s then ∥A∥∞ ≤ 2−Ω(s).

Using the new product rule, an instructive way of thinking of Equation (5.3) is by rewriting it as

⟨A ◦Gk B⟩ = ⟨A ◦Gg B⟩+ ⟨A ◦G2g−Gg B⟩+ ⟨A ◦G4g−G2g B⟩+ . . .+ ⟨A ◦Gk−Gk/2 B⟩, (5.4)

and thinking of A ◦Gg B as a “rough approximation” of the product we care about (rough since

g ≪ k), which have 0 smallness. The object A ◦G2g−Gg B is the first “correction term” having3Use of delta of samplers in the approximation is the reasonwe get a pseudorandompseudo-distribution instead

of a PRG.4We don’t achieve exactly the sum but with the right parameters, the smallness is effectively this.

185

Page 196: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

smallness g, A ◦G4g−G2g B the second correction term having smallness 2g, and so forth.

In general, for D > d, the representation A ◦GD−Gd B is going to “cost” D addition to the

number of terms in the “expectation” and have smallness d. SettingD = 2d, as we did in Equa-

tion (5.3), guarantees that in some intuitive sense, up to a constant factor, what is being paid

for is invested. Thus, as you increase the number of terms in the sparsification of the product,

the terms become smaller and this investment by using a sampler does not go to waste as it is

somehow stored as smallness in the object. And, matrices with large smallness can be discarded

without much affect on the total error. Thus, intuitively, we get a O(log(1/ε)) dependence as

any more investment makes the terms small enough to be discarded in the process of recursive

sparsification.

5.1.4 How to “store” smallness

In the construction presented in Section 5.1.2, a matrix was represented by a “one dimensional”

sequence A = (A1, . . . ,A2s) of w × w stochastic matrices, and the matrix that was represented,

or realized, by this representation was defined by ⟨A⟩ = Ei[Ai]. In order to “store” smallness, we

first need to devise a more subtle representation of matrices. This will require a fair amount of

preparation, and such representation is given in Section 5.1.7. To begin, in this section we define

the notions of matrix bundles, matrix bundle sequences, and smallness, somewhat informally.

Definition 5.1.2 (Matrix bundles: Restatement of Definition 5.4.1). For an integer ℓ ≥ 0, an

ℓ-matrix bundle A is a sequence

A = ((α1,A1), . . . , (α2ℓ ,A2ℓ)),

where the αi’s are real numbers (that are not necessarily bounded, and can take both positive and

negative values) and the Ai’s are w × w stochastic matrices5. The matrix that is realized by A is

defined by ⟨A⟩ =∑2ℓ

i=1 αiAi. We extend any matrix norm ∥ · ∥ to matrix bundles by letting5For the purpose of derandomizing ROBPs or randomized log-space, all Ai’s are 0-1 matrices.

186

Page 197: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

∥A∥ = ∥⟨A⟩∥. We refer to the numbers α1, . . . , α2ℓ as the coefficients ofA.

Definition 5.1.3 (Matrix bundle sequences (MBSs): Restatement of Definition 5.4.3). Let

dout, din ≥ 0 be integers. A (dout, din)-matrix bundle sequence (MBS) A is a sequence of 2dout

number of din-matrix bundlesA = (A1, . . . ,A2dout ). The matrix that is realized byA is defined

by ⟨A⟩ = Ei∼[2dout ] ⟨Ai⟩.We extend any matrix norm ∥ · ∥ to MBSs by letting ∥A∥ = ∥⟨A⟩∥.

We refer to the union of the coefficients ofA1, . . . ,A2dout as the coefficients ofA.

Amatrix bundle sequence is not going tobe thefinal representationof amatrix in the construc-

tion but rather it will be used to represent a “piece” of the matrix with some smallness, alluded

to in the above discussion as a correction term or the first rough approximation term. Before

presenting the final representation, we need to understand MBSs better. We start by giving the

formal definition of “smallness”, which we already informally discussed above. In the following

section, we define multiplication rules for MBSs and show their interplay with smallness.

Definition 5.1.4 (Smallness: Restatement of Definition 5.4.5). LetA = (A1, . . . ,A2dout ) be a

(dout, din)-MBS. The smallness ofA, denoted by σ(A), is defined by

σ(A) = − log2 Ei∼[2dout ]

∥Ai∥2∞.

It is straightforward to show that if σ(A) ≤ s then ∥A∥∞ ≤ 2s/2 (see Claim 5.4.1). Thus, if

anMBS has a sufficiently large smallness, it can be discarded with low cost in error.

5.1.5 Multiplication rules forMBSs

In Section 5.1.2, we defined the multiplication rule ◦G between “one-dimensional” sequence

of matrices. We now turn to define a multiplication rule for MBSs. In fact, we are going to

introduce two types of multiplication rules which we refer to as outer-multiplication and inner-

multiplication (for the actual construction, we need to consider four new multiplication rules

as we need to worry about the order in which we multiply matrices. However, in this informal

proof overview, we allow ourselves to be somewhat informal regarding this point). We define

187

Page 198: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

these multiplication rules to ensure that the smallness is stored while keeping the number of the

matrices in the representation in check. The outer-multiplication is an extension of the mul-

tiplication rule used in Section 5.1.2 whereas the inner-multiplication is carefully engineered to

workwith smallness. In the next section, we describe how themultiplication rule is definedwhen

parameterized by a delta of samplers.

For the description of both multiplication rules, let A be a (dout(A), din(A))-MBS and B a

(dout(B), din(B))-MBS. Let G = ([2dout(A)], [2dout(B)],E) be a left-regular bipartite graph with

left-degree 2d. Note thatGmay be unbalanced. Indeed, the flexibility of working with samplers

for which dout(A) ≫ dout(B) is pivotal for the new construction.

The outer-multiplication denoted by A◦G B, is the (dout(A) + d, din(A) + din(B))-

MBS C = (Ci,j)i∈[2dout(A)],j∈[2d] that is defined as follows. For every i ∈ [2dout(A)] and j ∈ [2d],

Ci,j is the (din(A) + din(B))-matrix bundle that is obtained by taking all products of matrices,

and of the respective coefficients, from the matrix bundles Ai and BΓ(i,j) (the formal definition is

given in Definition 5.5.1). Note that for every i, j,

⟨Ci,j⟩ = ⟨Ai⟩⟨BΓ(i,j)⟩.

The inner-multiplication, denoted by A •G B, is a (dout(A), din(A) + din(B) + d)-MBS C = (C_i)_{i∈[2^dout(A)]}, where C_i is the (din(A) + din(B) + d)-matrix bundle that is obtained by taking the product of all matrices in the matrix bundle A_i with all the matrices in all of the matrix bundles in {B_j | j ∈ Γ(i)}, where the respective coefficients are multiplied accordingly and then divided by 2^d to yield

⟨C_i⟩ = ⟨A_i⟩ E_{j∼Γ(i)} ⟨B_j⟩.

The formal definition is given in Definition 5.5.3. Note that when applying the outer-

multiplication, we pay the degree of the sampler in dout, whereas the inner-multiplication in-

creases din by the degree. The fact that we need to normalize by 2−d is one reason we need the


flexibility of maintaining arbitrary coefficients in the definition of matrix bundles.
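The following Python sketch is purely illustrative (names and graph encoding are ours). It applies the two multiplication rules only at the level of realized matrices, treating each bundle as a single stochastic matrix, and shows that with the complete bipartite graph the product is exact while a sparse left-regular graph gives only an approximation.

```python
# Illustrative sketch: outer- and inner-multiplication at the level of realized matrices.
# Gamma[i] lists the right-vertices adjacent to left-vertex i (left-regular graph).
import numpy as np

def inner_realization(A_mats, B_mats, Gamma):
    """<A .G B> = E_i [ <A_i> * E_{j ~ Gamma(i)} <B_j> ]  (inner-multiplication)."""
    terms = []
    for i, Ai in enumerate(A_mats):
        Bavg = sum(B_mats[j] for j in Gamma[i]) / len(Gamma[i])
        terms.append(Ai @ Bavg)
    return sum(terms) / len(terms)

def outer_realization(A_mats, B_mats, Gamma):
    """<A oG B> = E_{i,j} [ <A_i> <B_{Gamma(i,j)}> ]; same realized matrix as the inner rule."""
    terms = [A_mats[i] @ B_mats[j] for i in range(len(A_mats)) for j in Gamma[i]]
    return sum(terms) / len(terms)

rng = np.random.default_rng(0)
def rand_stochastic(w):
    M = rng.random((w, w)); return M / M.sum(axis=1, keepdims=True)

w = 3
A_mats = [rand_stochastic(w) for _ in range(4)]
B_mats = [rand_stochastic(w) for _ in range(4)]
K = [list(range(4)) for _ in range(4)]     # complete bipartite graph: exact product
G = [[0, 1], [1, 2], [2, 3], [3, 0]]       # a sparse degree-2 left-regular graph

A_avg, B_avg = sum(A_mats) / 4, sum(B_mats) / 4
assert np.allclose(inner_realization(A_mats, B_mats, K), A_avg @ B_avg)
print(np.abs(inner_realization(A_mats, B_mats, G) - A_avg @ B_avg).max())
```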

By adapting the proof of Lemma 5.1.1, we can prove that both the inner and outer multipli-

cation rules, when parameterized by a good sampler, approximate the product.

Lemma 5.1.2. [Idealized] Let 0 < ε, δ < 1. Let A and B be MBSs as defined above. If G is an (ε, δ)-sampler then

∥⟨A ◦G B⟩ − ⟨A⟩⟨B⟩∥_∞ ≤ w^2(ε + δ),
∥⟨A •G B⟩ − ⟨A⟩⟨B⟩∥_∞ ≤ w^2(ε + δ).

For a formal statement and its proof see Lemma 5.5.3. The key property of the multiplication

rule •G is that it preserves the smallness of both MBSs it operates on, when parameterized with

a good enough sampler G. The following lemma is an idealized version of an assertion we can

actually make (see Lemma 5.5.4 for the formal statement).

Lemma 5.1.3. [Idealized] Let A be a (dout(A), din(A))-MBS and let B be a (dout(B), din(B))-MBS. Let 0 < ε, δ < 1. Let G = ([2^dout(A)], [2^dout(B)], E) be an (ε, δ)-sampler with

ε ≤ 2^{−σ(B)},
δ ≤ 2^{−σ(A)−σ(B)}.   (5.5)

Then, σ(A •G B) ≥ σ(A) + σ(B).

We prove a weaker variant of Lemma 5.1.3 at the end of Section 5.1.5. Before that, we give some remarks on the asymmetry between the roles that ε and δ play in the lemma, and discuss unbalanced samplers.

Unbalanced samplers and the asymmetry between ε and δ

Lemma 5.1.3 states that the smallnesses ofA,B are completely preserved, or “stored” inA•G B,

as long as the sampler G has good enough parameters. Note the asymmetry between ε and δ.


Indeed, while δ is required to be taken exponentially small in the sum σ(A) + σ(B), ε only needs

to be exponentially small in σ(B). This may allow for a significant saving in cases where σ(A) ≫

σ(B). However, the sampler used above has degree poly(1/ε, 1/δ) and thus cannot exploit this

saving. In fact, if one considers only balanced samplers, namely, samplers G = (L,R,E) with

|L| = |R|, then a polynomial dependence of the degree on 1/ε and 1/δ is necessary. We are

therefore led to consider unbalanced samplers.

As it turns out, unbalanced samplers are equivalent to seeded extractors [Zuc97], and the state

of the art construction of unbalanced samplers is obtained by seeded extractors. In particular,

for all integers ℓ, r and 0 < δ < 1 such that ℓ ≥ r/δ2 there exists an explicit (ε, δ)-sampler G =

([ℓ], [r],E) with degree poly(1/ε, log(1/δ)) (see Theorem 5.2.2). That is, if the ratio between

the sides of the sampler is large enough, the degree of the sampler has an exponentially better

dependence on δ thanwhat can be obtained by using balanced samplers. Thus, roughly speaking,

by working with unbalanced samplers, Lemma 5.1.3 tells us that we gain the sum of smallnesses

σ(A) + σ(B) by paying roughlymin(σ(A), σ(B)) in the degree.

Proof of a weaker version of Lemma 5.1.3

Next, we give a proof for a weaker version of Lemma 5.1.3. We give the proof for a relaxed setting in which the matrix bundles A_i, B_j that compose A, B are of bounded norm; in particular, ∥A_i∥_∞ and ∥B_j∥_∞ are all bounded by 1. Moreover, we will not prove a bound as strong as stated above for the smallness σ(A •G B). Instead, we prove that σ(A •G B) ≥ σ(A) + σ(B) − 2. In fact, even in the formal proof we cannot give a bound of σ(A) + σ(B), though it will be crucial to give a bound of the form σ(A) + σ(B) − τ for some suitable slowly growing function τ = o(1).

Proof of Lemma 5.1.3. Write C = A •G B = (C_i)_{i=1}^{2^dout(A)}. For i ∈ [2^dout(A)], define

ε(i) = E_{j∼Γ(i)} ∥B_j∥_∞^2 − 2^{−σ(B)}.

As G is an (ε, δ)-sampler, and since we assume that for all j ∈ [2^dout(B)], ∥B_j∥_∞ ≤ 1, there exists a set S ⊆ [2^dout(A)] of size |S| ≥ (1 − δ)2^dout(A) such that for every i ∈ S, |ε(i)| ≤ ε. Recall that for every i ∈ [2^dout(A)],

⟨C_i⟩ = ⟨A_i⟩ E_{j∼Γ(i)} ⟨B_j⟩.

By Jensen's inequality and since ∥ · ∥_∞ is sub-multiplicative (and sub-additive),

2^{−σ(C)} = E_i ∥C_i∥_∞^2 = E_i ∥⟨A_i⟩ E_{j∼Γ(i)} ⟨B_j⟩∥_∞^2 ≤ E_i [∥A_i∥_∞^2 E_{j∼Γ(i)} ∥B_j∥_∞^2].

Thus,

2^{−σ(C)} ≤ E_i [∥A_i∥_∞^2 (2^{−σ(B)} + ε(i))] = 2^{−σ(A)−σ(B)} + E_i [∥A_i∥_∞^2 ε(i)].   (5.6)

As we assume ∥A_i∥_∞^2 ≤ 1 and since |ε(i)| ≤ 1 for all i ∈ [2^dout(A)],

E_i [∥A_i∥_∞^2 ε(i)] ≤ E_i [∥A_i∥_∞^2 ε(i) | i ∈ S] + Pr[i ∉ S] ≤ ε · E_i [∥A_i∥_∞^2 | i ∈ S] + δ.

Since we might as well assume δ ≤ 1/2, we have that Pr[i ∈ S] ≥ 1 − δ ≥ 1/2, and so

E_i [∥A_i∥_∞^2 | i ∈ S] ≤ E_i [∥A_i∥_∞^2] / Pr[i ∈ S] ≤ 2^{−σ(A)+1}.

Hence, E_i [∥A_i∥_∞^2 ε(i)] ≤ 2ε · 2^{−σ(A)} + δ. Plugging this into Equation (5.6), we get

2^{−σ(C)} ≤ 2^{−σ(A)−σ(B)} + 2ε · 2^{−σ(A)} + δ.


Substituting for ε, δ, we conclude that σ(C) ≥ σ(A) + σ(B)− 2, as desired.

5.1.6 Multiplication parameterized by a delta of samplers

Now that MBSs and the two multiplication rules are in place, we are ready to define a multi-

plication rule that is parameterized by a delta of samplers. Assume, as in the previous section,

that A is a (dout(A), din(A))-MBS and B is a (dout(B), din(B))-MBS. Let D > d be inte-

gers. Let GD = ([2dout(A)], [2dout(B)],ED) be a left-regular bipartite graph with left-degree 2D

andGd = ([2dout(A)], [2dout(B)],Ed) a left-regular bipartite graph with left-degree 2d.

Write A •GD B = C⁺ = (C⁺_i)_{i∈[2^dout(A)]} and A •Gd B = C⁻ = (C⁻_i)_{i∈[2^dout(A)]}. We define A •GD−Gd B to be the sequence (C_i)_{i∈[2^dout(A)]} where C_i is the concatenation of the matrix bundle C⁺_i with −C⁻_i, where by the leading minus sign we mean that one negates all coefficients in C⁻_i. The formal definition is given in Definition 5.5.5. It is easy to see that

⟨A •GD−Gd B⟩ = ⟨A •GD B⟩ − ⟨A •Gd B⟩,

a property that we refer to as the linearity of •. Further, note that 2^{din(C)} = 2^{din(A)+din(B)}(2^D + 2^d). Thus, as D ≥ d, we have din(C) ≤ din(A) + din(B) + D + 1. We remark that the relaxation of using negative numbers in the definition of pseudo-distributions is required so as to allow taking a delta of samplers.
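The following Python sketch (data layout and names ours, purely illustrative) implements the delta-of-samplers product at the bundle level by concatenating the bundle built from G_D with the bundle built from G_d whose coefficients are negated, and checks the linearity property numerically on a toy instance.

```python
# Illustrative sketch: the product A .{G_D - G_d} B and its linearity.
import numpy as np

realize = lambda bundle: sum(a * M for a, M in bundle)
realize_mbs = lambda mbs: sum(realize(b) for b in mbs) / len(mbs)

def inner(A, B, Gamma):
    """A .G B: C_i holds all products, with coefficients scaled by 1/deg."""
    deg = len(Gamma[0])
    return [[(a * b / deg, Ak @ Bl)
             for j in Gamma[i] for a, Ak in Ai for b, Bl in B[j]]
            for i, Ai in enumerate(A)]

def inner_delta(A, B, G_D, G_d):
    CD, Cd = inner(A, B, G_D), inner(A, B, G_d)
    # concatenate the G_D bundle with the negated G_d bundle
    return [CD[i] + [(-a, M) for a, M in Cd[i]] for i in range(len(A))]

rng = np.random.default_rng(1)
stoch = lambda w: (lambda M: M / M.sum(axis=1, keepdims=True))(rng.random((w, w)))
A = [[(1.0, stoch(2))] for _ in range(4)]
B = [[(1.0, stoch(2))] for _ in range(4)]
G_D = [[0, 1, 2, 3]] * 4                     # degree 4, plays the role of the better sampler
G_d = [[i, (i + 1) % 4] for i in range(4)]   # degree 2

lhs = realize_mbs(inner_delta(A, B, G_D, G_d))
rhs = realize_mbs(inner(A, B, G_D)) - realize_mbs(inner(A, B, G_d))
assert np.allclose(lhs, rhs)   # <A .{G_D - G_d} B> = <A .{G_D} B> - <A .{G_d} B>
```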

The smallness ofA•GD−Gd B is analyzed in the following lemma, which, again, is an idealized

version of an assertion we can actually make (see Lemma 5.5.6).

Lemma 5.1.4. [Idealized] LetA,B beMBSs as above. Let G1 = ([2dout(A)], [2dout(B)],E1) be an

(ε1, δ1)-sampler and G2 = ([2dout(A)], [2dout(B)],E2) an (ε2, δ2)-sampler. Assume that 0 < ε1 ≤

ε2 < 1 and 0 < δ1 ≤ δ2 < 1. Then,

σ(A•G1−G2 B) ≥ min (log(1/ε2) + σ(A), log(1/δ2)) .

Lemma 5.1.4 states that the smallness of the product grows with the parameters of the weaker


(ε2, δ2)-sampler. As in Lemma 5.1.3, the parameter ε2, which is exponentially more “expensive”

than δ2 in terms of degree (at least for unbalanced samplers) is being added to σ(A) and so can be

set much larger than δ2. Unlike Lemma 5.1.3, σ(A•G1−G2 B) can grow beyond σ(A) + σ(B) if

one takes a pair of good enough samplers. That is, the smallness of the product is not bounded

by the sum of smallnesses of the operands.

5.1.7 Matrix representations

We are finally ready to give a high level description of how a matrix is being represented by the

new construction and how to multiply two such matrix representations.

Definition 5.1.5 (Restatement of Definition 5.6.1). Let 1 ≤ k be an integer. A k-matrix representation is a sequence A = (A_0, . . . , A_k) where A_i is an MBS with σ(A_i) ≥ i. The matrix that is realized by A is defined by ⟨A⟩ = ∑_{i=0}^{k} ⟨A_i⟩.

Informally, one should think of 2^{−k} as the desired error guarantee. We think of A_0 as a rough approximation of the matrix of interest A. Let 1 ≤ g ≪ k be an integer such that the approximation is 2^{−g} rather than the desired 2^{−k}, that is, ∥⟨A_0⟩ − A∥_∞ ≤ 2^{−g}. The remaining MBSs are finer and finer correction terms. Adding them improves the approximation up to the point that ∥⟨A⟩ − A∥_∞ ≤ 2^{−k}. For the formal construction, we will need to weight the different MBSs, and these weights, which we ignore in this high-level description, are why we allow the ρ_i's in a pseudo-distribution to be unbounded (see Section 5.6).

We would like to define a multiplication rule between matrix representations that approxi-

mates the respective matrices. Assume that A = (A0, . . . ,Ak) and B = (B0, . . . ,Bk) are two

matrix representations. We are going to define a multiplication rule · for matrix representations

such that the matrix that is realized by the product A · B is a 2^{−Ω(k)}-approximation for ⟨A⟩⟨B⟩.

To describe the product, we start by writing

⟨A⟩⟨B⟩ = (∑_{i=0}^{k} ⟨A_i⟩)(∑_{j=0}^{k} ⟨B_j⟩) = ∑_{i,j=0}^{k} ⟨A_i⟩⟨B_j⟩.


Consider an expensive sampler, say an (ε, δ)-sampler G_k with ε = δ = 2^{−k}. By Idealized Lemma 5.1.3, for every i, j, we can 2^{−Ω(k)}-approximate ⟨A_i⟩⟨B_j⟩ by ⟨A_i •_{G_k} B_j⟩. Doing so, and adding the errors from all O(k^2) pairs (i, j), we get a total error of 2^{−Ω(k)}k^2 = 2^{−Ω(k)}. However, we do not want to pay for an expensive sampler in "one shot". Instead, for every pair i, j ∈ {0, 1, . . . , k}, consider the sequence of MBSs

A_i •_{G_d} B_j,  A_i •_{G_{2d}−G_d} B_j,  A_i •_{G_{4d}−G_{2d}} B_j,  . . . ,  A_i •_{G_k−G_{k/2}} B_j,   (5.7)

where the choice of d, and whether to use a balanced or an unbalanced sampler, depends on i, j and will be discussed later. Moreover, for some pairs i, j, we will use the outer-multiplication rule for some of the MBSs in the list. By Idealized Lemma 5.1.3, σ(A_i •_{G_d} B_j) ≥ i + j when d is taken sufficiently large. Further, by Idealized Lemma 5.1.4, σ(A_i •_{G_{2d}−G_d} B_j) ≥ i + j + d (the parameters are chosen such that the smallness is effectively this), and generally σ(A_i •_{G_{2^{r+1}d}−G_{2^r d}} B_j) ≥ i + j + 2^r d. That is, each MBS in the list has a certain smallness that we know how to bound from below.

Consider the collection of all MBSs obtained by considering the MBSs in Equation (5.7) for all i, j ∈ {0, . . . , k}. We denote this set of MBSs by F(A, B) (see Definition 5.7.1). To obtain the matrix representation C = A · B = (C_0, . . . , C_k), we collect MBSs from F(A, B) with a common smallness s (or, more precisely, MBSs for which the best lower bound we have on their smallness is s) and "glue" them to form the MBS C_s. We discard MBSs that have smallness larger than k. We glue MBSs by concatenating their sequences of matrix bundles and scaling the coefficients accordingly, to yield an MBS with a slightly larger dout (see Section 5.4.3).
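The following Python sketch is purely schematic bookkeeping under assumptions of ours (the increments are taken exactly at their idealized values and d is a fixed toy constant); it is not the actual parameter setting of Section 5.7. It only illustrates how the MBSs of the list (5.7), over all pairs (i, j), are bucketed by their smallness lower bound and how gluing a bucket increases dout only logarithmically in its size.

```python
# Schematic bookkeeping for the product of two matrix representations (illustrative only).
import math

def product_smallness_buckets(k, d=2):
    buckets = {}   # smallness lower bound s -> number of MBSs glued into C_s
    for i in range(k + 1):
        for j in range(k + 1):
            if i + j <= k:                    # first MBS of (5.7): A_i .{G_d} B_j, smallness >= i + j
                buckets[i + j] = buckets.get(i + j, 0) + 1
            step = d
            while i + j + step <= k:          # A_i .{G_{2^{r+1}d} - G_{2^r d}} B_j, smallness >= i + j + 2^r d
                s = i + j + step
                buckets[s] = buckets.get(s, 0) + 1
                step *= 2
    return buckets                            # MBSs with smallness > k are discarded

for s, cnt in sorted(product_smallness_buckets(k=8).items()):
    # Gluing cnt MBSs increases dout by only ~log2(cnt) (Claim 5.4.4).
    print(f"C_{s}: glue {cnt} MBSs, dout grows by ~{math.log2(cnt):.1f}")
```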

5.1.8 Leveled matrix representations and setting of parameters

In this section, we give further information regarding the multiplication rule between matrix representations discussed in the previous section. In particular, we left out details about how to set d as a function of i, j, and whether the multiplication is parameterized by a balanced or an unbalanced sampler. The way we set things is as follows. Let A = (A_0, A_g, . . . , A_k) be a k-matrix representation, where k will be chosen later on. We maintain the invariant that, apart from A_0, there are no MBSs in A with smallness less than g; the parameter g ≪ k will also be chosen later. We partition the latter sequence into levels. The first MBS, A_0, is in level 0. MBSs with smallness in [g, 2g) are in level 1; MBSs with smallness in [2g, 4g) are in level 2, and so forth. In fact, we are also required to maintain the invariant that all smallnesses are multiples of g. We do the same for a second k-matrix representation B = (B_0, B_g, . . . , B_k). For a formal treatment, see the definition of leveled matrix representations given in Section 5.6.

Consider now any i, j > 0. If A_i, B_j belong to the same level (implying that i/2 ≤ j ≤ 2i), we use the inner-multiplication rule to multiply A_i, B_j using balanced samplers. If i, j belong to different levels, we use unbalanced samplers instead. In all such cases we are going to set d = O(min(i, j)). Handling i = 0 or j = 0 is done similarly, using unbalanced samplers, but using the outer-multiplication rule for the first MBS in Equation (5.7). In such cases, d is set to O(g).

Every stochastic matrix in the construction corresponds to a path in the pseudo-distribution. As every MBS A_i in A consists of 2^{din(A_i)+dout(A_i)} such matrices, the total number of paths is ∑_{i=0}^{k} 2^{din(A_i)+dout(A_i)}. As din, dout are increasing functions of i, the seed length is dominated by din(A_k) + dout(A_k). We turn to analyze each of dout, din.

Analyzing dout. The unbalanced samplers are all set with δ = 2^{−Ω(k)}, and so we are required to maintain the invariant that the dout of MBSs increases in "jumps" of Ω(k) across levels. As the number of levels is logarithmic in k, this requires dout to be as large as k log k for MBSs with smallness k. The fact that we set d = g when using the outer-multiplication rule with i = 0 or j = 0 causes dout to further increase by g in every recursive level. As described before, we multiply n matrices recursively; for example, we multiply the first n/2 matrices and the last n/2 matrices separately and then multiply the outputs to get the product of n matrices. As we have log n recursive levels, the bound that we get on the maximum dout is O(k log k + g log n).

Analyzing din. By interleaving balanced and unbalanced samplers, we are able to maintain the invariant din(A_i) = O(i log i) throughout the recursion, independently of the level of the recursion. In particular, din of all MBSs is bounded by O(k log k) and is thus dominated by dout. To give some idea of why such a bound is obtained, note that for every i, j, the first MBS in Equation (5.7) has smallness i + j and for that we pay min(i, j) in din. For the remaining MBSs, paying min(i, j) in din credits one with a proportional smallness. Solving the respective recurrence relation gives the stated bound.

Setting k, g. So far, while we paid for choosing a large value of g in dout, the role of g in the analysis was not explained. Without getting into the technical details, the finer-grained error analysis that we conduct guarantees that at recursive level t, the total error is bounded above by

ε(t) = w · (k/g) · k^{t/g} · 2^{−k},

and so we set g ≈ log n · log k to yield ε(log n) = w · 2^{−Ω(k)}, and then k = Ω(log(w/ε)) to guarantee total error ε. For simplicity, set w = n. In such a case, k = O(log(n/ε)) and g = O(log n · log log(n/ε)). Plugging this into the bound on dout, we get a seed length of O(k log k + g log n) = O(log^2 n + log(1/ε)). To obtain the main result, which is in fact slightly stronger, we make a more careful setting of parameters.

5.2 Preliminaries

For ease of readability, we avoid the use of floor and ceiling. This does not affect the stated results.

For an integer n ≥ 1 we use Un to denote the uniform distribution over n-bit strings. Let b be a

boolean expression. We define the indicator 1b to be 1 if b holds and 0 otherwise. For an integer

n ≥ 1 we let [n] = {1, 2, . . . , n}. Let A ⊆ B be finite sets. We denote by μB(A) the density of

A within B, namely, μB(A) = |A|/|B|. Typically, B will be clear from context, in which case we

write μ(A).

Let G = (L,R,E) be a bipartite graph. We say G is left-regular if all nodes in L have the

same degree. IfG is left-regular with left-degree d and edges labeled by {1, . . . , d}, we define the

neighborhood function ΓG : L × [d] → R to be such that the i’th neighbor of node v ∈ L is


given by ΓG(v, i). We denote the set of neighbors of v by ΓG(v). If G is clear from context we

sometimes omit it from the subscript and simply write Γ(v, i) and Γ(v) for ΓG(v, i) and ΓG(v),

respectively.

5.2.1 Read-once branching programs, hitting sets, and pseudorandom distributions

In this section we recall basic definitions related to read-once branching programs from Chapter 1 (Section 1.2). Definition 5.2.1 below is slightly different from the informal definition that was used at the start of this chapter, though the two definitions can be easily shown to be equivalent.

Definition 5.2.1. Let n, w ≥ 1 be integers. An (n, w)-read-once branching program (ROBP for short) P is a directed graph on the vertex set V = {s} ∪ ⋃_{i=1}^{n} P_i, where the P_i's are disjoint sets of size w each. We refer to P_i as layer i of the program P. From every node, except those that belong to P_n, there are two outgoing edges, labeled by 0 and 1. The pair of edges from s ends in P_1 and, for every 1 ≤ i < n and v ∈ P_i, the pair of edges going out of v ends in nodes that belong to P_{i+1}. There are no edges leaving P_n. The node s is called the start node of the program P.

Given a string p ∈ {0, 1}^ℓ, with ℓ ≤ n, we denote by P(p) the node that is reached by traversing the ROBP P according to the path p, starting at the start node. The set of all (n, w)-ROBPs is denoted by P_{w,n}.

Given any distribution D on {0, 1}n, we use P(D) to denote the distribution over v ∈ Pn

when the input to P is distributed according toD.
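The Python sketch below is a toy encoding of Definition 5.2.1 (the class name and layout are ours, not part of the thesis): layer transitions are stored as arrays, P(p) is evaluated by following the labeled edges, and the distribution P(U_n) over the last layer is computed by brute force.

```python
# A small illustrative encoding of an (n, w)-ROBP and its evaluation.
import itertools
import random
from collections import Counter

class ROBP:
    def __init__(self, n, w, seed=0):
        rng = random.Random(seed)
        self.n, self.w = n, w
        # transitions[i][b][v] = node of layer i+1 reached from node v of layer i on bit b;
        # layer 0 holds only the start node s, so it has a single entry.
        self.transitions = [[[rng.randrange(w) for _ in range(1 if i == 0 else w)]
                             for b in range(2)] for i in range(n)]

    def __call__(self, p):
        """P(p): the node of layer len(p) reached by traversing p from the start node."""
        v = 0
        for i, bit in enumerate(p):
            v = self.transitions[i][bit][v]
        return v

P = ROBP(n=6, w=3)
# Pr[P(U_n) = v] by brute force over all 2^n inputs.
counts = Counter(P(p) for p in itertools.product((0, 1), repeat=P.n))
print({v: c / 2 ** P.n for v, c in counts.items()})
```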

Definition 5.2.2 (Hitting sets). A set {p_1, . . . , p_{2^s}} ⊆ {0, 1}^n is an (n, w, ε)-hitting set if for every P ∈ P_{w,n} and every node v ∈ P_n for which Pr[P(U_n) = v] ≥ ε, there exists j ∈ [2^s] such that P(p_j) = v.

It is sometimes convenient to address the function that generates the hitting set.

Definition 5.2.3 (Hitting set generators). A function HSG : {0, 1}^s → {0, 1}^n is an (n, w, ε)-hitting set generator (HSG for short) if the image of HSG is an (n, w, ε)-hitting set. We refer to the input of HSG as the seed. Note that 2^s is an upper bound on the size of the hitting set.

Definition 5.2.4 (Pseudorandom distributions). A distribution D over n-bit strings is an (n, w, ε)-pseudorandom distribution if for every P ∈ P_{w,n} and v ∈ P_n,

|Pr[P(U_n) = v] − Pr[P(D) = v]| ≤ ε.

Clearly, the support of every (n,w, ε)-pseudorandom distribution is an (n,w, ε′)-hitting set

for any ε′ > ε. As with hitting sets, it is sometimes convenient to address the function that

generates the pseudorandom distribution.

Definition 5.2.5 (Pseudorandom generators). A function PRG : {0, 1}s → {0, 1}n is an

(n,w, ε)-pseudorandom generator (PRG for short) if the distribution PRG(Us) is (n,w, ε)-

pseudorandom. We refer to the input of PRG as the seed.

5.2.2 Matrix norms

Throughout the chapter, we make use of two matrix norms. Let A be a w × w real matrix. Recall that the infinity norm of A is defined by ∥A∥_∞ = max_{i∈[w]} ∑_{j=1}^{w} |A_{i,j}|. The max norm of A is given by ∥A∥_max = max_{i,j∈[w]} |A_{i,j}|. We denote the set of w × w stochastic matrices by S_w. We make use of the following well-known, easy-to-verify facts:

Claim 5.2.1. Let A,B be w× w real matrices. Then,

• The norm ∥ · ∥∞ is sub multiplicative, namely, ∥AB∥∞ ≤ ∥A∥∞∥B∥∞.

• Both norms (by definition) are sub-additive, that is, ∥A + B∥∞ ≤ ∥A∥∞ + ∥B∥∞ and

∥A+ B∥max ≤ ∥A∥max + ∥B∥max.

• ∥A∥max ≤ ∥A∥∞ ≤ w∥A∥max.

• If A ∈ Sw then ∥A∥∞ = 1.
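A quick numerical sanity check of these facts (illustrative only; the helper names are ours) on random matrices:

```python
# Verifying the facts of Claim 5.2.1 numerically on random matrices.
import numpy as np

inf_norm = lambda M: np.abs(M).sum(axis=1).max()
max_norm = lambda M: np.abs(M).max()

rng = np.random.default_rng(4)
w = 5
A, B = rng.standard_normal((w, w)), rng.standard_normal((w, w))
S = rng.random((w, w)); S /= S.sum(axis=1, keepdims=True)        # a stochastic matrix

assert inf_norm(A @ B) <= inf_norm(A) * inf_norm(B) + 1e-9        # sub-multiplicativity
assert inf_norm(A + B) <= inf_norm(A) + inf_norm(B) + 1e-9        # sub-additivity
assert max_norm(A) <= inf_norm(A) <= w * max_norm(A) + 1e-9
assert abs(inf_norm(S) - 1.0) < 1e-9                              # stochastic => ||S||_inf = 1
```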


5.2.3 Samplers

Definition 5.2.6 ([BR94]). Let 0 < ε, δ < 1. A left-regular bipartite graph G = (L, R, E) is an (ε, δ)-sampler if for every function f : R → [0, 1], for all but a δ-fraction of vertices v ∈ L it holds that

|E_{i∼Γ(v)} [f(i)] − E_{i∼R} [f(i)]| ≤ ε.

The left-degree of G is called the degree of the sampler.

In many cases, the range of the function f, whose expectation we want to approximate, is not bounded to [0, 1]. We thus use the following easy claim.

Claim 5.2.2. Let m_1, m_2 ≥ 0 be real numbers. Let G = (L, R, E) be an (ε, δ)-sampler and f : R → [−m_1, m_2]. Then, for all but a δ-fraction of vertices v ∈ L,

|E_{i∼Γ(v)} [f(i)] − E_{i∼R} [f(i)]| ≤ ε(m_1 + m_2).
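The following brute-force checker (illustrative only; it tests a fixed list of functions and hence can refute, but never certify, the sampler property) makes Definition 5.2.6 concrete.

```python
# Illustrative brute-force check of the (eps, delta)-sampler condition for given test functions.
import random

def violating_fraction(Gamma, R_size, f, eps):
    """Fraction of left vertices whose neighborhood average of f deviates from E_R[f] by > eps."""
    mean_R = sum(f(i) for i in range(R_size)) / R_size
    bad = sum(1 for nbrs in Gamma
              if abs(sum(f(i) for i in nbrs) / len(nbrs) - mean_R) > eps)
    return bad / len(Gamma)

rng = random.Random(0)
L, R, d = 256, 64, 16
Gamma = [[rng.randrange(R) for _ in range(d)] for _ in range(L)]   # a random left-regular graph
# Test against a few random {0,1}-valued functions; a true sampler would need ALL f : R -> [0,1].
test_fs = [lambda i, S=frozenset(rng.sample(range(R), R // 2)): float(i in S) for _ in range(20)]
print(max(violating_fraction(Gamma, R, f, eps=0.2) for f in test_fs))
```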

For the construction of pseudorandom pseudo-distributions, we make use of two construc-

tions of samplers. The first has equal sides, namely, |L| = |R| whereas the second sampler has

better parameters, albeit, it requires |L| ≫ |R|. We refer to the first one, informally, as a balanced

sampler and to the second one as an unbalanced sampler. The constructions of these samplers

rely on expander graphs and seeded extractors, respectively, andwe refer the reader to the excellent

survey by Goldreich [Gol11] for more information.

Theorem 5.2.1 ([GW97]). For every integer n and all ε, δ > 0, there exists an (ε, δ)-sampler BSamp(n, ε, δ) = (L, R, E), with |L| = |R| = n, having degree d = O(δ^{−1}ε^{−2}).

Theorem 5.2.2 ([RVW01], Corollary 7.3)⁶. There exists a universal constant c ≥ 1 such that the following holds. For all ε, δ > 0 such that log(1/δ) > log(1/ε) · c^{log∗(1/δ)} and for all integers ℓ, r such that ℓ ≥ r/δ^2, there exists an (ε, δ)-sampler UBSamp(ℓ, r, ε, δ) = ([ℓ], [r], E) with degree d = ((1/ε) log(1/δ))^c.

⁶We note that there are several versions of the cited paper. The conference and journal versions do not contain the results we need, and so we cite the version posted on ECCC.

It can be shown that both samplers are log-space computable, namely, given i ∈ L and j ∈ [d],

the j’th neighbor of vertex i can be computed inO(log |L|) space (and in time poly log |L|). This

assertion is well-known for the sampler that is given by Theorem 5.2.1, whose construction is

based on expander graphs, as was used in [INW94]. The assertionwith respect to Theorem 5.2.2

is only implicit in the literature. The assertion can be shown to hold because the samplers are

obtained by composing expander graphs, hash functions and k-wise independent distributions

in simple ways (simple to compute, not to analyze).

Working with the parameters of the sampler given by Theorem 5.2.2 is cumbersome. Thus, for the sake of readability, we make use of the following sampler, which has parameters that are easier to work with. We stress that this sampler is not space-efficient. It is easy to verify that the result holds as is when using the space-efficient sampler that is given by Theorem 5.2.2. Indeed, the seed length of the new construction only deteriorates by a factor of 2^{O(log∗(nw/ε))}, which is then hidden under the O-notation. Further, the space complexity is linear in the seed length.

Theorem 5.2.3 ([Zuc07]). There exists a universal constant c_samp ≥ 1 such that the following holds. For all integers ℓ, r and all ε, δ > 0 for which ℓ ≥ r/δ^2, there exists an (ε, δ)-sampler UBSamp(ℓ, r, ε, δ) = ([ℓ], [r], E) with degree d = ((1/ε) · log(1/δ))^{c_samp}.

From here on, we suppress the sizes of the samplers n, ℓ, r and simply write BSamp(ε, δ) for the sampler that is given by Theorem 5.2.1 and UBSamp(ε, δ) for the sampler from Theorem 5.2.3.

5.3 Pseudorandom Pseudo-Distributions and Main Result

In this section we introduce the notion of a pseudorandom pseudo-distribution.

Definition 5.3.1 (Pseudorandom pseudo-distributions). Let ρ_1, . . . , ρ_{2^s} ∈ ℝ and p_1, . . . , p_{2^s} ∈ {0, 1}^n. The sequence D = ((ρ_1, p_1), . . . , (ρ_{2^s}, p_{2^s})) is an (n, w, ε)-pseudorandom pseudo-distribution if for every P ∈ P_{w,n} and v ∈ P_n,

|Pr[P(U_n) = v] − ∑_{i=1}^{2^s} ρ_i 1_{P(p_i)=v}| ≤ ε.

For a real number b ≥ 0, we say that D is b-bounded if |ρ_i| ≤ b for all i ∈ [2^s].

We refer to s as the seed length of the pseudorandom pseudo-distribution.

We observe that pseudo-distributions readily yield hitting sets.

Claim 5.3.1. Let ((ρ_1, p_1), . . . , (ρ_{2^s}, p_{2^s})) be an (n, w, ε)-pseudorandom pseudo-distribution. Then, for every ε′ > ε, the set {p_1, . . . , p_{2^s}} is an (n, w, ε′)-hitting set.

Proof. Let ε′ > ε be a real number. Let P ∈ P_{w,n} and consider v ∈ P_n for which Pr[P(U_n) = v] ≥ ε′. We have that

∑_{i=1}^{2^s} ρ_i 1_{P(p_i)=v} ≥ Pr[P(U_n) = v] − ε ≥ ε′ − ε > 0,

which readily implies the existence of g ∈ [2^s] such that P(p_g) = v.
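The Python sketch below (illustrative; it reuses the toy ROBP encoding from the sketch in Section 5.2.1 and is not part of the construction) evaluates the error of Definition 5.3.1 for a fixed end node by brute force, and extracts the hitting-set witness of Claim 5.3.1.

```python
# Illustrative: checking a pseudo-distribution against one ROBP and one end node.
import itertools

def pd_error(P, D, n, v):
    """|Pr[P(U_n) = v] - sum_i rho_i * 1[P(p_i) = v]|, with D a list of (rho, p) pairs."""
    exact = sum(1 for p in itertools.product((0, 1), repeat=n) if P(p) == v) / 2 ** n
    weighted = sum(rho for rho, p in D if P(p) == v)
    return abs(exact - weighted)

def hitting_witness(P, D, v):
    """Claim 5.3.1 in code: if the weighted sum over D is positive, some p_i reaches v."""
    for rho, p in D:
        if P(p) == v:
            return p
    return None
```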

We are now ready to give a formal statement of the main result.

Theorem 5.3.1 (Main result). For all integers n, w ≥ 1 and 0 < ε < 1/n, there exists an (n, w, ε)-pseudorandom pseudo-distribution D with seed length

d = O(log(n) log(nw) + log(1/ε)).

Furthermore, D is poly(w/ε)-bounded, and can be computed in space O(d).

Here, by computability in space s, we mean that the pseudorandom pseudo-distribution D is generated by a pseudorandom pseudo-generator (PRPG) which is computable in space s; that is, given the seed j ∈ [2^d] and the index i ∈ [n], the real number ρ_j and the i-th bit of the path p_j can be computed in O(s) space.


Remark regarding explicitness. Note that in the proof of Theorem 5.3.1, we use the unbalanced sampler that is given by Theorem 5.2.3, whose parameters are easy to work with, though its space complexity is high. By plugging in, instead, the space-efficient sampler that is given by Theorem 5.2.2, one can easily show that the seed length and space complexity are as stated. Indeed, the seed length of the new construction only deteriorates by a factor of 2^{O(log∗(nw/ε))} when using the space-efficient sampler from Theorem 5.2.2. This small loss is anyhow hidden under the O-notation. We choose to omit the cumbersome details, as this complicates the already involved proof.

5.4 Matrix Bundle Sequences

In this section, we formally define the notion of a matrix bundle sequence (MBS for short). Informally speaking, an MBS is a "piece" of a matrix that we are interested in. To represent a matrix we make use of several MBSs. An MBS has a property we call smallness that, informally, captures how small the piece is. This is somewhat analogous to the digits of a number when represented in a decimal expansion, where the location of a digit is analogous to its smallness. We start by defining matrix bundles.

5.4.1 Matrix bundles

Definition 5.4.1. Let ℓ ≥ 0, w ≥ 1 be integers. An (ℓ, w)-matrix bundle A is an element of (ℝ × S_w)^{2^ℓ}. Namely, A = ((α_1, A_1), . . . , (α_{2^ℓ}, A_{2^ℓ})), where the α_i's are real numbers (that are unbounded and can be negative) and the A_i's are w × w stochastic matrices⁷. The matrix that is realized by A is defined by ⟨A⟩ = ∑_{i=1}^{2^ℓ} α_i A_i. We extend any matrix norm ∥ · ∥ to matrix bundles by letting ∥A∥ = ∥⟨A⟩∥. We refer to the numbers α_1, . . . , α_{2^ℓ} as the coefficients of A.

⁷For the purpose of derandomizing ROBPs, think of each A_i as a matrix corresponding to a single path (of some length ≤ n), and thus a 0-1 matrix.

Next, we define the product of a scalar by a matrix bundle.


Definition 5.4.2. For a real number β and an (ℓ,w)-matrix bundle A =

((α1,A1), . . . , (α2ℓ ,A2ℓ)), we define β · A to be the (ℓ,w)-matrix bundle

((βα1,A1), . . . , (βα2ℓ ,A2ℓ)).We sometimes write βA instead of β · A. Note that ⟨βA⟩ = β⟨A⟩.

5.4.2 Matrix bundle sequences

Definition 5.4.3. Let dout, din ≥ 0 and w ≥ 1 be integers. A (dout, din, w)-matrix bundle sequence (MBS) A is a sequence of 2^dout many (din, w)-matrix bundles A = (A_1, . . . , A_{2^dout}). The matrix that is realized by A is defined by ⟨A⟩ = E_{i∼[2^dout]} ⟨A_i⟩. We extend any matrix norm ∥ · ∥ to MBSs by letting ∥A∥ = ∥⟨A⟩∥. We refer to the union of the coefficients of A_1, . . . , A_{2^dout} as the coefficients of A.

Definition 5.4.4. AnMBSA is called thin if din(A) = 0 and all coefficients ofA equal 1.

Definition 5.4.5. Let A = (A_1, . . . , A_{2^dout}) be a (dout, din, w)-MBS. The smallness of A, denoted by σ(A), is defined by

σ(A) = − log E_{i∼[2^dout]} ∥A_i∥_∞^2,

where recall that all logarithms in this chapter are to the base 2. The magnitude of A, denoted by μ(A), is defined by

μ(A) = log max_{i∈[2^dout]} ∥A_i∥_∞^2.

Claim 5.4.1. Let A = (A_1, . . . , A_{2^dout}) be a (dout, din, w)-MBS. Then, ∥A∥_∞ ≤ 2^{−σ(A)/2}.

Proof. By the sub-additivity of ∥ · ∥_∞,

∥A∥_∞ = ∥⟨A⟩∥_∞ = ∥E_i ⟨A_i⟩∥_∞ ≤ E_i ∥A_i∥_∞.

By Jensen's inequality,

(E_i ∥A_i∥_∞)^2 ≤ E_i ∥A_i∥_∞^2 = 2^{−σ(A)},

and so ∥A∥_∞^2 ≤ 2^{−σ(A)}, which completes the proof.


Remarks regarding the monotonicity of din, dout. Let A = (A_1, . . . , A_{2^dout}) be a (dout, din, w)-MBS. For any d′in ≥ din, one can consider the (dout, d′in, w)-MBS A′ = (A′_1, . . . , A′_{2^dout}) that is obtained by extending each of the (din, w)-matrix bundles A_i to a (d′in, w)-matrix bundle A′_i by appending 2^{d′in} − 2^{din} zero coefficients and arbitrary stochastic matrices. Note that ⟨A_i⟩ = ⟨A′_i⟩ and so this operation has no effect on the parameters of A other than din; in particular, σ(A′) = σ(A) and μ(A′) = μ(A). Therefore, using this padding argument, one can think of every (dout, din, w)-MBS as a (dout, d′in, w)-MBS with the same parameters for any d′in ≥ din.

Note that the same argument holds even if din is not an integer (this happens when we concatenate the matrix bundles of two MBSs with (din)_1 ≠ (din)_2, resulting in 2^{din} = 2^{(din)_1} + 2^{(din)_2}, which indeed is not a power of 2). In particular, we implicitly always round din up to an integer by using this padding argument.

Similarly, one can consider A to be a (d′out, din, w)-MBS for any d′out ≥ dout. This is because one can take the MBS A′′ with each A_i duplicated 2^{d′out−dout} times to form a sequence of length 2^{d′out}. Clearly, dout(A′′) = d′out and ⟨A⟩ = ⟨A′′⟩. Note that this transformation has no effect on din, μ, σ.

Definition 5.4.6. LetA = (A1, . . . ,A2dout ) be a (dout, din,w)-MBS. For a real number α ≥ 0,

define α · A, which we also write as αA, to be the (dout, din,w)-MBS (αA1, . . . , αA2dout ).

Claim 5.4.2. LetA be a (dout, din,w)-MBS and α > 0 a real number. Then,

• ⟨αA⟩ = α⟨A⟩;

• σ(αA) = σ(A) + 2 log(1/α);

• μ(αA) = μ(A)− 2 log(1/α).

Proof. The first item follows as

⟨αA⟩ = E_i ⟨αA_i⟩ = α E_i ⟨A_i⟩ = α⟨A⟩.

As for the second item,

2^{−σ(αA)} = E_i ∥αA_i∥_∞^2 = α^2 E_i ∥A_i∥_∞^2 = α^2 2^{−σ(A)},

and so σ(αA) = σ(A) + 2 log(1/α). As for the magnitude,

2^{μ(αA)} = max_i ∥αA_i∥_∞^2 = α^2 max_i ∥A_i∥_∞^2 = α^2 2^{μ(A)},

and so μ(αA) = μ(A) − 2 log(1/α).

5.4.3 Gluing MBSs

For the main construction, we will need to “glue” MBSs, namely, stack the matrix bundles that

compose two or more MBSs to one sequence. In this section, we formally define this operation

and analyze the resulting “glued”MBS. In the following definition, we assume that the twoMBSs

to be glued have the same dout. This is essentially without loss of generality as explained in the

remark in Section 5.4.2.

Definition 5.4.7. Let A = (A_1, . . . , A_{2^dout}), B = (B_1, . . . , B_{2^dout}) be a pair of (dout, din, w)-MBSs. We define the gluing of A and B, denoted by glue(A, B), to be the (dout + 1, din, w)-MBS C = (C_1, . . . , C_{2^{dout+1}}) that is defined by

C_i = A_i for i ∈ [1, 2^dout],  and  C_i = B_{i−2^dout} for i ∈ [2^dout + 1, 2^{dout+1}].

Claim 5.4.3. Let A = (A_1, . . . , A_{2^dout}), B = (B_1, . . . , B_{2^dout}) be a pair of (dout, din, w)-MBSs. Then,

⟨glue(A, B)⟩ = (⟨A⟩ + ⟨B⟩)/2.

Moreover,

σ(glue(A, B)) ≥ min(σ(A), σ(B)),
μ(glue(A, B)) ≤ max(μ(A), μ(B)).

Proof. We have that

⟨glue(A, B)⟩ = E_{i∼[2^{dout+1}]} ⟨C_i⟩ = (1/2^{dout+1}) (∑_{i=1}^{2^dout} ⟨A_i⟩ + ∑_{i=1}^{2^dout} ⟨B_i⟩) = (1/2)(E_{i∼[2^dout]} ⟨A_i⟩ + E_{i∼[2^dout]} ⟨B_i⟩) = (⟨A⟩ + ⟨B⟩)/2.

As for the smallness of glue(A, B),

2^{−σ(glue(A,B))} = E_{i∼[2^{dout+1}]} ∥C_i∥_∞^2 = (1/2^{dout+1}) (∑_{i=1}^{2^dout} ∥A_i∥_∞^2 + ∑_{i=1}^{2^dout} ∥B_i∥_∞^2) = (1/2)(2^{−σ(A)} + 2^{−σ(B)}) ≤ max(2^{−σ(A)}, 2^{−σ(B)}),

which implies that σ(glue(A, B)) ≥ min(σ(A), σ(B)), as claimed. The proof regarding the magnitude of glue(A, B) is straightforward, and so we omit it.

Generally, we may need to "glue" more than two MBSs. Let A_1, . . . , A_r be r (dout, din, w)-MBSs. We extend Definition 5.4.7 in the natural way to define the gluing of A_1, . . . , A_r, which we denote by glue(A_1, . . . , A_r). The following claim can be proved similarly to the way we proved Claim 5.4.3, and we omit the details.

Claim 5.4.4. Let r ≥ 1 be an integer. Let A_1, . . . , A_r be (dout, din, w)-MBSs. Let B = glue(A_1, . . . , A_r). Then, ⟨B⟩ = E_i ⟨A_i⟩. Moreover,

σ(B) ≥ min_i σ(A_i),
μ(B) ≤ max_i μ(A_i),
dout(B) = dout + log r,
din(B) = din.
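A toy sketch of gluing (illustrative only; it reuses the bundle/MBS encoding of the earlier sketches) and a numerical check of the realized-matrix and smallness claims:

```python
# Illustrative: gluing MBSs (Definition 5.4.7) and checking Claim 5.4.3.
import numpy as np

realize = lambda bundle: sum(a * M for a, M in bundle)
realize_mbs = lambda mbs: sum(realize(b) for b in mbs) / len(mbs)
inf_norm = lambda M: np.abs(M).sum(axis=1).max()
smallness = lambda mbs: -np.log2(np.mean([inf_norm(realize(b)) ** 2 for b in mbs]))

def glue(*mbss):
    """Stack the matrix bundles of several MBSs (all with the same dout) into one sequence."""
    return [b for mbs in mbss for b in mbs]

rng = np.random.default_rng(2)
stoch = lambda w: (lambda M: M / M.sum(axis=1, keepdims=True))(rng.random((w, w)))
A = [[(0.5, stoch(2))] for _ in range(4)]
B = [[(0.25, stoch(2))] for _ in range(4)]
C = glue(A, B)
assert np.allclose(realize_mbs(C), (realize_mbs(A) + realize_mbs(B)) / 2)   # <glue(A,B)> = (<A>+<B>)/2
assert smallness(C) >= min(smallness(A), smallness(B)) - 1e-9               # smallness is preserved
```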

5.5 Multiplication Rules for Matrix Bundle Sequences

In this section we define several multiplication rules for MBSs and analyze the products.

5.5.1 The multiplication rules→◦ ,←◦ parameterized by a sampler

Definition 5.5.1. Let A = (A_1, . . . , A_{2^dout(A)}) be a (dout(A), din(A), w)-MBS, where

A_i = (((α_i)_1, (A_i)_1), . . . , ((α_i)_{2^din(A)}, (A_i)_{2^din(A)})).

Let B = (B_1, . . . , B_{2^dout(B)}) be a (dout(B), din(B), w)-MBS, where

B_i = (((β_i)_1, (B_i)_1), . . . , ((β_i)_{2^din(B)}, (B_i)_{2^din(B)})).

Let G = ([2^dout(A)], [2^dout(B)], E) be a left-regular bipartite graph with left-degree 2^d. We define the (dout(A) + d, din(A) + din(B), w)-MBS C = A →◦G B as follows: C = (C_{i,j})_{i∈[2^dout(A)], j∈[2^d]}, where the (din(A) + din(B), w)-matrix bundle C_{i,j} is defined by

(C_{i,j})_{k,ℓ} = ((α_i)_k (β_{Γ_G(i,j)})_ℓ, (A_i)_k (B_{Γ_G(i,j)})_ℓ),

with k ∈ [2^din(A)], ℓ ∈ [2^din(B)].

Note that C is indeed an MBS as the product of the stochastic matrices (Ai)k, (BΓG(i,j))ℓ is

stochastic. Moreover, C has the dimensions that were claimed in the definition, namely, C is a

(dout(A) + d, din(A) + din(B),w)-MBS.

Claim 5.5.1. For every i ∈ [2^dout(A)], j ∈ [2^d], ⟨C_{i,j}⟩ = ⟨A_i⟩⟨B_{Γ_G(i,j)}⟩.

Proof. We have that

⟨C_{i,j}⟩ = ∑_{k,ℓ} (α_i)_k (β_{Γ_G(i,j)})_ℓ (A_i)_k (B_{Γ_G(i,j)})_ℓ = (∑_k (α_i)_k (A_i)_k)(∑_ℓ (β_{Γ_G(i,j)})_ℓ (B_{Γ_G(i,j)})_ℓ) = ⟨A_i⟩⟨B_{Γ_G(i,j)}⟩.

By Claim 5.5.1,

⟨A →◦G B⟩ = E_{i,j} [⟨A_i⟩⟨B_{Γ_G(i,j)}⟩] = E_i [⟨A_i⟩ E_{j∼Γ_G(i)} ⟨B_j⟩].   (5.8)

In particular, note that if K is the complete bipartite graph on [2^dout(A)] × [2^dout(B)] then

⟨A →◦K B⟩ = ⟨A⟩⟨B⟩.   (5.9)

Similarly to the definition of →◦, we define ←◦ as follows. Informally, the difference between →◦ and ←◦ is which of the two operands of the matrix product being sparsified indexes the left side of the bipartite left-regular graph. This choice depends on the parameters of the MBSs.

Definition 5.5.2. Let A = (A_1, . . . , A_{2^dout(A)}) be a (dout(A), din(A), w)-MBS, where

A_i = (((α_i)_1, (A_i)_1), . . . , ((α_i)_{2^din(A)}, (A_i)_{2^din(A)})).

Let B = (B_1, . . . , B_{2^dout(B)}) be a (dout(B), din(B), w)-MBS, where

B_i = (((β_i)_1, (B_i)_1), . . . , ((β_i)_{2^din(B)}, (B_i)_{2^din(B)})).

Let G = ([2^dout(A)], [2^dout(B)], E) be a left-regular bipartite graph with left-degree 2^d. We define the (dout(A) + d, din(A) + din(B), w)-MBS C = A ←◦G B as follows: C = (C_{i,j})_{i∈[2^dout(A)], j∈[2^d]}, where

(C_{i,j})_{k,ℓ} = ((α_i)_k (β_{Γ_G(i,j)})_ℓ, (B_{Γ_G(i,j)})_ℓ (A_i)_k),

with k ∈ [2^din(A)], ℓ ∈ [2^din(B)].

Similarly to Claim 5.5.1, we have the following.

Claim 5.5.2. ⟨C_{i,j}⟩ = ⟨B_{Γ_G(i,j)}⟩⟨A_i⟩.

Proof. We have that

⟨C_{i,j}⟩ = ∑_{k,ℓ} (α_i)_k (β_{Γ_G(i,j)})_ℓ (B_{Γ_G(i,j)})_ℓ (A_i)_k = (∑_ℓ (β_{Γ_G(i,j)})_ℓ (B_{Γ_G(i,j)})_ℓ)(∑_k (α_i)_k (A_i)_k) = ⟨B_{Γ_G(i,j)}⟩⟨A_i⟩.

By Claim 5.5.2,

⟨A ←◦G B⟩ = E_{i,j} [⟨B_{Γ_G(i,j)}⟩⟨A_i⟩] = E_i [(E_{j∼Γ_G(i)} ⟨B_j⟩) ⟨A_i⟩].   (5.10)

In particular, if K is the complete bipartite graph on [2^dout(A)] × [2^dout(B)] then

⟨A ←◦K B⟩ = ⟨B⟩⟨A⟩.   (5.11)


The following lemma relates the properties of the MBS A →◦G B to those of A, B. Throughout the chapter, we will only apply the product →◦ with the right operand being a thin MBS, and so we restrict ourselves to that case.

Lemma 5.5.1. Let A = (A_1, . . . , A_{2^dout(A)}) be a (dout(A), din(A), w)-MBS. Let B = (B_1, . . . , B_{2^dout(B)}) be a (dout(B), 0, w)-thin MBS. Let G = ([2^dout(A)], [2^dout(B)], E) be a left-regular bipartite graph with left-degree 2^d. Then,

σ(A →◦G B) ≥ σ(A),
μ(A →◦G B) ≤ μ(A),
dout(A →◦G B) = dout(A) + d,
din(A →◦G B) = din(A).

Proof. The assertions regarding din, dout follow by the definition of →◦ and by din(B) = 0. Write C = A →◦G B and let Γ : [2^dout(A)] × [2^d] → [2^dout(B)] be the neighborhood function of G. By Claim 5.5.1, ⟨C_{i,j}⟩ = ⟨A_i⟩⟨B_{Γ(i,j)}⟩. As ∥ · ∥_∞ is sub-multiplicative and since ⟨B_{Γ(i,j)}⟩ is stochastic (due to B's thinness),

∥C_{i,j}∥_∞ = ∥⟨A_i⟩⟨B_{Γ(i,j)}⟩∥_∞ ≤ ∥A_i∥_∞ ∥B_{Γ(i,j)}∥_∞ = ∥A_i∥_∞.

This proves that μ(C) ≤ μ(A). As for the smallness,

2^{−σ(C)} = E_{i,j} ∥C_{i,j}∥_∞^2 ≤ E_i ∥A_i∥_∞^2 = 2^{−σ(A)}.

The proof of Lemma 5.5.1, which considers the product →◦, can be adapted to prove the same result for ←◦. We summarize this in the following lemma.

Lemma 5.5.2. Let A = (A_1, . . . , A_{2^dout(A)}) be a (dout(A), din(A), w)-MBS. Let B = (B_1, . . . , B_{2^dout(B)}) be a (dout(B), 0, w)-thin MBS. Let G = ([2^dout(A)], [2^dout(B)], E) be a left-regular bipartite graph with left-degree 2^d. Then,

σ(A ←◦G B) ≥ σ(A),
μ(A ←◦G B) ≤ μ(A),
dout(A ←◦G B) = dout(A) + d,
din(A ←◦G B) = din(A).

We make use of the following claim regarding thinness under the products →◦, ←◦.

Claim 5.5.3. Let A, B be a pair of (dout, 0, w)-MBSs, both thin. Let G = ([2^dout], [2^dout], E) be a left-regular bipartite graph. Then, both A →◦G B and A ←◦G B are thin.

Proof. By Definition 5.5.1, din(A →◦G B) = din(A) + din(B). As both A, B are thin, din(A →◦G B) = 0. Moreover, by Definition 5.5.1, every coefficient of A →◦G B is a product of some coefficient of A with some coefficient of B. As both A, B are thin, their coefficients all equal 1 and so the coefficients of A →◦G B are all 1. The proof for A ←◦G B is similar and we omit it.

5.5.2 The multiplication rules→• ,←• parameterized by a sampler

Definition 5.5.3. Let A = (A_1, . . . , A_{2^dout(A)}) be a (dout(A), din(A), w)-MBS, where

A_i = (((α_i)_1, (A_i)_1), . . . , ((α_i)_{2^din(A)}, (A_i)_{2^din(A)})).

Let B = (B_1, . . . , B_{2^dout(B)}) be a (dout(B), din(B), w)-MBS, where

B_i = (((β_i)_1, (B_i)_1), . . . , ((β_i)_{2^din(B)}, (B_i)_{2^din(B)})).

Let G = ([2^dout(A)], [2^dout(B)], E) be a left-regular bipartite graph with left-degree 2^d. We define the (dout(A), din(A) + din(B) + d, w)-MBS C = A →•G B as follows. For k ∈ [2^din(A)], ℓ ∈ [2^din(B)], and j ∈ [2^d] define

(C_i)_{j,k,ℓ} = (2^{−d}(α_i)_k (β_{Γ_G(i,j)})_ℓ, (A_i)_k (B_{Γ_G(i,j)})_ℓ).

Note that C is an MBS as the product of the stochastic matrices (A_i)_k, (B_{Γ(i,j)})_ℓ is stochastic. Moreover, the dimensions of C are as asserted in the definition. That is, C is a (dout(A), din(A) + din(B) + d, w)-MBS.

Claim 5.5.4. ⟨C_i⟩ = ⟨A_i⟩ E_{j∼Γ_G(i)} ⟨B_j⟩.

Proof.

⟨C_i⟩ = ∑_{j,k,ℓ} 2^{−d}(α_i)_k (β_{Γ_G(i,j)})_ℓ (A_i)_k (B_{Γ_G(i,j)})_ℓ = ∑_k (α_i)_k (A_i)_k · 2^{−d} ∑_{j∈[2^d]} ∑_ℓ (β_{Γ_G(i,j)})_ℓ (B_{Γ_G(i,j)})_ℓ = (∑_k (α_i)_k (A_i)_k) E_{j∼Γ_G(i)} ⟨B_j⟩ = ⟨A_i⟩ E_{j∼Γ_G(i)} ⟨B_j⟩.

Claim 5.5.4 readily implies that

⟨A →•G B⟩ = E_i [⟨A_i⟩ E_{j∼Γ_G(i)} ⟨B_j⟩].   (5.12)
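The following Python sketch (data layout ours, illustrative only) implements the product →•G on the toy bundle encoding used in the earlier sketches, and checks Claim 5.5.4 per index as well as Equation (5.14) for the complete bipartite graph.

```python
# Illustrative: the product A ->. _G B on toy MBSs, checking Claim 5.5.4 and Eq. (5.14).
import numpy as np

realize = lambda bundle: sum(a * M for a, M in bundle)
realize_mbs = lambda mbs: sum(realize(b) for b in mbs) / len(mbs)

def inner_dot(A, B, Gamma):
    """C_i pairs every product (A_i)_k (B_{Gamma(i,j)})_l with coefficient 2^{-d}(alpha)(beta)."""
    deg = len(Gamma[0])
    return [[(a * b / deg, Ak @ Bl)
             for j in Gamma[i] for a, Ak in A[i] for b, Bl in B[j]]
            for i in range(len(A))]

rng = np.random.default_rng(3)
stoch = lambda w: (lambda M: M / M.sum(axis=1, keepdims=True))(rng.random((w, w)))
A = [[(0.5, stoch(3)), (-0.5, stoch(3))] for _ in range(4)]
B = [[(1.0, stoch(3))] for _ in range(8)]

K = [list(range(8))] * 4                       # complete bipartite graph
assert np.allclose(realize_mbs(inner_dot(A, B, K)), realize_mbs(A) @ realize_mbs(B))

G = [[0, 1], [2, 3], [4, 5], [6, 7]]           # a sparse left-regular graph
for i, Ci in enumerate(inner_dot(A, B, G)):
    Bavg = sum(realize(B[j]) for j in G[i]) / len(G[i])
    assert np.allclose(realize(Ci), realize(A[i]) @ Bavg)   # Claim 5.5.4
```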

Definition 5.5.4. Let A = (A_1, . . . , A_{2^dout(A)}) be a (dout(A), din(A), w)-MBS, where

A_i = (((α_i)_1, (A_i)_1), . . . , ((α_i)_{2^din(A)}, (A_i)_{2^din(A)})).

Let B = (B_1, . . . , B_{2^dout(B)}) be a (dout(B), din(B), w)-MBS, where

B_i = (((β_i)_1, (B_i)_1), . . . , ((β_i)_{2^din(B)}, (B_i)_{2^din(B)})).

Let G = ([2^dout(A)], [2^dout(B)], E) be a left-regular bipartite graph with left-degree 2^d. We define the (dout(A), din(A) + din(B) + d, w)-MBS C = A ←•G B as follows. For k ∈ [2^din(A)], ℓ ∈ [2^din(B)], and j ∈ [2^d] define

(C_i)_{j,k,ℓ} = (2^{−d}(α_i)_k (β_{Γ_G(i,j)})_ℓ, (B_{Γ_G(i,j)})_ℓ (A_i)_k).

Claim 5.5.5. ⟨C_i⟩ = (E_{j∼Γ_G(i)} ⟨B_j⟩) ⟨A_i⟩.

Proof.

⟨C_i⟩ = ∑_{j,k,ℓ} 2^{−d}(α_i)_k (β_{Γ_G(i,j)})_ℓ (B_{Γ_G(i,j)})_ℓ (A_i)_k = (2^{−d} ∑_{j∈[2^d]} ∑_ℓ (β_{Γ_G(i,j)})_ℓ (B_{Γ_G(i,j)})_ℓ)(∑_k (α_i)_k (A_i)_k) = (E_{j∼Γ_G(i)} ⟨B_j⟩) ⟨A_i⟩.

By Claim 5.5.5,

⟨A ←•G B⟩ = E_i [(E_{j∼Γ_G(i)} ⟨B_j⟩) ⟨A_i⟩].   (5.13)

The following claim readily follows by Equations (5.8), (5.10), (5.12), and (5.13).

Claim 5.5.6. Let A = (A_1, . . . , A_{2^dout(A)}) be a (dout(A), din(A), w)-MBS. Let B = (B_1, . . . , B_{2^dout(B)}) be a (dout(B), din(B), w)-MBS. Let G = ([2^dout(A)], [2^dout(B)], E) be a left-regular bipartite graph. Then,

⟨A →◦G B⟩ = ⟨A →•G B⟩,
⟨A ←◦G B⟩ = ⟨A ←•G B⟩.

Claim 5.5.6 together with Equation (5.9) and Equation (5.11) implies that

⟨A →•K B⟩ = ⟨A⟩⟨B⟩,
⟨A ←•K B⟩ = ⟨B⟩⟨A⟩,   (5.14)

where K is the complete bipartite graph on [2^dout(A)] × [2^dout(B)].

The following lemma shows that the matrix that is realized by the product A →•G B approximates ⟨A⟩⟨B⟩, where the approximation guarantee is determined by the parameters of the sampler G (and those of A, B).

Lemma 5.5.3. Let A = (A_1, . . . , A_{2^dout(A)}) be a (dout(A), din(A), w)-MBS and B = (B_1, . . . , B_{2^dout(B)}) a (dout(B), din(B), w)-MBS. Let G = ([2^dout(A)], [2^dout(B)], E) be an (ε, δ)-sampler with δ ≤ 1/2. Then,

∥⟨A →•G B⟩ − ⟨A⟩⟨B⟩∥_max ≤ 4w · 2^{μ(B)/2} (2^{μ(A)/2} δ + 2^{−σ(A)/2} ε).   (5.15)

Furthermore, the same bound holds also for

∥⟨A ←•G B⟩ − ⟨B⟩⟨A⟩∥_max,
∥⟨A →◦G B⟩ − ⟨A⟩⟨B⟩∥_max,
∥⟨A ←◦G B⟩ − ⟨B⟩⟨A⟩∥_max.

Proof. We prove Equation (5.15). A similar proof gives the same bound for ∥⟨A ←•G B⟩ − ⟨B⟩⟨A⟩∥_max. The bound for the third and fourth expressions then follows by Claim 5.5.6. Write C = A →•G B = (C_i)_{i=1}^{2^dout(A)} and let Γ : [2^dout(A)] × [2^d] → [2^dout(B)] be the neighborhood function of G, where 2^d is the degree of the sampler. By Claim 5.5.4, for every i ∈ [2^dout(A)], ⟨C_i⟩ = ⟨A_i⟩ E_{j∼Γ(i)} ⟨B_j⟩. Therefore, for every α, β ∈ [w],

⟨C_i⟩_{α,β} = ∑_{γ=1}^{w} ⟨A_i⟩_{α,γ} E_{j∼Γ(i)} ⟨B_j⟩_{γ,β},

and so

⟨C⟩_{α,β} = E_i ⟨C_i⟩_{α,β} = ∑_{γ=1}^{w} E_i [⟨A_i⟩_{α,γ} E_{j∼Γ(i)} ⟨B_j⟩_{γ,β}].   (5.16)

For fixed α, β, γ ∈ [w], define

ε_{γ,β}(i) = E_{j∼Γ(i)} ⟨B_j⟩_{γ,β} − ⟨B⟩_{γ,β}.

Note that |ε_{γ,β}(i)| ≤ 2^{μ(B)/2+1} for all i ∈ [2^dout(A)]. Moreover, as ⟨B⟩_{γ,β} = E_{j∼[2^dout(B)]} ⟨B_j⟩_{γ,β} and since |⟨B_j⟩_{γ,β}| ≤ 2^{μ(B)/2}, Claim 5.2.2 implies that there exists a set S ⊆ [2^dout(A)] with |S| ≥ (1 − δ) · 2^dout(A) such that for all i ∈ S, |ε_{γ,β}(i)| ≤ ε · 2^{μ(B)/2+1}. Therefore,

E_i [⟨A_i⟩_{α,γ} E_{j∼Γ(i)} ⟨B_j⟩_{γ,β}] = E_i [⟨A_i⟩_{α,γ} (⟨B⟩_{γ,β} + ε_{γ,β}(i))] = ⟨A⟩_{α,γ}⟨B⟩_{γ,β} + E_i [⟨A_i⟩_{α,γ} ε_{γ,β}(i)].   (5.17)

As |⟨A_i⟩_{α,γ}| ≤ 2^{μ(A)/2} and |ε_{γ,β}(i)| ≤ 2^{μ(B)/2+1} for all i ∈ [2^dout(A)], we have that

E_i [⟨A_i⟩_{α,γ} ε_{γ,β}(i)] ≤ E_i [⟨A_i⟩_{α,γ} ε_{γ,β}(i) | i ∈ S] + 2^{(μ(A)+μ(B))/2+1} Pr[i ∉ S] ≤ E_i [⟨A_i⟩_{α,γ} ε_{γ,β}(i) | i ∈ S] + 2^{(μ(A)+μ(B))/2+1} δ.   (5.18)

By Jensen's inequality, and using the fact that (⟨A_i⟩_{α,γ})^2 ≥ 0,

(E_i [⟨A_i⟩_{α,γ} ε_{γ,β}(i) | i ∈ S])^2 ≤ E_i [(⟨A_i⟩_{α,γ} ε_{γ,β}(i))^2 | i ∈ S] ≤ (ε 2^{μ(B)/2+1})^2 E_i [(⟨A_i⟩_{α,γ})^2 | i ∈ S] ≤ (ε 2^{μ(B)/2+1})^2 E_i [(⟨A_i⟩_{α,γ})^2] / Pr[i ∈ S] ≤ (ε 2^{μ(B)/2+1})^2 2^{−σ(A)} / Pr[i ∈ S].

As δ ≤ 1/2, we have Pr[i ∈ S] ≥ 1 − δ ≥ 1/2 and so

|E_i [⟨A_i⟩_{α,γ} ε_{γ,β}(i) | i ∈ S]| ≤ 2^{μ(B)/2 − σ(A)/2 + 2} ε.

Equations (5.17), (5.18) then imply

|E_i [⟨A_i⟩_{α,γ} E_{j∼Γ(i)} ⟨B_j⟩_{γ,β}] − ⟨A⟩_{α,γ}⟨B⟩_{γ,β}| ≤ 2^{(μ(A)+μ(B))/2+1} δ + 2^{μ(B)/2 − σ(A)/2 + 2} ε.

As the bound holds for all γ ∈ [w], Equation (5.16) yields

|⟨C⟩_{α,β} − (⟨A⟩⟨B⟩)_{α,β}| ≤ 4w · 2^{μ(B)/2} (2^{μ(A)/2} δ + 2^{−σ(A)/2} ε).

The proof follows as the bound holds for every α, β ∈ [w].

Next, we show that by taking a good enough sampler, the smallness of the product A →•G B (and of the other products) approaches the sum σ(A) + σ(B), and that the magnitude of the product is bounded by μ(A) + μ(B).

Lemma 5.5.4. Let A = (A_1, . . . , A_{2^dout(A)}) be a (dout(A), din(A), w)-MBS and B = (B_1, . . . , B_{2^dout(B)}) a (dout(B), din(B), w)-MBS. Let τ ∈ (0, 1] and G = ([2^dout(A)], [2^dout(B)], E) be an (ε, δ)-sampler with

ε ≤ 2^{−λ(B)−μ(B)−log(1/τ)−3},
δ ≤ 2^{−λ(A)−λ(B)−μ(A)−μ(B)−log(1/τ)−3},   (5.19)

for some λ(A), λ(B) such that 0 ≤ λ(A) ≤ σ(A) and 0 ≤ λ(B) ≤ σ(B). Then,

σ(A →•G B) ≥ λ(A) + λ(B) − τ,
μ(A →•G B) ≤ μ(A) + μ(B).

Proof. Write C = A →•G B = (C_i)_{i=1}^{2^dout(A)} and let Γ : [2^dout(A)] × [2^d] → [2^dout(B)] be the neighborhood function of G, where 2^d is the degree of the sampler. For i ∈ [2^dout(A)], define

ε(i) = E_{j∼Γ(i)} ∥B_j∥_∞^2 − 2^{−σ(B)}.

As G is an (ε, δ)-sampler, and since 0 ≤ ∥B_j∥_∞^2 ≤ 2^{μ(B)} for all j ∈ [2^dout(B)], Claim 5.2.2 implies that there exists a set S ⊆ [2^dout(A)] with |S| ≥ (1 − δ)2^dout(A) such that for every i ∈ S, |ε(i)| ≤ ε 2^{μ(B)}. By Claim 5.5.4, for every i ∈ [2^dout(A)],

⟨C_i⟩ = ⟨A_i⟩ E_{j∼Γ(i)} ⟨B_j⟩.

By Jensen's inequality and since ∥ · ∥_∞ is sub-multiplicative (and sub-additive),

2^{−σ(C)} = E_i ∥C_i∥_∞^2 = E_i ∥⟨A_i⟩ E_{j∼Γ(i)} ⟨B_j⟩∥_∞^2 ≤ E_i [∥A_i∥_∞^2 E_{j∼Γ(i)} ∥B_j∥_∞^2].

Thus,

2^{−σ(C)} ≤ E_i [∥A_i∥_∞^2 (2^{−σ(B)} + ε(i))] = 2^{−σ(A)−σ(B)} + E_i [∥A_i∥_∞^2 ε(i)].   (5.20)

As ∥A_i∥_∞^2 ≤ 2^{μ(A)} and since |ε(i)| ≤ 2^{μ(B)} for all i ∈ [2^dout(A)],

E_i [∥A_i∥_∞^2 ε(i)] ≤ E_i [∥A_i∥_∞^2 ε(i) | i ∈ S] + 2^{μ(A)+μ(B)} Pr[i ∉ S] ≤ ε 2^{μ(B)} E_i [∥A_i∥_∞^2 | i ∈ S] + 2^{μ(A)+μ(B)} δ.

As δ ≤ 1/2, Pr[i ∈ S] ≥ 1 − δ ≥ 1/2 and since ∥A_i∥_∞^2 ≥ 0,

E_i [∥A_i∥_∞^2 | i ∈ S] ≤ (1/Pr[i ∈ S]) E_i [∥A_i∥_∞^2] ≤ 2^{−σ(A)+1}.

Hence, E_i [∥A_i∥_∞^2 ε(i)] ≤ 2^{μ(B)−σ(A)+1} ε + 2^{μ(A)+μ(B)} δ. Plugging this into Equation (5.20), we get

2^{−σ(C)} ≤ 2^{−σ(A)−σ(B)} + 2^{μ(B)−σ(A)+1} ε + 2^{μ(A)+μ(B)} δ.

Substituting for ε, δ we conclude

2^{−σ(C)} ≤ (1 + 3τ/8) 2^{−λ(A)−λ(B)} ≤ 2^{−λ(A)−λ(B)+τ},

where, for the last inequality, we used the fact that 1 + x ≤ e^x for all x.

We move to analyze the magnitude. As ∥ · ∥_∞ is sub-multiplicative (and sub-additive), for every i ∈ [2^dout(A)],

∥C_i∥_∞^2 ≤ ∥A_i∥_∞^2 ∥E_{j∼Γ(i)} ⟨B_j⟩∥_∞^2 ≤ ∥A_i∥_∞^2 E_{j∼Γ(i)} ∥B_j∥_∞^2 ≤ 2^{μ(A)+μ(B)},

which implies that μ(C) ≤ μ(A) + μ(B).

The proof of Lemma 5.5.4, which considers the product →•, can be adapted to prove the same lemma for ←•, which is given by the following lemma.

Lemma 5.5.5. Let A be a (dout(A), din(A), w)-MBS and B a (dout(B), din(B), w)-MBS. Let τ ∈ (0, 1] and G = ([2^dout(A)], [2^dout(B)], E) be an (ε, δ)-sampler for which Equation (5.19) holds. Then, σ(A ←•G B) ≥ λ(A) + λ(B) − τ and μ(A ←•G B) ≤ μ(A) + μ(B).

5.5.3 The multiplication rules→• ,←• parameterized by delta of samplers

In this section we define multiplication rules that are parameterized by the difference, or delta,

between two samplers.

Definition 5.5.5. Let A = (A_1, . . . , A_{2^dout(A)}) be a (dout(A), din(A), w)-MBS, where

A_i = (((α_i)_1, (A_i)_1), . . . , ((α_i)_{2^din(A)}, (A_i)_{2^din(A)})).

Let B = (B_1, . . . , B_{2^dout(B)}) be a (dout(B), din(B), w)-MBS, where

B_i = (((β_i)_1, (B_i)_1), . . . , ((β_i)_{2^din(B)}, (B_i)_{2^din(B)})).

Let D ≥ d ≥ 1 be integers. Let GD = ([2^dout(A)], [2^dout(B)], ED) be a left-regular bipartite graph with left-degree 2^D and Gd = ([2^dout(A)], [2^dout(B)], Ed) a left-regular bipartite graph with left-degree 2^d. We define the (dout(A), din(A) + din(B) + D + 1, w)-MBS C = A →•GD−Gd B as follows: For i ∈ [2^dout(A)], k ∈ [2^din(A)], ℓ ∈ [2^din(B)], and j ∈ [2^D], define

(C_i)^D_{j,k,ℓ} = (2^{−D}(α_i)_k (β_{Γ_{GD}(i,j)})_ℓ, (A_i)_k (B_{Γ_{GD}(i,j)})_ℓ).

For i ∈ [2^dout(A)], k ∈ [2^din(A)], ℓ ∈ [2^din(B)], and j ∈ [2^d], define

(C_i)^d_{j,k,ℓ} = (−2^{−d}(α_i)_k (β_{Γ_{Gd}(i,j)})_ℓ, (A_i)_k (B_{Γ_{Gd}(i,j)})_ℓ).

Finally, C = (C_i)_{i∈[2^dout(A)]} where C_i is the concatenation of the sequences (C_i)^D, (C_i)^d.

Note that C is an MBS as the stochastic property is preserved. Further, by definition,

2^{din(C)} = 2^{din(A)+din(B)} · (2^D + 2^d) ≤ 2^{din(A)+din(B)} · 2^{D+1},

and so we indeed may regard C as having the stated din (see the remark regarding the monotonicity of din in Section 5.4.2). We have the following claim.

Claim 5.5.7. With the notation of Definition 5.5.5, for every i ∈ [2^dout(A)],

⟨C_i⟩ = ⟨A_i⟩ (E_{j∼Γ_{GD}(i)} ⟨B_j⟩ − E_{j∼Γ_{Gd}(i)} ⟨B_j⟩).

Proof. Let i ∈ [2^dout(A)]. As C_i is the concatenation of (C_i)^D and (C_i)^d, ⟨C_i⟩ = ⟨(C_i)^D⟩ + ⟨(C_i)^d⟩. Thus,

⟨C_i⟩ = ∑_{k,ℓ} ∑_{j∈[2^D]} (C_i)^D_{j,k,ℓ} + ∑_{k,ℓ} ∑_{j∈[2^d]} (C_i)^d_{j,k,ℓ}.

The first summand on the RHS of the above equation equals

∑_{k,ℓ} ∑_{j∈[2^D]} (C_i)^D_{j,k,ℓ} = ∑_{k,ℓ} ∑_{j∈[2^D]} 2^{−D}(α_i)_k (β_{Γ_{GD}(i,j)})_ℓ (A_i)_k (B_{Γ_{GD}(i,j)})_ℓ = ∑_k (α_i)_k (A_i)_k E_{j∼Γ_{GD}(i)} ∑_ℓ (β_j)_ℓ (B_j)_ℓ = ⟨A_i⟩ E_{j∼Γ_{GD}(i)} ⟨B_j⟩.

As for the second summand,

∑_{k,ℓ} ∑_{j∈[2^d]} (C_i)^d_{j,k,ℓ} = ∑_{k,ℓ} ∑_{j∈[2^d]} −2^{−d}(α_i)_k (β_{Γ_{Gd}(i,j)})_ℓ (A_i)_k (B_{Γ_{Gd}(i,j)})_ℓ = −∑_k (α_i)_k (A_i)_k E_{j∼Γ_{Gd}(i)} ∑_ℓ (β_j)_ℓ (B_j)_ℓ = −⟨A_i⟩ E_{j∼Γ_{Gd}(i)} ⟨B_j⟩,

which completes the proof.

Claim 5.5.7, together with Equation (5.12), readily yields

⟨A →•GD−Gd B⟩ = ⟨A →•GD B⟩ − ⟨A →•Gd B⟩.   (5.21)

We refer to this property as the linearity of →•.

Lemma 5.5.6. Let A = (A_1, . . . , A_{2^dout(A)}) be a (dout(A), din(A), w)-MBS and B = (B_1, . . . , B_{2^dout(B)}) a (dout(B), din(B), w)-MBS. Let GD = ([2^dout(A)], [2^dout(B)], ED) be an (ε_1, δ_1)-sampler and Gd = ([2^dout(A)], [2^dout(B)], Ed) an (ε_2, δ_2)-sampler. Assume that 0 < ε_1 ≤ ε_2 < 1 and 0 < δ_1 ≤ δ_2 ≤ 1/(4w^2). Denote the degrees of GD, Gd by 2^D, 2^d, respectively, and assume that D ≥ d. Then,

din(A →•GD−Gd B) ≤ din(A) + din(B) + D + 1;
dout(A →•GD−Gd B) = dout(A);
σ(A →•GD−Gd B) ≥ min(2 log(1/ε_2) + σ(A), log(1/δ_2) − μ(A)) − μ(B) − 2 log w − 6;
μ(A →•GD−Gd B) ≤ μ(A) + μ(B) + 2.

Proof. The assertions regarding din, dout readily follow by Definition 5.5.5 as since we assume

D ≥ d. We turn to analyze the smallness of the product. Write C = A →•GD−Gd B = (Ci)2dout(A)

i=1 .

Let Γ1 : [2dout(A)] × [2D] → [2dout(B)] be the neighborhood function of GD and Γ2 : [2dout(A)] ×

[2d] → [2dout(B)] the neighborhood function ofGd. By Claim 5.5.7, for all i ∈ [2dout(A)],

⟨Ci⟩ = ⟨Ai⟩(

Ej∼Γ1(i)

⟨Bj⟩ − Ej∼Γ2(i)

⟨Bj⟩),

221

Page 232: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

and so, using the fact that ∥ · ∥∞ is sub-multiplicative,

∥Ci∥∞ ≤ ∥Ai∥∞∥∥∥ E

j∼Γ1(i)⟨Bj⟩ − E

j∼Γ2(i)⟨Bj⟩∥∥∥∞. (5.22)

By standard norm inequalities (see Claim 5.2.1),

∥∥∥ Ej∼Γ1(i)

⟨Bj⟩ − Ej∼Γ2(i)

⟨Bj⟩∥∥∥∞

≤ w∥∥∥ E

j∼Γ1(i)⟨Bj⟩ − E

j∼Γ2(i)⟨Bj⟩∥∥∥max

≤ w(∥∥∥ E

j∼Γ1(i)⟨Bj⟩ − ⟨B⟩

∥∥∥max

+∥∥∥ E

j∼Γ2(i)⟨Bj⟩ − ⟨B⟩

∥∥∥max

).

(5.23)

Fix α, β ∈ [w]. For s ∈ {1, 2} and i ∈ [2dout(A)], define

εα,βs (i) = Ej∼Γs(i)

⟨Bj⟩α,β − ⟨B⟩α,β.

Note that ⟨B⟩α,β = Ej∼[2dout(B)] ⟨Bj⟩α,β. Thus, as Gs is an (εs, δs)-sampler (here, we refer to GD

by G1 and Gd by G2), and since |⟨Bj⟩α,β| ≤ 2μ(B)/2 for all j ∈ [2dout(B)], there exists a set Sα,βs ⊆

[2dout(A)] of size |Sα,βs | ≥ (1 − δs)2dout(A) such that for every i ∈ Sα,βs , |εα,βs (i)| ≤ 2μ(B)/2+1εs.

Moreover, for every i ∈ [2dout(A)], |εα,βs (i)| ≤ 2μ(B)/2+1. For s ∈ {1, 2} and i ∈ [2dout(A)], define

εs(i) = maxα,β∈[w]

∣∣εα,βs (i)∣∣.

By Equation (5.23),

∥∥∥ Ej∼Γ1(i)

⟨Bj⟩ − Ej∼Γ2(i)

⟨Bj⟩∥∥∥∞

≤ w (ε1(i) + ε2(i)) .

Let

S =w⋂

α,β=1

(Sα,β1 ∩ Sα,β2

).

222

Page 233: ImplicationsofSpace-BoundedComputation...ImplicationsofSpace-BoundedComputation SumeghaGarg adissertation presentedtothefaculty ofprincetonuniversity incandidacyforthedegree ofDoctorofPhilosophy

Note that

|S| ≥(1− (δ1 + δ2)w2) 2dout(A) ≥ (1− 2δ2w2) 2dout(A). (5.24)

Moreover, for every i ∈ S,

ε1(i) + ε2(i) ≤ (ε1 + ε2)2μ(B)/2+1 ≤ ε22μ(B)/2+2. (5.25)

By Equation (5.22),

∥C_i∥^2_∞ ≤ ∥A_i∥^2_∞ w^2 (ε_1(i) + ε_2(i))^2.

Taking expectation over i ∼ [2^{d_out(A)}], we get

2^{−σ(C)} = E_i ∥C_i∥^2_∞ ≤ w^2 E_i [ ∥A_i∥^2_∞ (ε_1(i) + ε_2(i))^2 ]
          ≤ w^2 E_i [ ∥A_i∥^2_∞ (ε_1(i) + ε_2(i))^2 | i ∈ S ] + 2^{μ(A)+μ(B)+4} Pr[i ∉ S]
          ≤ w^2 ε_2^2 2^{μ(B)+4} E_i [ ∥A_i∥^2_∞ | i ∈ S ] + 2^{μ(A)+μ(B)+4} Pr[i ∉ S],    (5.26)

where, for the penultimate inequality, we used the fact that ∥A_i∥^2_∞ ≤ 2^{μ(A)} and ε_s(i) ≤ 2^{μ(B)/2+1} for all i, and the last inequality follows by Equation (5.25). By Equation (5.24),

Pr[i ∉ S] ≤ 2δ_2 w^2.    (5.27)

In particular, Pr[i ∈ S] ≥ 1/2 per our assumption on δ_2. Using the fact that ∥A_i∥^2_∞ ≥ 0,

E_i [ ∥A_i∥^2_∞ | i ∈ S ] ≤ E_i [ ∥A_i∥^2_∞ ] / Pr[i ∈ S] ≤ 2 E_i [ ∥A_i∥^2_∞ ] = 2^{−σ(A)+1}.    (5.28)

Equations (5.26), (5.27), (5.28) then imply 2^{−σ(C)} ≤ 2^{μ(B)+5} w^2 ( ε_2^2 2^{−σ(A)} + 2^{μ(A)} δ_2 ), which concludes the proof regarding the smallness of C.


As for the magnitude, by Claim 5.5.7,

⟨C_i⟩ = ⟨A_i⟩ ( E_{j∼Γ_1(i)} ⟨B_j⟩ − E_{j∼Γ_2(i)} ⟨B_j⟩ ),

and so, as ∥·∥_∞ is sub-multiplicative (and sub-additive),

∥C_i∥_∞ ≤ ∥A_i∥_∞ ∥ E_{j∼Γ_1(i)} ⟨B_j⟩ − E_{j∼Γ_2(i)} ⟨B_j⟩ ∥_∞
        ≤ ∥A_i∥_∞ ( ∥ E_{j∼Γ_1(i)} ⟨B_j⟩ ∥_∞ + ∥ E_{j∼Γ_2(i)} ⟨B_j⟩ ∥_∞ )
        ≤ ∥A_i∥_∞ ( E_{j∼Γ_1(i)} ∥B_j∥_∞ + E_{j∼Γ_2(i)} ∥B_j∥_∞ ).

Hence, by Jensen's inequality,

∥C_i∥^2_∞ ≤ ∥A_i∥^2_∞ ( E_{j∼Γ_1(i)} ∥B_j∥_∞ + E_{j∼Γ_2(i)} ∥B_j∥_∞ )^2
          ≤ ∥A_i∥^2_∞ · 2 ( ( E_{j∼Γ_1(i)} ∥B_j∥_∞ )^2 + ( E_{j∼Γ_2(i)} ∥B_j∥_∞ )^2 )
          ≤ ∥A_i∥^2_∞ · 2 ( E_{j∼Γ_1(i)} ∥B_j∥^2_∞ + E_{j∼Γ_2(i)} ∥B_j∥^2_∞ )
          ≤ 4 · 2^{μ(A)+μ(B)}.

As this holds for all i, μ(C) ≤ μ(A) + μ(B) + 2, as claimed.

Definition 5.5.6. Let A = (A_1, . . . , A_{2^{d_out(A)}}) be a (d_out(A), d_in(A), w)-MBS, where

A_i = ( ((α_i)_1, (A_i)_1), . . . , ((α_i)_{2^{d_in(A)}}, (A_i)_{2^{d_in(A)}}) ).

Let B = (B_1, . . . , B_{2^{d_out(B)}}) be a (d_out(B), d_in(B), w)-MBS, where

B_i = ( ((β_i)_1, (B_i)_1), . . . , ((β_i)_{2^{d_in(B)}}, (B_i)_{2^{d_in(B)}}) ).

Let D ≥ d ≥ 1 be integers. Let G_D = ([2^{d_out(A)}], [2^{d_out(B)}], E_D) be a left-regular bipartite graph


with left-degree 2^D and G_d = ([2^{d_out(A)}], [2^{d_out(B)}], E_d) a left-regular bipartite graph with left-degree 2^d. We define the (d_out(A), d_in(A) + d_in(B) + D + 1, w)-MBS C = A ←•_{G_D−G_d} B as follows: For i ∈ [2^{d_out(A)}], k ∈ [2^{d_in(A)}], ℓ ∈ [2^{d_in(B)}], and j ∈ [2^D], define

(C_i)^D_{j,k,ℓ} = ( 2^{−D} (α_i)_k (β_{Γ_{G_D}(i,j)})_ℓ , (B_{Γ_{G_D}(i,j)})_ℓ (A_i)_k ).

For i ∈ [2^{d_out(A)}], k ∈ [2^{d_in(A)}], ℓ ∈ [2^{d_in(B)}], and j ∈ [2^d], define

(C_i)^d_{j,k,ℓ} = ( −2^{−d} (α_i)_k (β_{Γ_{G_d}(i,j)})_ℓ , (B_{Γ_{G_d}(i,j)})_ℓ (A_i)_k ).

Finally, C = (C_i)_{i∈[2^{d_out(A)}]} where C_i is the concatenation of the sequences C_i^D, C_i^d.

Similarly to the product →•, one can show that

⟨A ←•_{G_D−G_d} B⟩ = E_i [ ( E_{j∼Γ_{G_D}(i)} ⟨B_j⟩ − E_{j∼Γ_{G_d}(i)} ⟨B_j⟩ ) ⟨A_i⟩ ],

and that ⟨A ←•_{G_D−G_d} B⟩ = ⟨A ←•_{G_D} B⟩ − ⟨A ←•_{G_d} B⟩. The following lemma follows by similar arguments to those used to prove Lemma 5.5.6.

Lemma 5.5.7. Let A be a (d_out(A), d_in(A), w)-MBS and B a (d_out(B), d_in(B), w)-MBS. Let G_D = ([2^{d_out(A)}], [2^{d_out(B)}], E_D) be an (ε_1, δ_1)-sampler and G_d = ([2^{d_out(A)}], [2^{d_out(B)}], E_d) an (ε_2, δ_2)-sampler. Assume that 0 < ε_1 ≤ ε_2 < 1 and 0 < δ_1 ≤ δ_2 ≤ 1/(4w^2). Denote the degrees of G_D, G_d by 2^D, 2^d, respectively, and assume that D ≥ d. Then,

d_in(A ←•_{G_D−G_d} B) ≤ d_in(A) + d_in(B) + D + 1;
d_out(A ←•_{G_D−G_d} B) = d_out(A);
σ(A ←•_{G_D−G_d} B) ≥ min( 2 log(1/ε_2) + σ(A), log(1/δ_2) − μ(A) ) − μ(B) − 2 log w − 6;
μ(A ←•_{G_D−G_d} B) ≤ μ(A) + μ(B) + 2.


5.6 Leveled Matrix Representations

Definition 5.6.1. A (k, w)-matrix representation is a sequence A = ((a_0, A_0), . . . , (a_k, A_k)) where:

• a_i ≥ 0 are real numbers and A_i are MBSs; and

• for every i ≥ 1, σ(A_i) ≥ i − (i − 1)τ, where τ = 1/(10k^2).

The matrix that is realized by A is defined by ⟨A⟩ = ∑_{i=0}^{k} a_i ⟨A_i⟩. We define the weight of A by ϑ(A) = ∑_i a_i.

Remark regarding τ. Ideally, the property σ(A_i) ≥ i − (i − 1)τ would have been replaced by σ(A_i) ≥ i, which captures in a cleaner way the fact that the smallness, or more precisely, the bound we can guarantee on the smallness, increases with i. However, the machinery we developed in Section 5.5 does not allow us to maintain such an invariant. Thus, we are forced to introduce and work with this small relaxation.

Matrix representations capture the way in which we represent matrices. However, we will

require, and maintain, some more structure. We find it useful to define this extra structure “on

top” of the basic definition rather thanmix them into one. We start with some preparations. For

integers k ≥ g ≥ 1, define the function level_{k,g} : {0, g, g+1, . . . , k} → ℕ by

level_{k,g}(i) = 0 if i = 0, and level_{k,g}(i) = 1 + ⌊log(i/g)⌋ if i ≥ 1.

When k, g are clear from context, we omit them from the subscript and simply write level(i).

Note that if i, j > 0 are such that level(i) = level(j) then i/2 ≤ j ≤ 2i. From this point on, for

simplicity, we assume that g divides k.
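To make the bucketing by level concrete, the following small Python sketch (not part of the thesis; the values of g and k in it are arbitrary illustrative choices) evaluates level_{k,g} and checks the property stated above that indices sharing a level are within a factor of two of each other.

    import math

    def level(i, g):
        """level_{k,g}(i): 0 for i = 0, and 1 + floor(log2(i/g)) for i >= 1."""
        if i == 0:
            return 0
        return 1 + math.floor(math.log2(i / g))

    # Sanity check: if level(i) == level(j) for i, j > 0, then i/2 <= j <= 2*i.
    g, k = 4, 64
    for i in range(g, k + 1, g):
        for j in range(g, k + 1, g):
            if level(i, g) == level(j, g):
                assert i / 2 <= j <= 2 * i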

For ease of readability, from this point on we define the function ω(w) = 2 logw+ 6. When

w is clear from the context, we omit it and write ω instead of ω(w). We remind the reader that all

matrices considered are of order w× w.


Definition 5.6.2. Let k, g, w be integers such that

k ≥ g ≥ 10(ω + log k).    (5.29)

A (k, g, w)-leveled matrix representation (LMR for short) A is a (k, w)-matrix representation A = ((a_0, A_0), . . . , (a_k, A_k)) such that

• A_0 is thin and a_0 = 1;

• a_i = 0 for all i that are not divisible by g; and

• μ(A_i) ≤ i.

Moreover, for every i, j ∈ {0, g, . . . , k},

• If level(i) = level(j) then d_out(A_i) = d_out(A_j); and

• If level(i) > level(j) then d_out(A_i) ≥ d_out(A_j) + 10k.

As we care mainly about ⟨A⟩, the matrix that is realized by A, whenever a_i = 0 we also write A_i = ∅.

Definition 5.6.3. Let δout, δin, μ′, ϑ : R → R be monotone non-decreasing functions. Let A =

((1,A0), . . . , (ak,Ak)) be a (k, g,w)-LMR.We say that:

• A respects the out-function δout if dout(Ai) ≤ δout(i) for all i ≥ 0;

• A respects the in-function δin if din(Ai) ≤ δin(i) for all i > 0;

• A respects the magnitude-function μ′ if μ(Ai) ≤ μ′(i) for all i > 0;

• A respects the weight-function ϑ if ai ≤ ϑ(i) for all i > 0;

• A respects (δout, δin, μ′, ϑ) if A respects the out-function δout, the in-function δin, the

magnitude-function μ′, and the weight-function ϑ.


Remark. Note that we do not make any requirement on d_in, μ and ϑ for i = 0. This is because in some cases the functions δ_in, μ′, ϑ that we work with are not well-defined at i = 0. While one can always tweak the functions appropriately, it is cumbersome and, in any case, as A is an LMR, a_0 = 1 and A_0 is thin, and so d_in(A_0) = μ(A_0) = 0.

We sometimes abuse notation and also use dout, din, μ instead of introducing the notation

δout, δin, μ′. The meaning will always be clear from context.

5.7 The Family F(A, B)

From this point, given an integer k, we set

δ = 2^{−5k}.    (5.30)

For integers n, d, let BS(n, d) be the balanced sampler BSamp(n, 2^{−d}, 2^{−d}) = ([n], [n], E) that is given by Theorem 5.2.1. By Theorem 5.2.1, the degree of BS(n, d) is O(2^{3d}). For ease of readability, we omit n and write BS(d) whenever n is clear from context. For integers ℓ, r, d for which ℓ ≥ r/δ^2, let US(ℓ, r, d) be the sampler UBSamp(ℓ, r, 2^{−d}, δ) = ([ℓ], [r], E) that is given by Theorem 5.2.3. By Theorem 5.2.3, the degree of US(ℓ, r, d) is O((2^d · 5k)^{c_samp}) = O((2^d · k)^{c_samp}). When ℓ, r are clear from context we omit them and write US(d).

Definition 5.7.1. Let A = ((1, A_0), . . . , (a_k, A_k)), B = ((1, B_0), . . . , (b_k, B_k)) be a pair of (k, g, w)-LMRs. Assume that d_out(A_i) = d_out(B_i) for all i. Define F(A, B) to be the following collection of MBSs:

1. {A_0 →◦_{BS(2g)} B_0} ∪ {A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0 | r = 1, . . . , log(k/g)};

2. For every j ∈ {g, 2g, . . . , k},

{B_j ←◦_{US(g)} A_0} ∪ {B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0 | r = 0, 1, . . . , log(k/g)};

3. For every i ∈ {g, 2g, . . . , k},

{A_i →◦_{US(g)} B_0} ∪ {A_i →•_{US(2^{r+1}g)−US(2^r g)} B_0 | r = 0, 1, . . . , log(k/g)};

4. For every i, j ∈ {g, 2g, . . . , k} such that level(i) = level(j),

{A_i →•_{BS(8i)} B_j} ∪ {A_i →•_{BS(2^{r+1}·8i)−BS(2^r·8i)} B_j | r = 0, 1, . . . , log(k/i)};

5. For i, j ∈ {g, 2g, . . . , k} such that level(i) > level(j),

{A_i →•_{US(8j)} B_j} ∪ {A_i →•_{US(2^{r+1}·8j)−US(2^r·8j)} B_j | r = 0, 1, . . . , log(k/j)};

6. For i, j ∈ {g, 2g, . . . , k} such that level(j) > level(i),

{B_j ←•_{US(8i)} A_i} ∪ {B_j ←•_{US(2^{r+1}·8i)−US(2^r·8i)} A_i | r = 0, 1, . . . , log(k/i)}.

Remark on the validity of Definition 5.7.1. The MBSs listed in Definition 5.7.1 are obtained by multiplying MBSs, where the product is parameterized by a balanced or an unbalanced sampler (or the delta of such). Therefore, one must verify that the MBSs that are being multiplied have d_out as required by the corresponding sampler. This indeed holds for all MBSs listed in Definition 5.7.1. Indeed,

• For all products that are parameterized by an unbalanced sampler (or by the delta of such), the requirement regarding the ratio between the sides of the sampler holds. Indeed, by the hypothesis, and since A, B are LMRs, for every i, j ∈ {0, g, . . . , k} with level(i) > level(j) it holds that

d_out(A_i) ≥ d_out(A_j) + 10k = d_out(B_j) + 10k

(and similarly, d_out(B_i) ≥ d_out(A_j) + 10k). Hence, the ratio between the two sides of the sampler is bounded below by 2^{10k} = δ^{−2}, per Equation (5.30), as required by Theorem 5.2.3.

• When taking a product with balanced samplers (or the delta of such), the two sides of the samplers are of equal size, as for i, j with level(i) = level(j) it holds that d_out(A_i) = d_out(A_j) = d_out(B_j).

We set some useful notation. Let A, B be a pair of (k, g, w)-LMRs. For i, j ∈ {0, g, 2g, . . . , k} we let S_{i,j} be the sum of all matrices that are realized by MBSs in the corresponding item of Definition 5.7.1. Let C ∈ F(A, B) and let i, j ∈ {0, g, 2g, . . . , k} be such that C is obtained by taking the product of A_i and B_j when parameterized by some sampler or delta of such. We denote the corresponding indices by i(C), j(C). (By just looking at an MBS C, the indices i(C) and j(C) are not well defined; however, given the context of C belonging to F(A, B), these quantities are well defined, and we will use the notation only in such a context.)

The following claim states that the sum of all MBSs in F(A,B), when weighted properly,

approximates the product ⟨A⟩⟨B⟩.

Claim 5.7.1. Let A, B be a pair of (k, g, w)-LMRs. Then,

∥ ⟨A⟩⟨B⟩ − ∑_{i,j=0}^{k} a_i b_j S_{i,j} ∥_max ≤ 8w ϑ(A) ϑ(B) 2^{−k}.

Proof. For i = j = 0 we have

S_{0,0} = ⟨A_0 →◦_{BS(2g)} B_0⟩ + ∑_{r=1}^{log(k/g)} ⟨A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0⟩ = ⟨A_0 →•_{BS(2k)} B_0⟩,

where the last equality follows by Claim 5.5.6 and by the linearity of →• (see Equation (5.21)). As BS(2k) is a (2^{−2k}, 2^{−2k})-sampler, Lemma 5.5.3 implies that ∥⟨A_0⟩⟨B_0⟩ − S_{0,0}∥_max ≤ 8w 2^{−2k} ≤ 8w 2^{−k}.


Similarly, for every j ∈ {g, 2g, . . . , k},

S_{0,j} = ⟨B_j ←◦_{US(g)} A_0⟩ + ∑_{r=0}^{log(k/g)} ⟨B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0⟩ = ⟨B_j ←•_{US(2k)} A_0⟩.

As US(2k) is a (2^{−2k}, δ)-sampler, Lemma 5.5.3 yields ∥⟨A_0⟩⟨B_j⟩ − S_{0,j}∥_max ≤ 8w 2^{−k} (assuming that μ(B_j) ≤ j ≤ k). In the same way one can show that for i ∈ {g, 2g, . . . , k}, ∥⟨A_i⟩⟨B_0⟩ − S_{i,0}∥_max ≤ 8w 2^{−k}. Consider i, j ∈ {g, 2g, . . . , k} with level(i) = level(j), namely, MBSs from Item 4 of Definition 5.7.1. By the linearity of →•, S_{i,j} = ⟨A_i →•_{BS(16k)} B_j⟩ (we assumed that d_out(A_i) = d_out(A_j) whenever level(i) = level(j)) and so, by Lemma 5.5.3,

∥⟨A_i⟩⟨B_j⟩ − S_{i,j}∥_max ≤ 4w 2^{μ(B_j)/2} · ( 2^{μ(A_i)/2 − 16k} + 2^{−σ(A_i)/2 − 16k} ) ≤ 8w 2^{−k}.

The same bound can be shown to hold for MBSs from Items 5, 6 of Definition 5.7.1. We show here the case of Item 5. By the linearity of →•, S_{i,j} = ⟨A_i →•_{US(16k)} B_j⟩ and so, by Lemma 5.5.3,

∥⟨A_i⟩⟨B_j⟩ − S_{i,j}∥_max ≤ 4w 2^{μ(B_j)/2} · ( 2^{μ(A_i)/2 − 5k} + 2^{−σ(A_i)/2 − 16k} ) ≤ 8w 2^{−k}.

Thus, altogether we established that for every i, j ∈ {0, g, 2g, . . . , k},

∥⟨A_i⟩⟨B_j⟩ − S_{i,j}∥_max ≤ 8w 2^{−k}.    (5.31)

Now,

⟨A⟩⟨B⟩ = ( ∑_{i=0}^{k} a_i ⟨A_i⟩ ) ( ∑_{j=0}^{k} b_j ⟨B_j⟩ ) = ∑_{i,j=0}^{k} a_i b_j ⟨A_i⟩⟨B_j⟩.


Equation (5.31) together with the triangle inequality then implies

∥ ⟨A⟩⟨B⟩ − ∑_{i,j=0}^{k} a_i b_j S_{i,j} ∥_max ≤ ∑_{i,j=0}^{k} a_i b_j ∥⟨A_i⟩⟨B_j⟩ − S_{i,j}∥_max ≤ 8w ϑ(A) ϑ(B) 2^{−k}.

5.7.1 Basic properties of the MBSs in F(A, B): d_out, d_in, μ, σ

In this section we give a series of claims that analyze the MBSs in F(A, B) in terms of their d_out, d_in, magnitude μ, and smallness σ. Throughout this section, A, B is a pair of (k, g, w)-LMRs as in Definition 5.7.1. We further recall that δ = 2^{−5k} per Equation (5.30) and that τ = 1/(10k^2) per Definition 5.6.1. We start by considering the MBSs that are given in Item 1 of Definition 5.7.1.

Claim 5.7.2. The MBS A_0 →◦_{BS(2g)} B_0 is thin and d_out(A_0 →◦_{BS(2g)} B_0) ≤ d_out(A_0) + 7g. Moreover, for every r ∈ {1, . . . , log(k/g)},

d_in(A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0) ≤ 2^{r+3} g;
d_out(A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0) = d_out(A_0);
σ(A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0) ≥ 2^r g − ω;
μ(A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0) ≤ 2.

Proof. As both A_0, B_0 are thin, Claim 5.5.3 implies that A_0 →◦_{BS(2g)} B_0 is thin. As the sampler BS(2g) has degree O(2^{6g}), which we assume is bounded by 2^{7g}, Lemma 5.5.1 implies that d_out(A_0 →◦_{BS(2g)} B_0) ≤ d_out(A_0) + 7g, as stated. Moving to the moreover part, fix r ∈ {1, . . . , log(k/g)} and write C = A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0. By Lemma 5.5.6, whose hypothesis is satisfied per Equation (5.29), and using the fact that A_0, B_0 are thin, we get d_in(C) = 3 · 2^{r+1} g + O(1), which yields the stated bound. The assertion regarding d_out(C) readily follows by Definition 5.5.5. As for the smallness, Lemma 5.5.6 implies that σ(C) ≥ 2^r g − ω. Lastly, as μ(A_0) = μ(B_0) = 0, Lemma 5.5.6 implies that μ(C) ≤ 2.

Claim 5.7.3. For every j ∈ {g, 2g, . . . , k},

d_in(B_j ←◦_{US(g)} A_0) = d_in(B_j);
d_out(B_j ←◦_{US(g)} A_0) ≤ d_out(B_j) + 2c_samp g;
σ(B_j ←◦_{US(g)} A_0) ≥ σ(B_j);
μ(B_j ←◦_{US(g)} A_0) ≤ μ(B_j).

Moreover, for every r ∈ {0, 1, . . . , log(k/g)},

d_in(B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0) ≤ d_in(B_j) + c_samp 2^{r+2} g;
d_out(B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0) = d_out(B_j);
σ(B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0) ≥ min( σ(B_j) + 2^r g, k + 1 );
μ(B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0) ≤ μ(B_j) + 2.

Proof. As A_0 is thin, Lemma 5.5.2 implies the assertions regarding d_in(B_j ←◦_{US(g)} A_0), σ(B_j ←◦_{US(g)} A_0) and μ(B_j ←◦_{US(g)} A_0). The bound d_out(B_j ←◦_{US(g)} A_0) ≤ d_out(B_j) + 2c_samp g follows as the degree of US(g) is O((2^g · k)^{c_samp}) ≤ 2^{2c_samp g}, where we used the fact that g ≥ 10 log k. Moving to the moreover part of the claim, denote C = B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0. Recall that the degree of US(2^{r+1}g) is O((2^{2^{r+1}g} · k)^{c_samp}) ≤ 2^{c_samp 2^{r+2} g} per our assumption g ≥ 10 log k (and g larger than a large enough constant). Lemma 5.5.7 then implies that d_in(C) ≤ d_in(B_j) + c_samp 2^{r+2} g. The assertion regarding d_out(C) follows by definition, and the bound on the magnitude follows by Lemma 5.5.7 and since A_0 is thin. As for the smallness, by Lemma 5.5.7,

σ(C) ≥ min( 2^{r+1} g + σ(B_j), 5k − μ(B_j) ) − ω ≥ min( σ(B_j) + 2^r g, k + 1 ),

where in the above inequality we used the hypothesis g ≥ ω and that μ(B_j) + ω ≤ j + ω ≤ 2k.

Claim 5.7.4. For every i ∈ {g, 2g, . . . , k},

d_in(A_i →◦_{US(g)} B_0) = d_in(A_i);
d_out(A_i →◦_{US(g)} B_0) ≤ d_out(A_i) + 2c_samp g;
σ(A_i →◦_{US(g)} B_0) ≥ σ(A_i);
μ(A_i →◦_{US(g)} B_0) ≤ μ(A_i).

Moreover, for every r ∈ {0, 1, . . . , log(k/g)},

d_in(A_i →•_{US(2^{r+1}g)−US(2^r g)} B_0) ≤ d_in(A_i) + c_samp 2^{r+2} g;
d_out(A_i →•_{US(2^{r+1}g)−US(2^r g)} B_0) = d_out(A_i);
σ(A_i →•_{US(2^{r+1}g)−US(2^r g)} B_0) ≥ min( σ(A_i) + 2^r g, k + 1 );
μ(A_i →•_{US(2^{r+1}g)−US(2^r g)} B_0) ≤ μ(A_i) + 2.

The proof of Claim 5.7.4 is similar to the proof of Claim 5.7.3 and we omit it.

Claim 5.7.5. For every i, j ∈ {g, 2g, . . . , k} such that level(i) = level(j),

d_in(A_i →•_{BS(8i)} B_j) ≤ d_in(A_i) + d_in(B_j) + 25i;
d_out(A_i →•_{BS(8i)} B_j) = d_out(A_i);
σ(A_i →•_{BS(8i)} B_j) ≥ min(σ(A_i), i) + min(σ(B_j), j) − τ;
μ(A_i →•_{BS(8i)} B_j) ≤ μ(A_i) + μ(B_j).

Moreover, for every r ∈ {0, 1, . . . , log(k/i)},

d_in(A_i →•_{BS(2^{r+1}·8i)−BS(2^r·8i)} B_j) ≤ d_in(A_i) + d_in(B_j) + 50i · 2^r;
d_out(A_i →•_{BS(2^{r+1}·8i)−BS(2^r·8i)} B_j) = d_out(A_i);
σ(A_i →•_{BS(2^{r+1}·8i)−BS(2^r·8i)} B_j) ≥ 2^{r+2} i;
μ(A_i →•_{BS(2^{r+1}·8i)−BS(2^r·8i)} B_j) ≤ μ(A_i) + μ(B_j) + 2.

Proof. We wish to invoke Lemma 5.5.4. Thus, we first must verify that Equation (5.19) holds for λ(A_i) = min(σ(A_i), i) and λ(B_j) = min(σ(B_j), j). As BS(8i) is a (2^{−8i}, 2^{−8i})-sampler, it suffices to check that

8i ≥ i + j + μ(A_i) + μ(B_j) + log(1/τ) + 3.

As level(i) = level(j) we have j ≤ 2i. Since μ(A_i) ≤ i and μ(B_j) ≤ j it holds that

i + j + μ(A_i) + μ(B_j) + log(1/τ) + 3 ≤ 6i + 2 log k + 7,

where we have used the remark regarding σ that appears after Definition 5.6.1. As i ≥ g ≥ 10 log k, the RHS is indeed bounded by 8i. Lemma 5.5.4 then implies the assertion regarding the smallness and magnitude of A_i →•_{BS(8i)} B_j. The assertion regarding d_out(A_i →•_{BS(8i)} B_j) follows by Definition 5.5.3. Since the degree of BS(8i) is O(2^{24i}), which we assume is bounded by 2^{25i}, the bound on d_in(A_i →•_{BS(8i)} B_j) follows.

Fix r ∈ {0, 1, . . . , log(k/i)} and write C = A_i →•_{BS(2^{r+1}·8i)−BS(2^r·8i)} B_j. Recall that the degree of BS(2^{r+1} · 8i) is O(2^{3·2^{r+1}·8i}) ≤ 2^{49i·2^r}. Therefore, Lemma 5.5.6 implies the asserted bound on d_in(C). The bound on d_out(C) follows by Definition 5.5.5, and the bound on μ(C) readily follows by Lemma 5.5.6. As for the smallness, by Lemma 5.5.6,

σ(C) ≥ min( σ(A_i) + 2^{r+4} i, 2^{r+3} i − μ(A_i) ) − μ(B_j) − ω
     = 2^{r+3} i − μ(A_i) − μ(B_j) − ω
     ≥ 2^{r+3} i − 4i
     ≥ 2^{r+2} i,

where we used the fact that σ(A_i) ≥ i − (i − 1)τ, μ(A_i) ≤ i, μ(B_j) ≤ j ≤ 2i, which follows as level(i) = level(j), and that i ≥ g ≥ ω.

Claim 5.7.6. For every i, j ∈ {g, 2g, . . . , k} such that level(i) > level(j),

d_in(A_i →•_{US(8j)} B_j) ≤ d_in(A_i) + d_in(B_j) + 9c_samp j;
d_out(A_i →•_{US(8j)} B_j) = d_out(A_i);
σ(A_i →•_{US(8j)} B_j) ≥ min(σ(A_i), i) + min(σ(B_j), j) − τ;
μ(A_i →•_{US(8j)} B_j) ≤ μ(A_i) + μ(B_j).

Moreover, for every r ∈ {0, 1, . . . , log(k/j)},

d_in(A_i →•_{US(2^{r+1}·8j)−US(2^r·8j)} B_j) ≤ d_in(A_i) + d_in(B_j) + c_samp 2^{r+5} j;
d_out(A_i →•_{US(2^{r+1}·8j)−US(2^r·8j)} B_j) = d_out(A_i);
σ(A_i →•_{US(2^{r+1}·8j)−US(2^r·8j)} B_j) ≥ min( σ(A_i) + 2^{r+3} j, k + 1 );
μ(A_i →•_{US(2^{r+1}·8j)−US(2^r·8j)} B_j) ≤ μ(A_i) + μ(B_j) + 2.

Proof. Recall that US(8j) is a (2^{−8j}, δ)-sampler where δ = 2^{−5k}. To invoke Lemma 5.5.4, we must first verify that Equation (5.19) holds for λ(A_i) = min(σ(A_i), i) and λ(B_j) = min(σ(B_j), j), namely,

8j ≥ j + μ(B_j) + log(1/τ) + 3,
5k ≥ i + j + μ(A_i) + μ(B_j) + log(1/τ) + 3.

The first inequality holds as

j + μ(B_j) + log(1/τ) + 3 ≤ 2j + log(10k^2) + 3,

which is indeed bounded above by 8j as j ≥ g ≥ 10 log k (see the remark regarding σ that appears after Definition 5.6.1). As for the second inequality,

i + j + μ(A_i) + μ(B_j) + log(1/τ) + 3 ≤ 2i + 2j + log(1/τ) + 3 ≤ 4k + log(10k^2) + 3 ≤ 5k.

Thus, the asserted bounds regarding the smallness and magnitude of A_i →•_{US(8j)} B_j follow by Lemma 5.5.4. That d_out(A_i →•_{US(8j)} B_j) = d_out(A_i) follows by Definition 5.5.3. As for d_in(A_i →•_{US(8j)} B_j), recall that the degree of the sampler US(8j) is O((2^{8j} · k)^{c_samp}) ≤ 2^{9c_samp j}, where the inequality follows as j ≥ g ≥ 10 log k. The bound on d_in(A_i →•_{US(8j)} B_j) then follows by Definition 5.5.3.

Write C = A_i →•_{US(2^{r+1}·8j)−US(2^r·8j)} B_j. The bounds on d_out(C), μ(C) readily follow by Lemma 5.5.6. As US(2^{r+1} · 8j) has degree O((2^{2^{r+1}·8j} · k)^{c_samp}), Lemma 5.5.6 implies the stated bound on d_in(C). As for σ(C), by Lemma 5.5.6,

σ(C) ≥ min( 2^{r+4} j + σ(A_i), 5k − μ(A_i) ) − μ(B_j) − ω ≥ min( σ(A_i) + 2^{r+3} j, k + 1 ),

which completes the proof.

Claim 5.7.7. For every i, j ≥ g such that level(i) < level(j),

d_in(B_j ←•_{US(8i)} A_i) ≤ d_in(A_i) + d_in(B_j) + 9c_samp i;
d_out(B_j ←•_{US(8i)} A_i) = d_out(B_j);
σ(B_j ←•_{US(8i)} A_i) ≥ min(σ(A_i), i) + min(σ(B_j), j) − τ;
μ(B_j ←•_{US(8i)} A_i) ≤ μ(A_i) + μ(B_j).

Moreover, for every r ∈ {0, 1, . . . , log(k/i)},

d_in(B_j ←•_{US(2^{r+1}·8i)−US(2^r·8i)} A_i) ≤ d_in(A_i) + d_in(B_j) + c_samp 2^{r+5} i;
d_out(B_j ←•_{US(2^{r+1}·8i)−US(2^r·8i)} A_i) = d_out(B_j);
σ(B_j ←•_{US(2^{r+1}·8i)−US(2^r·8i)} A_i) ≥ min( σ(B_j) + 2^{r+3} i, k + 1 );
μ(B_j ←•_{US(2^{r+1}·8i)−US(2^r·8i)} A_i) ≤ μ(A_i) + μ(B_j) + 2.

The proof of Claim 5.7.7 is similar to the proof of Claim 5.7.6 and we omit the details.

5.7.2 The slices of F(A, B)

In this section we define the s-slice of F(A, B) which, roughly speaking, consists of all MBSs C ∈ F(A, B) for which s is the best lower bound we can give on σ(C).

Definition 5.7.2. Let A = ((1, A_0), . . . , (a_k, A_k)), B = ((1, B_0), . . . , (b_k, B_k)) be a pair of (k, g, w)-LMRs. Let s ∈ {0, 1, . . . , k}. Define F_s(A, B) to be the following collection of MBSs:

1. A_0 →◦_{BS(2g)} B_0 if s = 0, and A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0 if there is r ∈ {1, . . . , log(k/g)} such that s = (2^r − 1)g;

2. B_s ←◦_{US(g)} A_0, and B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0 for all r ∈ {0, 1, . . . , log(k/g)} and j ∈ {g, 2g, . . . , k} such that j + 2^r g = s;

3. A_s →◦_{US(g)} B_0, and A_i →•_{US(2^{r+1}g)−US(2^r g)} B_0 for all r ∈ {0, 1, . . . , log(k/g)} and i ∈ {g, 2g, . . . , k} such that i + 2^r g = s;

4. A_i →•_{BS(8i)} B_j for every i, j ∈ {g, 2g, . . . , k} such that level(i) = level(j) and i + j = s, as well as A_i →•_{BS(2^{r+1}·8i)−BS(2^r·8i)} B_j for every i, j ∈ {g, 2g, . . . , k} and r ∈ {0, 1, . . . , log(k/i)} such that level(i) = level(j) and 2^{r+2} i = s;

5. A_i →•_{US(8j)} B_j for every i, j ∈ {g, 2g, . . . , k} such that level(i) > level(j) and i + j = s, as well as A_i →•_{US(2^{r+1}·8j)−US(2^r·8j)} B_j for every i, j ∈ {g, 2g, . . . , k} and r ∈ {0, 1, . . . , log(k/j)} such that level(i) > level(j) and i + 2^{r+3} j = s;

6. B_j ←•_{US(8i)} A_i for every i, j ∈ {g, 2g, . . . , k} such that level(j) > level(i) and i + j = s, as well as B_j ←•_{US(2^{r+1}·8i)−US(2^r·8i)} A_i for every i, j ∈ {g, 2g, . . . , k} and r ∈ {0, 1, . . . , log(k/i)} such that level(j) > level(i) and j + 2^{r+3} i = s.

We start by analyzing the slices of F(A, B).

Claim 5.7.8. Let A, B be a pair of (k, g, w)-LMRs. Then, for every s that is not divisible by g, F_s(A, B) = ∅.

Proof. By inspecting the MBSs in Definition 5.7.2, one can readily see that the MBSs in F_s(A, B) are products of MBSs A_i, B_j such that a·i + b·j + c·g = s for some integers a, b, c. As A, B are (k, g, w)-LMRs, both i, j are divisible by g and so s is also divisible by g. Put differently, for s not divisible by g, the collection F_s(A, B) is empty.

Claim 5.7.9. Let A, B be a pair of (k, g, w)-LMRs. Then, for every s ∈ {g, 2g, . . . , k} and C ∈ F_s(A, B), it holds that

σ(C) ≥ s − (s − 1)τ.

Moreover,

{C ∈ F(A, B) | σ(C) ≤ k} ⊆ ⋃_{s∈{0,g,2g,...,k}} F_s(A, B).    (5.32)

Proof. Consider the MBS C = A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0 where r ∈ {1, . . . , log(k/g)} is such that s = (2^r − 1)g. By Claim 5.7.2, σ(C) ≥ s, as desired. By Claim 5.7.3, σ(B_s ←◦_{US(g)} A_0) ≥ σ(B_s). As B is an LMR, σ(B_s) ≥ s − (s − 1)τ, as desired. Consider the MBS C = B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0 for r ∈ {0, 1, . . . , log(k/g)} and j ∈ {g, 2g, . . . , k} such that j + 2^r g = s. By Claim 5.7.3, σ(C) ≥ min(σ(B_j) + 2^r g, k + 1). If σ(B_j) + 2^r g > k + 1 then σ(C) > k ≥ s and we are done. Otherwise, using that B is an LMR,

σ(C) ≥ σ(B_j) + 2^r g ≥ j − (j − 1)τ + 2^r g ≥ s − (s − 1)τ.

A similar argument can be used to prove the assertion for MBSs from Item 3 of Definition 5.7.2 and we omit the details.

Consider now the MBS A_i →•_{BS(8i)} B_j where i, j ∈ {g, 2g, . . . , k} are such that level(i) = level(j) and i + j = s. By Claim 5.7.5 and since A, B are LMRs,

σ(A_i →•_{BS(8i)} B_j) ≥ min(σ(A_i), i) + min(σ(B_j), j) − τ ≥ i − (i − 1)τ + j − (j − 1)τ − τ = s − (s − 1)τ,

as stated. Let C = A_i →•_{BS(2^{r+1}·8i)−BS(2^r·8i)} B_j where i, j ∈ {g, 2g, . . . , k} and r ∈ {0, 1, . . . , log(k/i)} are such that level(i) = level(j) and 2^{r+2} i = s. By Claim 5.7.5, σ(C) ≥ 2^{r+2} i = s, as desired. A similar argument can be used for the remaining MBSs in F_s(A, B), for which level(i) ≠ level(j), and we omit the details.

Moving to the moreover part, a careful inspection of Definition 5.7.1, Definition 5.7.2 and the claims in Section 5.7.1 yields that we did not "leave out" any MBS of smallness not larger than k in Definition 5.7.2. This, together with the fact that σ(C) ≥ s − (s − 1)τ for all C ∈ F_s(A, B), yields

{C ∈ F(A, B) | σ(C) ≤ k} ⊆ ⋃_{s=0}^{k} F_s(A, B).


We omit the details of the proof. Equation (5.32) then follows by Claim 5.7.8.

Claim 5.7.10. Let A, B be a pair of (k, g, w)-LMRs. Then, for every s ∈ {g, 2g, . . . , k}, |F_s(A, B)| = O((s/g)^2).

Proof. Clearly, Item 1 in Definition 5.7.2 contributes at most one MBS to F_s(A, B). As for Item 2, for every fixed j, the number of MBSs contributed is one. As B is an LMR, we only need to consider j that is divisible by g, and so the total number of MBSs contributed by Item 2 is O(s/g). As A is also an LMR, a similar argument gives the same bound on the number of MBSs coming from Item 3.

Moving on to Item 4, the number of MBSs of the form A_i →•_{BS(8i)} B_j is equal to the number of solutions to i + j = s. As i, j are divisible by g, the number of solutions is O(s/g). The remaining MBSs in Item 4 are of the form A_i →•_{BS(2^{r+1}·8i)−BS(2^r·8i)} B_j where i, j ∈ {g, 2g, . . . , k}, level(i) = level(j), and r ∈ {0, 1, . . . , log(k/i)} is such that 2^{r+2} i = s. As i ≥ g and i is divisible by g, the number of (i, r) pairs that satisfy the latter equation is O(s/g). For every such (i, r) pair, the number of j's for which level(i) = level(j) is O(i/g). Indeed, the latter constraint implies that i/2 ≤ j ≤ 2i, and j is divisible by g. Summing over all these values, we conclude that the total number of MBSs of the latter form is O((s/g)^2). Similar arguments can be used to bound the number of MBSs from Item 5 and Item 6 by O((s/g)^2) and we omit the details.

5.7.3 Analyzing d_out, d_in, μ of the slices of F(A, B)

In this section we further analyze the MBSs in Fs(A,B) based on the calculations done in Sec-

tion 5.7.1. We start by analyzing din(C) for MBSs C ∈ Fs(A,B). Then, in Claim 5.7.12 and

Claim 5.7.13, we analyze dout(C) and μ(C), respectively.

Claim 5.7.11. Let c_in = 100c_samp, where c_samp ≥ 1 is the constant from Theorem 5.2.3. Assume that A, B is a pair of (k, g, w)-LMRs that respect the in-function d_in(i) = c_in i log i. Then, for every s ∈ {g, 2g, . . . , k} and C ∈ F_s(A, B), d_in(C) ≤ c_in s log s.


Proof. Consider the MBS C = A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0 with s = (2^r − 1)g, assuming such r exists, as defined in Item 1 of Definition 5.7.2. By Claim 5.7.2, d_in(C) ≤ 2^{r+3} g. It therefore suffices to show that

2^{r+3} g ≤ c_in (2^r − 1) g log((2^r − 1)g),

which holds as c_in ≥ 16.

Moving to Item 2 of Definition 5.7.2, consider the MBS B_s ←◦_{US(g)} A_0. By Claim 5.7.3, d_in(B_s ←◦_{US(g)} A_0) = d_in(B_s), which by the hypothesis is bounded above by c_in s log s, as desired. Now, let C = B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0 where r ∈ {0, 1, . . . , log(k/g)} and j are such that s = j + 2^r g. By Claim 5.7.3, d_in(C) ≤ d_in(B_j) + c_samp 2^{r+2} g. It therefore suffices to prove that

d_in(B_j) + c_samp 2^{r+2} g ≤ c_in (j + 2^r g) log(j + 2^r g).

As B respects the in-function d_in(j) = c_in j log j, it suffices to show that c_samp 2^{r+2} g ≤ c_in 2^r g, which holds by our choice of c_in. A similar calculation, using Claim 5.7.4, can be applied for analyzing the MBSs that are given by Item 3 of Definition 5.7.1. We omit the details.

Take i, j ∈ {g, 2g, . . . , k} with level(i) = level(j) such that i + j = s. Consider the MBS C = A_i →•_{BS(8i)} B_j. By Claim 5.7.5, d_in(C) ≤ d_in(A_i) + d_in(B_j) + 25i, and so we ought to show that

c_in i log i + c_in j log j + 25i ≤ c_in (i + j) log(i + j).

Observe that it suffices to prove that the above equation holds for i ≥ j. Rearranging, and using the fact that j ≤ i ≤ k, it suffices to verify that

25i ≤ c_in i log(1 + j/i).

As level(i) = level(j), j ≥ i/2, and so one only needs to verify that 25i ≤ c_in i/2, which holds as c_in ≥ 50.

Consider an MBS of the form C = A_i →•_{BS(2^{r+1}·8i)−BS(2^r·8i)} B_j where i, j ∈ {g, 2g, . . . , k} and r ∈ {0, 1, . . . , log(k/i)} are such that level(i) = level(j) and 2^{r+2} i = s. By Claim 5.7.5, d_in(C) ≤ d_in(A_i) + d_in(B_j) + 50i · 2^r. Therefore, we ought to prove that

c_in i log i + c_in j log j + 50i · 2^r ≤ c_in 2^{r+2} i log(2^{r+2} i).

As level(i) = level(j), j ≤ 2i, and so it suffices to verify that

3c_in i log(2i) + 50i · 2^r ≤ c_in 2^{r+2} i log i,

which holds since c_in ≥ 50 and r ≥ 0.

Moving on to Item 5 of Definition 5.7.2, consider i, j ∈ {g, 2g, . . . , k} such that level(i) > level(j) and i + j = s. Let C = A_i →•_{US(8j)} B_j. By Claim 5.7.6, d_in(C) ≤ d_in(A_i) + d_in(B_j) + 9c_samp j. It therefore suffices to show that

c_in i log i + c_in j log j + 9c_samp j ≤ c_in (i + j) log(i + j).

Rearranging, it suffices to verify that

9c_samp j ≤ c_in i log(1 + j/i).

Using the inequality log_2(1 + x) ≥ x/(1 + x), which holds for all x ≥ 0, it suffices to prove that

9c_samp j ≤ c_in i j / (i + j).

The above inequality holds as i ≥ j and c_in ≥ 18c_samp.

Consider now an MBS of the form C = A_i →•_{US(2^{r+1}·8j)−US(2^r·8j)} B_j where i, j ∈ {g, 2g, . . . , k} and r ∈ {0, 1, . . . , log(k/j)} are such that level(i) > level(j) and i + 2^{r+3} j = s. By Claim 5.7.6, d_in(C) ≤ d_in(A_i) + d_in(B_j) + c_samp 2^{r+5} j. Hence, we ought to prove that

c_in i log i + c_in j log j + c_samp 2^{r+5} j ≤ c_in (i + 2^{r+3} j) log(i + 2^{r+3} j).

Rearranging, it is sufficient to show that

c_in j log j + c_samp 2^{r+5} j ≤ c_in 2^{r+3} j log i,

which readily follows. The remaining MBSs in F_s(A, B), given by Item 6, follow a similar analysis and we omit the details.

In the following claim we turn to analyze dout(C) for MBSs C ∈ Fs(A,B).

Claim 5.7.12. Let A, B be a pair of (k, g, w)-LMRs that respect the out-function d_out(i) = 10k · level(i) + d for some integer d. Then, for every s ∈ {0, g, 2g, . . . , k} and MBS C ∈ F_s(A, B),

d_out(C) ≤ 10k · level(s) + d + 7c_samp g.

Proof. By inspecting the claims in Section 5.7.1, one can verify that if C ∈ F_s(A, B) is such that both i(C), j(C) are non-zero, then d_out(C) = max( d_out(A_i), d_out(B_j) ) whereas s ≥ max(i, j), in which case the proof readily follows. Hence, we only need to consider C such that at least one of i(C), j(C) is zero. Consider the MBS A_0 →◦_{BS(2g)} B_0. By Claim 5.7.2,

d_out(A_0 →◦_{BS(2g)} B_0) ≤ d_out(A_0) + 7g ≤ d + 7g.

As c_samp ≥ 1 and level(0) = 0, the proof for this MBS follows. The assertion for MBSs of the form C = A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0 readily follows, as by Claim 5.7.2, d_out(C) = d_out(A_0) ≤ d.

Moving to Item 2 of Definition 5.7.2, consider the MBS B_s ←◦_{US(g)} A_0. By Claim 5.7.3,

d_out(B_s ←◦_{US(g)} A_0) ≤ d_out(B_s) + 2c_samp g ≤ 10k · level(s) + d + 2c_samp g,

as desired. Let C = B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0 where r ∈ {0, 1, . . . , log(k/g)} and j ∈ {g, 2g, . . . , k} are such that j + 2^r g = s. By Claim 5.7.3, d_out(C) = d_out(B_j), which together with the fact that s ≥ j completes the proof for C. A similar argument proves the claim for MBSs from Item 3 of Definition 5.7.2 and we omit the details.

Claim 5.7.13. Let A, B be a pair of (k, g, w)-LMRs that respect the magnitude-function μ(i) = 2i/g. Then, for every s ∈ {0, g, 2g, . . . , k} and MBS C ∈ F_s(A, B), it holds that μ(C) ≤ 2s/g.

Proof. By Claim 5.7.2, the MBS A_0 →◦_{BS(2g)} B_0 is thin and so the assertion readily follows for it. Consider the MBS C = A_0 →•_{BS(2^{r+1}g)−BS(2^r g)} B_0 where r ∈ {1, . . . , log(k/g)} is such that s = (2^r − 1)g. The assertion for C follows as, by Claim 5.7.2, μ(C) ≤ 2.

By Claim 5.7.3, μ(B_s ←◦_{US(g)} A_0) ≤ μ(B_s) and so the claim readily follows in this case. Now, consider the MBS C = B_j ←•_{US(2^{r+1}g)−US(2^r g)} A_0 where r ∈ {0, 1, . . . , log(k/g)} and j ∈ {g, 2g, . . . , k} are such that j + 2^r g = s. By Claim 5.7.3,

μ(C) ≤ μ(B_j) + 2 ≤ 2j/g + 2 ≤ 2s/g,

where the last inequality holds as s ≥ j + g. A similar argument, based on Claim 5.7.4, can be used to analyze MBSs from Item 3 of Definition 5.7.2. Let i, j ∈ {g, 2g, . . . , k} be such that level(i) = level(j) and i + j = s. Denote C = A_i →•_{BS(8i)} B_j. By Claim 5.7.5, μ(C) ≤ μ(A_i) + μ(B_j) and so it suffices to verify that μ(A_i) + μ(B_j) ≤ 2(i + j)/g, which readily holds by the hypothesis.

Denote C = A_i →•_{BS(2^{r+1}·8i)−BS(2^r·8i)} B_j where i, j ∈ {g, 2g, . . . , k} and r ∈ {0, 1, . . . , log(k/i)} are such that level(i) = level(j) and 2^{r+2} i = s. By Claim 5.7.5, μ(C) ≤ μ(A_i) + μ(B_j) + 2. Hence, it suffices to prove that

μ(A_i) + μ(B_j) + 2 ≤ 2^{r+3} i / g.

As level(i) = level(j), j ≤ 2i and so, using the hypothesis, it suffices to show that

6i/g + 2 ≤ 2^{r+3} i / g,

which holds as r ≥ 0 and i ≥ g.

Consider the MBS C = A_i →•_{US(8j)} B_j where i, j ∈ {g, 2g, . . . , k} are such that level(i) > level(j) and i + j = s. By Claim 5.7.6, we have the same bound on μ(C) as we have for the MBS A_i →•_{BS(8i)} B_j, which we analyzed above, and so the exact same analysis can be used for it. Now, consider the MBS C = A_i →•_{US(2^{r+1}·8j)−US(2^r·8j)} B_j where i, j ∈ {g, 2g, . . . , k} and r ∈ {0, 1, . . . , log(k/j)} are such that level(i) > level(j) and i + 2^{r+3} j = s. By Claim 5.7.6, μ(C) ≤ μ(A_i) + μ(B_j) + 2. Therefore, it suffices to prove that

μ(A_i) + μ(B_j) + 2 ≤ 2(i + 2^{r+3} j)/g.

Using the hypothesis, it suffices to verify that

2j/g + 2 ≤ 2^{r+4} j / g,

which holds as j ≥ g and r ≥ 0. MBSs from Item 6 of Definition 5.7.2 follow a similar analysis. We omit the details.

5.8 The Multiplication Rule for Leveled Matrix Representations

In this section we define a product rule between a pair of LMRs A, B, which we denote by A · B, based on the definition of F(A, B) and its slices. Following the definition of A · B, we prove in Claim 5.8.1 that the product is indeed an LMR and show that it respects a certain out-function and magnitude-function. In Claim 5.8.2 we prove that ⟨A · B⟩ approximates ⟨A⟩⟨B⟩. The weight function of A · B is analyzed in Claim 5.8.3. Lastly, we collect all the results in Proposition 5.8.1.

Definition 5.8.1. Let A = ((1, A_0), . . . , (a_k, A_k)), B = ((1, B_0), . . . , (b_k, B_k)) be a pair of (k, g, w)-LMRs. We define the sequence C = A · B = ((c_0, C_0), . . . , (c_k, C_k)), where c_i ∈ ℝ and the C_i are MBSs, as follows. For s ∈ {0, g, 2g, . . . , k} let

m_s = max( a_{i(D)} b_{j(D)} | D ∈ F_s(A, B) ).

Define

C_s = glue( (a_{i(D)} b_{j(D)} / m_s) D | D ∈ F_s(A, B) )    and    c_s = |F_s(A, B)| · m_s.

For the glue operation above to be well defined, we assume that every D ∈ F_s(A, B) is padded so that all the d_out's and d_in's equal the respective maximum.

Claim 5.8.1. Let A, B be a pair of (k, g, w)-LMRs that respect the magnitude-function μ(i) = 2i/g and the out-function d_out(i) = 10k · level(i) + d for some integer d. Then, the sequence C is a (k, g, w)-LMR. Furthermore, C respects the out-function d′_out(i) = d_out(i) + 8c_samp g and the same magnitude-function μ(i) = 2i/g.

Proof. We start by proving that C is a (k, w)-matrix representation. First, by definition, c_s ≥ 0 for all s. Second, we ought to show that for all s ≥ 1, σ(C_s) ≥ s − (s − 1)τ. By Claim 5.7.9, for every D ∈ F_s(A, B), σ(D) ≥ s − (s − 1)τ. Claim 5.4.4 and Claim 5.4.2 then imply that

σ(C_s) = σ( glue( (a_{i(D)} b_{j(D)} / m_s) D | D ∈ F_s(A, B) ) )
       ≥ min( σ( (a_{i(D)} b_{j(D)} / m_s) D ) | D ∈ F_s(A, B) )
       = min( σ(D) + 2 log( m_s / (a_{i(D)} b_{j(D)}) ) | D ∈ F_s(A, B) )
       ≥ min( σ(D) | D ∈ F_s(A, B) )
       ≥ s − (s − 1)τ.

This proves that C is a (k, w)-matrix representation.


We turn to show that C is in fact a (k, g, w)-LMR. To this end, note that by Definition 5.7.2, C_0 = A_0 →◦_{BS(2g)} B_0. Hence, by Claim 5.7.2, C_0 is thin. Now, as c_0 = a_0 b_0 and since A, B are LMRs, we have that c_0 = 1. Moreover, by Claim 5.7.8, for every s not divisible by g, c_s = 0. Next, we ought to show that μ(C_s) ≤ s for all s ≥ 0. This clearly holds for s = 0 as C_0 is thin. Consider s ≥ g. By Claim 5.7.13 and by the hypothesis, for every D ∈ F_s(A, B), μ(D) ≤ 2s/g ≤ s. Therefore, by Claim 5.4.4 and Claim 5.4.2,

μ(C_s) = μ( glue( (a_{i(D)} b_{j(D)} / m_s) D | D ∈ F_s(A, B) ) )
       ≤ max( μ( (a_{i(D)} b_{j(D)} / m_s) D ) | D ∈ F_s(A, B) )
       ≤ max( μ(D) − 2 log( m_s / (a_{i(D)} b_{j(D)}) ) | D ∈ F_s(A, B) )
       ≤ max( μ(D) | D ∈ F_s(A, B) )
       ≤ 2s/g,

which is bounded by s, as desired. The above equation also proves that C respects the magnitude-function μ(s) = 2s/g. By Claim 5.7.12 and by the hypothesis, for every s ∈ {0, g, 2g, . . . , k} and MBS D ∈ F_s(A, B),

d_out(D) ≤ 10k · level(s) + d + 7c_samp g.

Claim 5.4.4 and Claim 5.7.10, together with the hypothesis g ≥ 10 log k, then imply that

d_out(C_s) ≤ 10k · level(s) + d + 7c_samp g + log |F_s(A, B)|
          ≤ 10k · level(s) + d + 7c_samp g + 4 log k
          ≤ 10k · level(s) + d + 8c_samp g.

Here, we used that |F_s(A, B)| = O((s/g)^2) ≤ k^2 (and that g is larger than a large enough constant). By the remark in Section 5.4.2, we may assume that the above holds with equality, namely,

d_out(C_s) = 10k · level(s) + d + 8c_samp g.    (5.33)

Thus, for every i, j ∈ {0, g, 2g, . . . , k}, if level(i) = level(j) then d_out(C_i) = d_out(C_j). Furthermore, if level(i) > level(j) then d_out(C_i) ≥ d_out(C_j) + 10k. To complete the proof, note that Equation (5.33) implies that C respects the out-function d′_out(i) = d_out(i) + 8c_samp g.

Claim 5.8.2. For every pair A, B of (k, g, w)-LMRs,

∥⟨A · B⟩ − ⟨A⟩⟨B⟩∥_max ≤ (k^3 + 8w) 2^{−k/2} ϑ(A) ϑ(B).

Proof. Write C = A · B = ((1, C_0), (c_g, C_g), . . . , (c_k, C_k)). By Claim 5.4.4 and Claim 5.4.2, for every s for which F_s(A, B) ≠ ∅,

⟨C_s⟩ = ⟨ glue( (a_{i(D)} b_{j(D)} / m_s) D | D ∈ F_s(A, B) ) ⟩
      = (1/|F_s(A, B)|) ∑_{D∈F_s(A,B)} ⟨ (a_{i(D)} b_{j(D)} / m_s) D ⟩
      = (1/|F_s(A, B)|) ∑_{D∈F_s(A,B)} (a_{i(D)} b_{j(D)} / m_s) ⟨D⟩.

Recall that c_s = |F_s(A, B)| · m_s and so

c_s ⟨C_s⟩ = ∑_{D∈F_s(A,B)} a_{i(D)} b_{j(D)} ⟨D⟩.

Thus, if we denote F_{≤k}(A, B) = ∪_{s=0}^{k} F_s(A, B) then

⟨C⟩ = ∑_{s=0}^{k} c_s ⟨C_s⟩ = ∑_{D∈F_{≤k}(A,B)} a_{i(D)} b_{j(D)} ⟨D⟩.

Note that, by linearity,

∑_{D∈F(A,B)} a_{i(D)} b_{j(D)} ⟨D⟩ = ∑_{i,j} a_i b_j S_{i,j},

and so, if we denote F_{>k}(A, B) = F(A, B) \ F_{≤k}(A, B) then

∑_{i,j} a_i b_j S_{i,j} − ⟨C⟩ = ∑_{D∈F_{>k}(A,B)} a_{i(D)} b_{j(D)} ⟨D⟩.

As |F_{>k}(A, B)| ≤ |F(A, B)| ≤ k^3 (there are at most k different values of s and, as we saw before, |F_s(A, B)| ≤ k^2) and since, by Claim 5.4.1, ∥D∥_max ≤ ∥D∥_∞ ≤ 2^{−k/2} for every D with σ(D) > k, we have that

∥ ⟨C⟩ − ∑_{i,j} a_i b_j S_{i,j} ∥_max ≤ ∥ ∑_{D∈F_{>k}(A,B)} a_{i(D)} b_{j(D)} ⟨D⟩ ∥_max
                                    ≤ ∑_{D∈F_{>k}(A,B)} a_{i(D)} b_{j(D)} ∥D∥_max
                                    ≤ k^3 ϑ(A) ϑ(B) 2^{−k/2}.

The proof then follows as, by Claim 5.7.1,

∥ ⟨A⟩⟨B⟩ − ∑_{i,j} a_i b_j S_{i,j} ∥_max ≤ 8w ϑ(A) ϑ(B) 2^{−k}.

Claim 5.8.3. Let A = ((1, A_0), . . . , (a_k, A_k)), B = ((1, B_0), . . . , (b_k, B_k)) be a pair of (k, g, w)-LMRs that respect the weight-function ϑ(s) = (s/g)^{(3s/g)t} for some t ≥ 0 (implicitly, we are assuming that ϑ(0) = 1). Then, A · B respects the weight-function c′(s/g)^{(3s/g)(t+1)}, where c′ is a large enough constant.

Proof. Write C = A · B = ((c_0, C_0), . . . , (c_k, C_k)) (with c_0 = 1). Let s ≥ g. Recall that

c_s = |F_s(A, B)| · max( a_{i(D)} b_{j(D)} | D ∈ F_s(A, B) ).

By inspecting the MBSs in Definition 5.7.2, one can see that i(D) + j(D) ≤ s for every D ∈ F_s(A, B). Moreover, by Claim 5.7.10, |F_s(A, B)| = O((s/g)^2). We assume, for simplicity, that the bound is c′(s/g)^3, where c′ is a large enough constant. Thus,

c_s ≤ c′(s/g)^3 max( ϑ(i) ϑ(j) | i + j ≤ s )
    ≤ c′(s/g)^3 max( (i/g)^{(3i/g)t} (j/g)^{(3j/g)t} | i + j ≤ s )
    ≤ c′(s/g)^3 max( (s/g)^{(3i/g)t} (s/g)^{(3j/g)t} | i + j ≤ s )
    = c′(s/g)^3 (s/g)^{(3s/g)t}
    ≤ c′(s/g)^{(3s/g)(t+1)},

where for the last inequality we used the fact that s ≥ g.

We summarize the results obtained so far in the following proposition.

Proposition 5.8.1. Let k, g, w be integers where k ≥ g ≥ 10(ω + log k) (in fact, we assume k ≥ c · g for a large enough constant c). Let A = ((1, A_0), . . . , (a_k, A_k)), B = ((1, B_0), . . . , (b_k, B_k)) be a pair of (k, g, w)-LMRs. Assume that both A, B respect (d_out, d_in, μ, ϑ), where

d_out(s) = 10k · level(s) + d,
d_in(s) = c_in s log s,
μ(s) = 2s/g,
ϑ(s) = c′^t (s/g)^{(3s/g)t},

for some d, t ≥ 0, where the constant c_in is as defined in Claim 5.7.11. Then, A · B is a (k, g, w)-LMR that respects (d′_out, d_in, μ, ϑ′) where

d′_out(s) = d_out(s) + 8c_samp g,
ϑ′(s) = c′^{t+1} (s/g)^{(3s/g)(t+1)}.


Moreover,

∥⟨A · B⟩ − ⟨A⟩⟨B⟩∥_max ≤ (k^3 + w)(k/g)^{(8k/g)(t+1)} 2^{−k/2}.

Proof. Write C = A · B = ((1, C_0), . . . , (c_k, C_k)). As d_out, μ satisfy the hypothesis of Claim 5.8.1, the fact that A, B are LMRs implies that C is also an LMR, and that C respects the out-function d′_out and the magnitude-function μ. By Claim 5.7.11, for every D ∈ F_s(A, B) it holds that d_in(D) ≤ d_in(s). Recall that

C_s = glue( (a_{i(D)} b_{j(D)} / m_s) D | D ∈ F_s(A, B) ).

Therefore, by Claim 5.4.4, d_in(C_s) = max( d_in(D) | D ∈ F_s(A, B) ), which is bounded above by d_in(s), as desired. The assertion that C respects the weight-function ϑ′ readily follows by Claim 5.8.3. Lastly, by Claim 5.8.2 (and assuming k/g ≥ c′),

∥⟨C⟩ − ⟨A⟩⟨B⟩∥_max ≤ (k^3 + 8w) ϑ(A) ϑ(B) 2^{−k/2} ≤ (k^3 + w)(k/g)^{(8k/g)(t+1)} 2^{−k/2}.

5.8.1 Multiplying a sequence of LMRs

We start by introducing some notation. Let Ā be a w × w stochastic matrix. Let A be a stochastic matrix approximating Ā such that ∥A − Ā∥_∞ ≤ ε/(2n) and A = E_{j∼[poly(wn/ε)]} A_j, where each A_j is a 0-1 stochastic matrix (such an approximation can easily be found by truncating each entry of Ā to O(log(wn/ε)) bits after the decimal point). We define a sequence of poly(wn/ε) (0, w)-matrix bundles A_j = ((1, A_j)); the (O(log(wn/ε)), 0, w)-MBS A = (A_1, A_2, . . .); and the matrix representation canon(A) = ((1, A)). Note that A is thin and ⟨canon(A)⟩ = A. Moreover, we may regard canon(A) as a (k, g, w)-LMR for any k, g ≥ 1.
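The following Python sketch illustrates one simple way to carry out such a truncation and 0-1 decomposition. It is only an illustration under the stated assumptions (entries rounded down to a fixed number of bits, with the rounding slack dumped onto the first column), not the thesis's exact construction.

    import numpy as np

    def truncate_stochastic(A_bar, bits):
        """Round each entry of A_bar down to 'bits' bits; add the slack to the
        first column so that every row still sums to exactly 1."""
        scale = 2 ** bits
        A = np.floor(A_bar * scale) / scale
        A[:, 0] += 1.0 - A.sum(axis=1)
        return A

    def zero_one_decomposition(A, bits):
        """Return 0-1 stochastic matrices A_1, ..., A_{2^bits} whose average is A.
        Row i of A_j has its single 1 in the column whose mass covers slot j."""
        w = A.shape[0]
        scale = 2 ** bits
        cum = np.rint(np.cumsum(A * scale, axis=1)).astype(int)
        mats = []
        for j in range(scale):
            M = np.zeros((w, w))
            for i in range(w):
                col = int(np.searchsorted(cum[i], j, side="right"))
                M[i, col] = 1.0
            mats.append(M)
        return mats

    A_bar = np.array([[0.3, 0.7], [0.55, 0.45]])
    A = truncate_stochastic(A_bar, bits=4)
    mats = zero_one_decomposition(A, bits=4)
    assert np.allclose(sum(mats) / len(mats), A)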

Let h ≥ 0 be an integer and write n = 2^h. We want to approximate the product of n stochastic


matrices Ā_1, Ā_2, . . . , Ā_n. Let A_1, . . . , A_n be the corresponding sequence of w × w stochastic matrices as defined above. Firstly, ∥∏_{i=1}^{n} Ā_i − ∏_{i=1}^{n} A_i∥_max ≤ ε/2. Next, we approximate ∏_{i=1}^{n} A_i.

Let T be the complete rooted binary tree of depth h. We label every node u of T by a matrix representation, which we denote by A_u. The i'th leaf of the tree, counting from the left, is labeled by canon(A_i). Then, inductively over the depth, if u is the parent of the nodes v, w, we define A_u = A_v · A_w. For a node u in T, define the matrix A_u to be the product of the matrices A_i associated to the leaves in the subtree rooted at u.
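For reference, the exact quantity that each node label approximates is the plain divide-and-conquer product over T. The following Python sketch (illustrative only, with arbitrary random input matrices) computes it; the line marked "A_v · A_w" is where the thesis instead multiplies the LMR labels of the two children.

    import numpy as np

    def tree_product(mats):
        """Multiply a list of 2^h stochastic matrices by recursing on the binary tree."""
        if len(mats) == 1:
            return mats[0]
        mid = len(mats) // 2
        left = tree_product(mats[:mid])    # product over the left subtree's leaves
        right = tree_product(mats[mid:])   # product over the right subtree's leaves
        return left @ right                # the exact analogue of A_v · A_w on LMRs

    rng = np.random.default_rng(0)
    mats = [rng.random((3, 3)) for _ in range(8)]
    mats = [M / M.sum(axis=1, keepdims=True) for M in mats]  # make rows stochastic
    assert np.allclose(tree_product(mats), np.linalg.multi_dot(mats))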

Claim 5.8.4. For every ℓ ≥ 0 and every node u of height ℓ in T it holds that

∥⟨A_u⟩ − A_u∥_max ≤ (k^3 + w)(k/g)^{(8k/g)(ℓ+1)} 2^{−k/2}.

Moreover, A_u is an LMR that respects (d_out, d_in, μ, ϑ) where

d_out(s) = 10k · level(s) + 8c_samp g ℓ;
d_in(s) = c_in s log s;
μ(s) = 2s/g;
ϑ(s) = c′^ℓ (s/g)^{(3s/g)ℓ}.

Proof. The proof is by a straightforward induction. The base case ℓ = 0 readily holds (as k ≥ c_4 log(wn/ε) for a large enough constant c_4). As for the inductive step, the fact that the respective matrix representation is an LMR that respects (d_out, d_in, μ, ϑ) as defined above readily follows by the inductive hypothesis and by Proposition 5.8.1. For a node u, let ε(u) = ∥⟨A_u⟩ − A_u∥_max. Let u be a node in level ℓ > 0 and v, w its left and right children, respectively. Then,

ε(u) = ∥⟨A_u⟩ − A_u∥_max = ∥⟨A_v · A_w⟩ − A_v A_w∥_max
     ≤ ∥⟨A_v · A_w⟩ − ⟨A_v⟩⟨A_w⟩∥_max + ∥⟨A_v⟩⟨A_w⟩ − A_v A_w∥_max
     ≤ (k^3 + w)(k/g)^{(8k/g)ℓ} 2^{−k/2} + ∥⟨A_v⟩⟨A_w⟩ − A_v A_w∥_max,    (5.34)

where the last inequality follows by Proposition 5.8.1 and by the induction hypothesis. As for the second summand in Equation (5.34),

∥⟨A_v⟩⟨A_w⟩ − A_v A_w∥_max ≤ ∥⟨A_v⟩⟨A_w⟩ − A_v A_w∥_∞
                           ≤ ∥⟨A_v⟩⟨A_w⟩ − ⟨A_v⟩A_w∥_∞ + ∥⟨A_v⟩A_w − A_v A_w∥_∞
                           ≤ ∥⟨A_v⟩∥_∞ ∥⟨A_w⟩ − A_w∥_∞ + ∥A_w∥_∞ ∥⟨A_v⟩ − A_v∥_∞.    (5.35)

Consider the first summand. As A_v is stochastic, we have that

∥⟨A_v⟩∥_∞ ∥⟨A_w⟩ − A_w∥_∞ = ∥⟨A_v⟩ − A_v + A_v∥_∞ ∥⟨A_w⟩ − A_w∥_∞
                          ≤ ( ∥⟨A_v⟩ − A_v∥_∞ + ∥A_v∥_∞ ) ∥⟨A_w⟩ − A_w∥_∞
                          = ( ∥⟨A_v⟩ − A_v∥_∞ + 1 ) ∥⟨A_w⟩ − A_w∥_∞
                          = (ε(v) + 1) ε(w).

As A_w is stochastic, the second summand on the right hand side of Equation (5.35) is bounded above by ε(v). Thus,

∥⟨A_v⟩⟨A_w⟩ − A_v A_w∥_max ≤ (ε(v) + 1)ε(w) + ε(v) ≤ 2(ε(v) + ε(w)).

Plugging this into Equation (5.34), and using the induction hypothesis, we get

ε(u) ≤ 2(ε(v) + ε(w)) + (k^3 + w)(k/g)^{(8k/g)ℓ} 2^{−k/2}
     ≤ 5(k^3 + w)(k/g)^{(8k/g)ℓ} 2^{−k/2}
     ≤ (k^3 + w)(k/g)^{(8k/g)(ℓ+1)} 2^{−k/2},


where the last inequality holds as k ≥ 2g.

As a corollary of Claim 5.8.4 we get that

Corollary 5.8.2. There exist universal constants c_1, c_2 ≥ 1 such that the following holds. Let n, w be integers and ε > 0 such that ε < 1/n^2. Set

g = c_1 ( log(n) · log( log(1/ε)/log n ) + log w + log log(1/ε) ),
k = c_2 ( g + log(w/ε) ).

Let r be the root of T. Then,

∥⟨A_r⟩ − ∏_{i=1}^{n} A_i∥_max ≤ ε/2.

Moreover, write A_r = ((1, A_0), (a_g, A_g), . . . , (a_k, A_k)). Then, for every s ∈ {0, g, . . . , k},

d_out(A_s) + d_in(A_s) = O( log(w/ε) log log(w/ε) + log^2(n) · log( log(1/ε)/log n ) + log n · log w ).

Proof. First, we show that Equation (5.29) is satisfied by our choice of k, g. Indeed, by taking any c_2 ≥ 1, we get k ≥ g. Furthermore, by taking c_1 ≥ 40, we get that g ≥ 20ω. Therefore, it suffices to verify that g ≥ 20 log k, which is guaranteed to hold assuming c_1 ≥ 40.

By Claim 5.8.4 applied to the root r of T,

∥⟨A_r⟩ − ∏_{i=1}^{n} A_i∥_max ≤ (k^3 + w)(k/g)^{(8k/g)(log n + 1)} 2^{−k/2}.

First, we show that

(k/g)^{(8k/g)(log n + 1)} ≤ 2^{k/4}.    (5.36)

By rearranging, it suffices to show that

g ≥ 32(log(n) + 1) log(k/g).    (5.37)


Now,

k/g = c_2 (g + log(w/ε)) / g = c_2 (1 + log(w/ε)/g).

As ε < 1/n^2, g ≥ c_1(log w + log n), and so

k/g ≤ c_2 ( 1 + log(w/ε)/(c_1(log w + log n)) ) ≤ c_2 ( 1 + log(1/ε)/log n ) ≤ 2c_2 log(1/ε) / log n.    (5.38)

Hence, to prove Equation (5.37), it suffices to show that

g ≥ 32(log(n) + 1) log( 2c_2 log(1/ε) / log n ).

The above equation holds assuming that

(c_1/32) · log( log(1/ε)/log n ) ≥ log( 2c_2 log(1/ε)/log n ),

which holds by choosing the constants c_1, c_2 such that c_1 ≥ 64 + 32 log c_2, which is consistent with the restrictions imposed so far.

Now that we have proved Equation (5.36), we have that

∥⟨A_r⟩ − ∏_{i=1}^{n} A_i∥_max ≤ (k^3 + w) 2^{−k/4}.

For large enough k, the RHS is bounded by w 2^{−k/5}. As k ≥ c_2 log(w/ε), by taking c_2 ≥ 10 we get w 2^{−k/5} ≤ ε/2, as desired.


Moving to the moreover part, by Claim 5.8.4, for every s ∈ {0, g, 2g, . . . , k},

d_out(A_s) = 10k · level(s) + 8c_samp g log n = O(k log k + g log n).

Note that

log(n) · log( log(1/ε)/log n ) = O(log(1/ε)),

and so k = O(log(w/ε)), which yields

d_out(A_s) = O( log(w/ε) log log(w/ε) + g log n )
           = O( log(w/ε) log log(w/ε) + log^2(n) · log( log(1/ε)/log n ) + log n · log w ).

Note that d_in is dominated by d_out as d_in(A_s) = O(s log s) = O(d_out(A_s)).
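As a rough numerical illustration, the Python sketch below evaluates the formulas for g, k and the d_out + d_in bound from Corollary 5.8.2. The constants c_1, c_2 are placeholders chosen only for illustration (the corollary merely asserts that suitable constants exist), so the numbers produced are not the thesis's actual values.

    import math

    def parameters(n, w, eps, c1=40, c2=10):     # c1, c2: hypothetical choices
        assert eps < 1 / n**2
        lg = math.log2
        g = c1 * (lg(n) * lg(lg(1 / eps) / lg(n)) + lg(w) + lg(lg(1 / eps)))
        k = c2 * (g + lg(w / eps))
        # order of d_out + d_in at the root, as in the corollary
        bound = (lg(w / eps) * lg(lg(w / eps))
                 + lg(n) ** 2 * lg(lg(1 / eps) / lg(n))
                 + lg(n) * lg(w))
        return g, k, bound

    print(parameters(n=2**10, w=16, eps=2**-40))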

5.8.2 Proof of Theorem 5.3.1

In this section we deduce Theorem 5.3.1.

Proof of Theorem 5.3.1. The pseudo-distribution D is induced in a natural way from the multiplication rule between LMRs. As in Section 5.1.1, for any width-w, length-n branching program P, we can represent the transition between a pair of consecutive layers P_t, P_{t+1} of the program by a w × w stochastic matrix M_t, which is an average of two 0-1 stochastic matrices M_t^0 and M_t^1 representing the transitions when the t-th bit is 0 and 1, respectively. And just as sparsification of the matrix product gave us a PRG for ROBPs, the above-mentioned process of multiplying a sequence of LMRs gives us a PRPG for ROBPs.

To be more precise, let A = ((1, A_0), . . . , (a_k, A_k)) be the final LMR at the root of the tree described in Section 5.8.1 when multiplying the matrices M_1, . . . , M_n. For every i ∈ {0, 1, . . . , k}, let A_i = (A_{i,1}, . . . , A_{i,2^{d_out(A_i)}}) and, for every j ∈ [2^{d_out(A_i)}], A_{i,j} = ((α_{i,j,1}, A_{i,j,1}), . . . , (α_{i,j,2^{d_in(A_i)}}, A_{i,j,2^{d_in(A_i)}})). It is easy to see that, because we started with the 0-1 stochastic matrices M_t^0, M_t^1 in the matrix bundles at the leaves, every A_{i,j,m} is a 0-1 stochastic matrix and corresponds to a single n-length path in the branching program, say p_{i,j,m}. This can be seen by induction on the levels of the tree; the innermost matrices at level ℓ correspond to paths of length 2^ℓ.

Thus, the PRPG is the sequence ((ρ_{i,j,m}, p_{i,j,m})) over i ∈ {0, 1, . . . , k}, j ∈ [2^{d_out(A_i)}], m ∈ [2^{d_in(A_i)}], where ρ_{i,j,m} = a_i · 2^{−d_out(A_i)} · α_{i,j,m}. By following the matrix products, it is easy to see that the weights and coefficients depend only on the types of products used (starting with all coefficients being 1 at the leaves) and not on the entries of M_t^0 and M_t^1; hence, the PRPG is input-oblivious and does not depend on the ROBP. Next, each bit of the string corresponding to the path p_{i,j,m} can be computed knowing the definitions of the matrix products and the corresponding samplers, inductively going down the tree (the information can be calculated from the indices i, j, m).
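The following Python sketch shows, for a small hypothetical layer, how the 0-1 stochastic matrices M_t^0, M_t^1 and their average M_t are formed. The transition tables delta0, delta1 below are made-up examples, not taken from any particular ROBP.

    import numpy as np

    w = 4
    delta0 = [1, 2, 3, 0]   # state i -> delta0[i] when the t-th bit is 0 (example layer)
    delta1 = [0, 0, 2, 2]   # state i -> delta1[i] when the t-th bit is 1 (example layer)

    def transition_matrix(delta, w):
        M = np.zeros((w, w))
        for i, j in enumerate(delta):
            M[i, j] = 1.0          # a single 1 per row: a 0-1 stochastic matrix
        return M

    M0, M1 = transition_matrix(delta0, w), transition_matrix(delta1, w)
    M = (M0 + M1) / 2              # M_t: the average transition over a uniform bit
    assert np.allclose(M.sum(axis=1), 1.0)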

As the samplers that we use for the product between LMRs are log-space computable (logarithmic in the size of the bipartite graph), one can see that D is

O( log(w/ε) log log(w/ε) + log^2(n) · log( log(1/ε)/log n ) + log n · log w )-space computable.

The seed length, which is given by

log( ∑_{i=0}^{k} 2^{d_in(A_i)+d_out(A_i)} ) ≤ d_in(A_k) + d_out(A_k) + log k,

readily follows by Corollary 5.8.2.

readily follows by Corollary 5.8.2.

As for the bound on the weights of D, note that the ρi’s in D are obtained by multiplying the

weights of A with the coefficients of the MBSs composing A. It is easy to verify that the coeffi-

cients are all bounded above by 1 in absolute value. Therefore, it suffices to bound the weights of

A. By Claim 5.8.4, ϑ(k) ≤ c′ log(n)(k/g)(3k/g) log n, and so

log ϑ(k) ≤ 3k log ng

log(k/g) + c′ log(n)

≤(log n+

log n · log(w/ε)g

)3c2 log(k/g) + c′ log(n).


By Equation (5.38), k/g ≤ 2c_2 log(1/ε)/log n. Let t = log( 2c_2 log(1/ε)/log n ). Then,

log ϑ(k) ≤ 3c_2 ( log n + log n · log(w/ε)/g ) t + c′ log(n) = O( log(1/ε) + log n · log(w/ε) · t / g ).

As g = Ω(t log n + log w), we have that

log ϑ(k) = O( log(1/ε) + log(w/ε) ) = O(log(w/ε)),

which completes the proof.


Without leaps of imagination, or dreaming, we lose the excitement of possibilities. Dreaming, after all, is a form of planning.

Gloria Steinem

6 Space Complexity of the Mirror Game

The results in this chapter are based on joint work with Jon Schneider [GS18]. In this chapter,

we focus on the implications of space-bounded computation for mirror games.

6.1 The Mirror Game and Mirror Strategies

Consider the following simple game. Alice and Bob take turns saying numbers belonging to

the set {1, 2, . . . ,N}. If either player says a number that has previously been said, they lose.

Otherwise, after N turns, all the numbers in the set have been spoken aloud, and both players

win. Alice says the first number.

If N is even, there is a very simple and computationally efficient strategy that allows Bob to win this game, regardless of Alice's strategy: whenever Alice says x, Bob replies with N + 1 − x. This is an example of a mirror strategy (and for this reason, we refer to the game above as the mirror game). Mirror strategies are an effective tool for figuring out who wins in a variety of combina-

torial games (for example, two-pile Nim [BCG03]). More practically, mirror strategies can be

applied when playing more complex games, such as chess or go (to varying degrees of success).

From a computational perspective, mirror strategies are interesting as they require very limited

computational resources - most mirror strategies can be described via a simple transformation of

the preceding action.
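To make the O(log N) claim concrete, here is a minimal Python sketch of Bob's mirror strategy together with a tiny simulation (the function names are ours, purely for illustration); Bob stores nothing beyond N and Alice's latest move.

```python
def bob_mirror_reply(N: int, alice_move: int) -> int:
    """Bob's O(log N)-space mirror strategy for even N:
    reply to Alice's move x with N + 1 - x."""
    return N + 1 - alice_move


def simulate(N: int) -> bool:
    """Tiny simulation: Alice plays arbitrarily (here, the smallest unsaid
    number); Bob mirrors.  For even N neither player ever repeats a number."""
    said = set()
    while len(said) < N:
        a = min(x for x in range(1, N + 1) if x not in said)  # Alice's move
        if a in said:
            return False
        said.add(a)
        b = bob_mirror_reply(N, a)                            # Bob's mirrored move
        if b in said:
            return False
        said.add(b)
    return True


assert simulate(10)  # both players win when N is even
```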

Returning to the game above, this leads to the following natural question: does Alice have a simple strategy to avoid losing when N is even? Since both players have access to the same set of actions, one may be tempted to believe that the answer is yes; in fact, if N is odd, then Alice can start by saying the number N and then adopt the mirror strategy for Bob described above for a set of N − 1 elements. However, when N is even, the mirror strategy as stated does not work.

To answer the above question, we need to formalize what we mean by simple; after all, Alice has plenty of strategies such as "say the smallest number which has not yet been said". One useful metric of simplicity is space complexity: the amount of memory a player requires to implement their strategy (we formalize this in Section 6.1.1). Note that Bob needs only O(log N) bits of memory to implement his mirror strategy, whereas the naive strategy for Alice (remembering everything) requires O(N) bits.

In this chapter, we show that this gap is necessary; any successful, deterministic strategy for

Alice requires at least Ω(N) bits of memory:

Theorem 6.1.1. [Restatement of Theorem 6.4.1] If N is even, then any winning strategy for Alice in the mirror game requires at least $(\log_2 5 - 2)N - o(N)$ bits of space.

6.1.1 Bounded-Memory Strategies

With unbounded memory, it is clear that both players can easily win the mirror game. We therefore restrict our attention to strategies with bounded memory. A strategy computable in memory


m for Alice in the mirror game is defined by an initial memory state $s_0 \in [2^m]$ and a pair of transition functions $f : [N]\times[2^m]\times[N] \to [2^m]$ and $g : [2^m] \to [N]$, both computable in SPACE(m) (see [AB09] for an introduction to bounded space complexity); recall that $[N] = \{1, 2, \ldots, N\}$. The function f takes in the previous reply $b \in [N]$ of Bob, Alice's current memory state $s \in [2^m]$, and the current turn $t \in [N]$, and returns Alice's new memory state $s' \in [2^m]$; the function g takes in Alice's current memory state $s \in [2^m]$ (after updating it based on Bob's move), and outputs her next move a. We say a strategy for Alice is a winning strategy if Alice is guaranteed to win regardless of Bob's choice of actions.
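To fix ideas, the following is a minimal Python sketch of this interface (the names AliceStrategy and naive_strategy are ours, purely illustrative); it encodes the naive "say the smallest number which has not yet been said" strategy, which indeed uses m = N bits of memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AliceStrategy:
    """A bounded-memory strategy for Alice, following the definition above."""
    s0: int                               # initial memory state in [2^m]
    f: Callable[[int, int, int], int]     # (bob_reply, state, turn) -> new state
    g: Callable[[int], int]               # state -> Alice's next move


def naive_strategy(N: int) -> AliceStrategy:
    """The naive strategy: the state is an N-bit mask of numbers said so far
    (so m = N).  Alice's move from state s is g(s), hence f can recompute her
    previous move as g(state) and fold it, with Bob's reply, into the new state."""
    def g(state: int) -> int:
        for x in range(1, N + 1):                  # smallest number not yet marked
            if not (state >> (x - 1)) & 1:
                return x
        raise RuntimeError("no number left to say")

    def f(bob_reply: int, state: int, turn: int) -> int:
        return state | (1 << (g(state) - 1)) | (1 << (bob_reply - 1))

    return AliceStrategy(s0=0, f=f, g=g)
```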

By removing the constraint that the transition function f is computable in SPACE(m), we obtain a larger class of strategies, which we refer to as strategies weakly computable in memory m. The lower bounds in Theorems 6.4.1 and 6.5.2 continue to hold for this larger class of strategies.

6.2 Eventown and Oddtown

While many tools exist in the literature for showing space lower bounds (e.g. communication complexity, information theory), one interesting feature of this problem absent from many others is that any proof of Theorem 6.1.1 must depend crucially on the parity of N.

In the study of set families in extremal combinatorics, an "Oddtown" is a collection of subsets of {1, 2, . . . , N} where every subset has even cardinality but each pair of distinct subsets has an intersection of odd cardinality. Likewise, an "Eventown" is a collection of subsets of {1, 2, . . . , N} where every subset has even cardinality but each pair of distinct subsets has an intersection of even cardinality. In 1969, Berlekamp [Ber69], answering a question of Erdős, showed that while there exist Eventowns containing up to $2^{N/2}$ subsets, any Oddtown contains at most N subsets.

It turns out that this exponential gap between the size of Oddtowns and the size of Eventowns is directly responsible for the exponential gap between the space complexity of Alice's strategy and the space complexity of Bob's strategy. One way to see this connection is that, just as Bob's O(log N)-space strategy involves pairing up the numbers of {1, 2, . . . , N}, one way to construct an Eventown of size $2^{N/2}$ is to perform a similar pairing, and then consider all subsets formed by


unions of pairs.

The Eventown-Oddtown theorem figures into the proof of Theorem 6.1.1 in the following way. At a given turn, for each possible state of memory of Alice, label the state with the subsets of numbers that could possibly have been said before this turn. We show (Lemma 6.4.4) that these subsets must form an Oddtown, or else Bob has some strategy that can force Alice to lose. Since there is a large (exponential) number of total possible subsets (Corollary 6.4.2), and since each Oddtown contains at most N subsets, this implies that Alice's memory must be large.

Next, we review the known literature on the Eventown-Oddtown problem. Note that while

in the introduction we only defined the terms “Eventown” and “Oddtown”, there are actually

four different classes of set system depending on the parity of cardinalities of the subsets and the

parity of the cardinality of the intersection.

Definition 6.2.1. A collection $\mathcal{F}$ of subsets of [N] forms an (Odd, Even)-town of sets if:

1. For every $F \in \mathcal{F}$, $|F| \equiv 1 \pmod 2$.

2. For every $F_1 \neq F_2 \in \mathcal{F}$, $|F_1 \cap F_2| \equiv 0 \pmod 2$.

We define (Odd, Odd)-towns, (Even, Odd)-towns, and (Even, Even)-towns similarly.

Note that there exist (Even, Even)-towns and (Odd, Odd)-towns containing exponentially (in N) many sets; one simple construction of an (Even, Even)-town is to partition the ground set into pairs (possibly with a leftover element), and consider all sets formed by taking unions of these pairs. In contrast, (Even, Odd)-towns and (Odd, Even)-towns each contain at most N sets.
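A small self-contained sketch (ours, purely illustrative) of the pair construction just described, checked for a small ground set:

```python
from itertools import combinations


def pair_eventown(N: int):
    """Pair up {1, ..., N} (dropping a leftover element if N is odd) and return
    all unions of pairs; this gives an (Even, Even)-town of 2**(N // 2) sets."""
    pairs = [(2 * i + 1, 2 * i + 2) for i in range(N // 2)]
    family = []
    for k in range(len(pairs) + 1):
        for chosen in combinations(pairs, k):
            family.append(frozenset(x for p in chosen for x in p))
    return family


F = pair_eventown(8)
assert len(F) == 2 ** 4
assert all(len(S) % 2 == 0 for S in F)                           # even cardinalities
assert all(len(S & T) % 2 == 0 for S in F for T in F if S != T)  # even intersections
```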

Lemma 6.2.1. Any (Odd, Even)-town $\mathcal{F}$ has size at most N, i.e., $|\mathcal{F}| \le N$ [Ber69, Sza11].

For completeness, we give the proof below.

Proof. Let $\mathcal{F} = \{F_1, F_2, \ldots, F_k\}$. Let $v_i \in \{0,1\}^N$ be the characteristic vector of $F_i$, i.e., $v_i[l] = 1 \iff l \in F_i$. We know that $v_i^T v_j \bmod 2$ equals 1 if $i = j$ and 0 otherwise. We claim that $v_1, v_2, \ldots, v_k$ are linearly independent over $\mathbb{F}_2$ (where $\mathbb{F}_2$ is the field with 2 elements)


and hence $k \le N$. We prove this claim by contradiction. If they are not linearly independent, then there exist $\lambda_1, \lambda_2, \ldots, \lambda_k \in \{0, 1\}$, not all zero, such that $\sum_{i=1}^{k} \lambda_i v_i \equiv 0 \pmod 2$. Without loss of generality, assume $\lambda_1 \neq 0$. As $v_1^T\big(\sum_{i=1}^{k} \lambda_i v_i\big) = \lambda_1$ over $\mathbb{F}_2$, this implies $\lambda_1 = 0$; a contradiction.
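As a quick sanity check of this linear-algebra argument (a small self-contained sketch; gf2_rank is our own helper, not code from the thesis), one can verify over $\mathbb{F}_2$ that the characteristic vectors of a maximum-size (Odd, Even)-town, the N singletons, are linearly independent:

```python
def gf2_rank(rows):
    """Rank over F2 of 0/1 vectors encoded as Python ints (bitmasks),
    computed by straightforward Gaussian elimination."""
    rows = list(rows)
    rank = 0
    nbits = max(rows).bit_length() if rows else 0
    for col in range(nbits):
        pivot = next((i for i in range(rank, len(rows)) if (rows[i] >> col) & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and (rows[i] >> col) & 1:
                rows[i] ^= rows[rank]
        rank += 1
    return rank


N = 6
oddtown = [{i} for i in range(1, N + 1)]                 # odd sizes, even (empty) intersections
vecs = [sum(1 << (x - 1) for x in S) for S in oddtown]   # characteristic vectors
assert gf2_rank(vecs) == len(oddtown)                    # independent, so |F| <= N
```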

Corollary 6.2.2. Any (Even, Odd)-town contains at most N sets.

Proof. We prove an upper bound of N + 1 sets by reduction to Lemma 6.2.1. Given an (Even, Odd)-town $\mathcal{B} = \{B_1, \ldots, B_m\}$, construct an (Odd, Even)-town over the ground set [N + 1] from the sets $B_1 \cup \{N+1\}, B_2 \cup \{N+1\}, \ldots, B_m \cup \{N+1\}$. Using Lemma 6.2.1, we get $m \le N + 1$. For a stronger upper bound of N, refer to [Ber69, Sza11].
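The reduction can be checked mechanically on a tiny example (our own illustrative sketch):

```python
def is_town(family, size_parity, inter_parity):
    """Check the '(size, intersection)-town' conditions for a family of sets."""
    return (all(len(S) % 2 == size_parity for S in family) and
            all(len(S & T) % 2 == inter_parity
                for S in family for T in family if S != T))


# A small (Even, Odd)-town on [4], and the reduction from the proof above.
B = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})]
assert is_town(B, 0, 1)
lifted = [S | {5} for S in B]       # add the new element N + 1 = 5 to every set
assert is_town(lifted, 1, 0)        # now an (Odd, Even)-town over [5]
```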

6.3 Randomized Strategies for Alice

A natural followup to Theorem 6.1.1 is whether these lower bounds continue to hold if Alice instead uses a randomized strategy, which only needs to succeed with high probability.

We provide some evidence in [GS18] to show that this might not be the case. We demonstrate an $O(\sqrt{N}\log^2 N)$-space algorithm for Alice that succeeds with probability $1 - O(1/N)$, as long as Alice is provided access to a uniformly chosen perfect matching on the complete graph on N vertices ($K_N$). In addition, even with $O(\log N)$ space (and access to a uniformly chosen perfect matching on $K_N$), Alice can guarantee success with probability $\Omega(1/N)$ (see [GS18] for the details). In fact, [Fei19] modified Alice's randomized strategy of [GS18] so as to use only $O(\log^3 N)$ memory bits.

In both of the algorithms mentioned above, Alice attempts the mirror strategy of Bob, hoping that Bob does not choose the number with no match. In the $O(\sqrt{N}\log^2 N)$ algorithm, Alice decreases her probability of failure by maintaining a set of $O(\sqrt{N})$ possible "backup" points she can switch to if Bob identifies the unmatched point. This lets Alice survive until turn $N - O(\sqrt{N})$, whereupon Alice can reconstruct the remaining $O(\sqrt{N})$ elements by maintaining $O(\sqrt{N})$ power sums during the computation.
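The bookkeeping in [GS18] is more delicate than this, but the power-sum idea can be illustrated with a small self-contained sketch (our own toy version, not the algorithm from the paper): the first k power sums of the k as-yet-unsaid numbers determine them uniquely, so storing k integer power sums suffices to reconstruct the remaining set at the end.

```python
from itertools import combinations


def power_sums(elems, k):
    """First k power sums s_j = sum(x**j for x in elems), j = 1..k."""
    return tuple(sum(x ** j for x in elems) for j in range(1, k + 1))


def recover_missing(N, sums):
    """Recover the set of k missing elements of {1,...,N} from its first k
    power sums (brute force over k-subsets; fine for a toy demonstration)."""
    k = len(sums)
    matches = [set(c) for c in combinations(range(1, N + 1), k)
               if power_sums(c, k) == sums]
    assert len(matches) == 1, "k power sums identify a k-set uniquely"
    return matches[0]


missing = {3, 7, 9}
assert recover_missing(10, power_sums(sorted(missing), 3)) == missing
```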


Since Alice cannot store a perfect matching on $K_N$ in o(N) space, this unfortunately does not give any o(N)-space strategies for Alice. In addition, the lower bound techniques from Theorem 6.1.1 fail to give non-trivial guarantees in the randomized model. Demonstrating non-trivial algorithms or non-trivial lower bounds for the general case is an interesting open problem.

6.4 Deterministic Strategies of Alice Require Linear Space

In this section we prove that Alice requires linear space to win the mirror game when N is even. To do this, we will show that if Alice is at a specific memory state, then the possible sets of numbers that have been said till that point in time must form an (Even, Odd)-town (and hence there are at most N such sets for any memory state). Before we get into the main proof, we will define and lower-bound the size of what we call a "covering collection" of subsets (this will later allow us to lower bound the total number of possible sets of numbers said by round t (2t turns)).

Definition 6.4.1. For all integers p, r > 0, a collection of subsets $\mathcal{C}$ of [N] is (p, r)-covering if:

1. each $S \in \mathcal{C}$ has $|S| = pr$.

2. for every $T \subset [N]$ with $|T| = r$, there exists an $S \in \mathcal{C}$ with $T \subset S$.

Lemma 6.4.1. Every (p, r)-covering collection $\mathcal{C}$ has size at least $\binom{N}{r}\big/\binom{pr}{r}$.

Proof. Every set in the collection contains at most $\binom{pr}{r}$ subsets T of cardinality r, and each of the $\binom{N}{r}$ subsets $T \subset [N]$ of cardinality r must be contained in some set of the collection; hence $|\mathcal{C}| \ge \binom{N}{r}\big/\binom{pr}{r}$.

When p = 2, it turns out that the lower bound in Lemma 6.4.1 is maximized when r = N/5.

Corollary 6.4.2. When r = N/5, every (2, r)-covering collection has size at least $2^{(\log_2 5 - 2)N - o(N)}$.
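As a quick numerical sanity check of Corollary 6.4.2 (our own, not part of [GS18]), one can evaluate the bound of Lemma 6.4.1 for p = 2 and see that it is maximized around r = N/5, with exponent approaching $(\log_2 5 - 2)N$:

```python
from math import comb, log2

N = 2000


def exponent(r: int) -> float:
    # log2 of the Lemma 6.4.1 lower bound for p = 2: C(N, r) / C(2r, r)
    return log2(comb(N, r)) - log2(comb(2 * r, r))


r_star = max(range(1, N // 2 + 1), key=exponent)
print(r_star, N // 5)          # the maximizer sits at (roughly) N / 5
print(exponent(r_star) / N)    # approaches log2(5) - 2 ≈ 0.3219 as N grows
```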

Theorem 6.4.1. If N is even, then any winning strategy in the mirror game for Alice requires at least $(\log_2 5 - 2)N - o(N)$ bits of space.


Proof. Fix a winning strategy for Alice. Assume this strategy uses m bits of memory, and thus has $M = 2^m$ distinct states of memory.

Call a subset S of [N] r-occurring if it is possible that immediately after turn 2r (r rounds), the set of numbers that have been said is equal to S. Let $\mathcal{S}_r$ be the collection of all r-occurring sets. Before diving into the main proof, for any fixed deterministic strategy of Alice, we prove a lower bound on $|\mathcal{S}_r|$, i.e., on the number of different subsets of numbers that could have been said in the first 2r turns over various strategies of Bob.

Lemma 6.4.3. Sr is (2, r)-covering.

Proof. Since 2r numbers have been said immediately after turn 2r, every set in $\mathcal{S}_r$ has cardinality 2r. We must show that for any $T \subset [N]$ with $|T| = r$, there exists an S in $\mathcal{S}_r$ with $T \subset S$.

Consider the following strategy for Bob: "say the smallest number in T which has not yet been said". Note that if Bob follows this strategy, then the set of numbers said by turn 2r must contain the entire set T. This set belongs to $\mathcal{S}_r$, and it follows that $\mathcal{S}_r$ is (2, r)-covering.

Write N = 2n, and fix a value $r \in [n]$. For a memory state x out of the M possible memory states and an r-occurring set S, label x with S if it is possible that Alice is at memory state x when the set of numbers that have been said is equal to S. Each state of memory may be labeled with several r-occurring sets or with none, but each r-occurring set must appear as a label of some state of memory (by definition). Let $U_x$ be the collection of r-occurring labels for memory state x. We want to upper bound the size of $U_x$. The following lemma, along with the (Even, Odd)-town bound (Corollary 6.2.2), lets us do exactly that.

Lemma 6.4.4. If distinct sets $S_1$ and $S_2$ belong to $U_x$, then $|S_1 \setminus S_2|$ is odd.

Proof. Let $D = S_1 \triangle S_2$, let $D_1 = S_1 \setminus S_2$, and let $D_2 = S_2 \setminus S_1$. Assume to the contrary that $|D_1|$ is even. Note then that $|D_2|$ is also even (since $|S_1| = |S_2| = 2r$) and that $|D| = 2|D_1|$. We'll consider two possible cases for the state of the game after turn 2r: 1. Alice is at state x, and the set of numbers that have been said is $S_1$, and 2. Alice is at state x, and the set of numbers that have been said is $S_2$.


Consider the following strategy that Bob can play in either of these cases: "say the smallest number that has not been said and is not in D". We claim that if Bob uses this strategy, Alice will be the first person (after turn 2r) to say an element of D. Note that Bob will not say an element of D until after turn $2n - |D|/2$; this is since:

1. If we are in case 1, then all of the numbers in $D_1$ have been said but none of the numbers in $D_2$ have been said. There are therefore $|D_2| = |D|/2$ numbers in D which have not been said, so Bob can avoid saying an element of D till turn $2n - |D|/2$.

2. Likewise, if we are in case 2, the argument proceeds symmetrically.

On the other hand, if no element of D is spoken by either player between turn 2r and turn $2n - |D|/2$, then at turn $2n - |D|/2 + 1$, the only remaining elements belong to D (the set of remaining elements is either $D_1$ or $D_2$, depending on which case we are in). If $|D_1| = |D|/2$ is even, then it is Alice's turn to speak at turn $2n - |D|/2 + 1$. It follows that Alice will be the first person after turn 2r to say an element of D.

Let $y_1$ be the memory state of Alice when she first speaks an element of D in case 1, and define $y_2$ similarly. We claim that $y_1 = y_2$. Indeed, since Alice's strategy is deterministic and starts from x in both cases 1 and 2, Bob's strategy plays identically in both case 1 and case 2 until an element of D has been spoken. It follows that Alice must speak the same element of D in both cases. But if this element is in $D_1$, and they are in case 1, then this element has already been said before; similarly, if this element is in $D_2$, and they are in case 2, then this element has also been said before. Regardless of which element in D Alice speaks at this state, there is some case where she loses, which contradicts the fact that Alice's strategy is successful. It follows that $|D_1|$ must be odd, as desired.

Claim 6.4.1. $|U_x| \le N$.

Proof. We claim the sets in Ux form an (Even, Odd)-town, from which this conclusion follows

(Corollary 6.2.2). Each set in Ux has cardinality 2r, so all sets have even cardinality. By Lemma


6.4.4, any pair of distinct sets $S_1, S_2 \in U_x$ has odd $|S_1 \setminus S_2|$. Note that since $|S_1 \cap S_2| = |S_1| - |S_1 \setminus S_2|$, and since $|S_1|$ is even, it follows that $|S_1 \cap S_2|$ is odd, so any pair of distinct sets has an odd-cardinality intersection.

Choose r = N/5. By Corollary 6.4.2, $\mathcal{S}_r$ must have cardinality at least $2^{(\log_2 5-2)N - o(N)}$. Since each element in $\mathcal{S}_r$ belongs to at least one collection $U_x$, and since each collection $U_x$ has cardinality at most N (Claim 6.4.1), the number M of memory states x is at least $2^{(\log_2 5-2)N - o(N)}/(N+1) = 2^{(\log_2 5-2)N - o(N)}$. It follows that $m = \log M \ge (\log_2 5 - 2)N - o(N)$, as desired. This proves Theorem 6.4.1.

6.5 The (a, b)-Mirror Game

Consider the following generalization of the mirror game, where Alice says a numbers each turn, and Bob says b numbers each turn (from the set [N]). We call this new game the (a, b)-mirror game. As with the regular (1, 1)-mirror game, mirror strategies exist for the class of (1, b)-mirror games (and similarly, the (a, 1)-mirror games).

Theorem 6.5.1. If N is divisible by b + 1, then Bob has a winning strategy computable in O(log N) memory for the (1, b)-mirror game.

Proof. Divide the N elements of [N] into N/(b + 1) consecutive (b + 1)-tuples. Any time Alice says an element in a (b + 1)-tuple, Bob says all the remaining elements in that (b + 1)-tuple.
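A minimal sketch of this tuple strategy (our own illustrative code; it assumes the tuples are the consecutive blocks used in the proof):

```python
def bob_tuple_reply(N: int, b: int, alice_move: int):
    """Bob's O(log N)-space strategy for the (1, b)-mirror game when (b + 1)
    divides N: split {1,...,N} into consecutive (b+1)-tuples and, whenever
    Alice says a number, reply with the rest of its tuple."""
    assert N % (b + 1) == 0
    start = ((alice_move - 1) // (b + 1)) * (b + 1) + 1   # first element of the tuple
    return [x for x in range(start, start + b + 1) if x != alice_move]


# Example: N = 12, b = 2, tuples (1,2,3), (4,5,6), ...; Alice says 5, Bob says 4 and 6.
assert sorted(bob_tuple_reply(12, 2, 5) + [5]) == [4, 5, 6]
```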

Unlike in the (1, 1)-mirror game, we cannot rule out the existence of low-space winning strategies for other games and other choices of N. In [GS18], we consider low-memory winning strategies for (a, b)-mirror games where a + b is a prime, p. We obtain the following theorems, using a simple generalization of Theorem 6.4.1 (see [GS18] for the proofs).

Theorem 6.5.2. Let p be a prime. If N is not divisible by p, then any winning strategy for Bob in the (1, p − 1)-mirror game requires Ω(N) space.


Thus, the above theorem, together with Theorem 6.5.1, fully characterizes the N for which Bob has a low-space winning strategy in the (1, p − 1)-mirror game. When a > 1, the proof techniques no longer allow us to completely characterize the set of N where Bob requires Ω(N) space to win. Instead, we only have the following partial characterization.

Theorem 6.5.3. Let p be a prime. If a + b = p, and N mod p ∈ {a, a + 1, . . . , p − 1}, then any winning strategy for Bob in the (a, b)-mirror game requires Ω(N) space.

6.6 Open Problems

Our understanding of the space complexity aspects of mirror games is still very rudimentary. We

conclude by listing some open problems we find interesting.

1. When do low-space winning strategies exist? Does either Alice or Bob ever have a low-

space winning strategy for any (a, b) game when a and b are both larger than 1 (e.g., the

(2, 2) game)? Are there any cases where the best deterministic winning strategy has space

complexity strictly between O(logN) and O(N) (e.g. O(√N)), or does the best winning

strategy always fit into one of these two extremes?

2. Beating low-space with low-space. In order to show that Alice needs Ω(N) space, our adversarial strategy for Bob also requires Ω(N) space. Given a deterministic strategy for Bob computable in low memory m, is there a low-memory strategy for Alice which wins against it?

3. Lower bounds for randomized strategies. Can we prove any sort of lower bound against randomized strategies that win with high probability?


At the end of the day, we can endure much

more than we think we can.

Frida Kahlo

7 Conclusion

The focus of this thesis is understanding the limitations of space-bounded computation. Is learning infeasible under memory constraints? Is it easy to fool space-bounded computation? Do all the players have low-memory winning strategies in mirror games? The contributions of this thesis can be briefly summarized as follows.

1. An extractor-based approach to proving infeasibility of learning under memory constraints, for a large class of learning problems (Chapter 2).

2. Techniques for proving infeasibility of learning under memory constraints, even when the learner is allowed a second pass over the samples (Chapter 3).


3. Memory-sample trade-offs for certain distinguishing problems and proving security of

Goldreich’s local pseudorandom generator, against memory-bounded adversaries, in the

streaming model (Chapter 4).

4. Pseudorandom pseudo-distributions for read-once branching programs with seed length

that has near-optimal dependence on error (Chapter 5).

5. Proving that Alice needs Ω(n) memory to win the following mirror game: Alice and Bob take turns saying numbers in the set {1, 2, . . . , 2n}. A player loses if they repeat a number that has already been said. Otherwise, after 2n turns, both players win (Chapter 6).

The recent progress towards proving the infeasibility of learning under memory constraints, starting with [Raz16], has also given us a better understanding of the limitations of space-bounded computation in other contexts. For example, the result of [Raz16] can be used to construct pseudorandom generators that fool space-bounded computations (albeit without improving the state of the art). Using the learning lower bounds, [Raz16, GZ19] greatly improved upon the constructions of cryptographic protocols that are secure against memory-bounded adversaries. But we are still far from understanding the landscape of bounded-memory learning. For example, learning without memory constraints can be characterized by VC dimension [BEHW89]. Is there a similar characterization for learning with bounded memory? [GRT18] provides a sufficient condition for proving infeasibility of learning under memory constraints, but there hasn't been much work towards a full characterization of bounded-memory learning (except for a recent work of [GLM20]). We hope that the techniques developed in this thesis further help our understanding of space-bounded computation in various contexts.


A Appendix for Chapter 3

We first state the main theorem of [GRT18] and then the modified proposition used in the proof of Lemma 3.4.4.

Theorem A.0.1 (Theorem 1, [GRT18]). Let $\frac{1}{100} < c < \frac{2}{3}$. Fix $\gamma$ to be such that $\frac{3c}{2} < \gamma^2 < 1$. Let X, A be two finite sets. Let $n = \log_2|X|$. Let $M : A \times X \to \{-1, 1\}$ be a matrix which is a $(k', \ell')$-$L_2$-extractor with error $2^{-r'}$, for sufficiently large $k', \ell'$ and $r'$ (that is, larger than some constant that depends on $\gamma$), where $\ell' \le n$. Let
$$r := \min\left\{\frac{r'}{2},\; \frac{(1-\gamma)k'}{2},\; \frac{(1-\gamma)\ell'}{2} - 1\right\}.$$
Let B be a branching program of length at most $2^r$ and width at most $2^{c\cdot k'\cdot\ell'}$ for the learning problem


that corresponds to the matrix M. Then, the success probability of B is at most $O(2^{-r})$.

The authors prove the above theorem by first defining a truncated path that stops on a significant vertex, a significant value or a bad edge, such that, if the path doesn't stop before reaching a leaf, then the probability of guessing the correct x is small (at most $O(2^{-r})$ to be precise). Then, the authors prove that the probability that the truncated path stops is at most $O(2^{-r})$. Through slight modifications to the proof of the above theorem (with weaker bounds on the memory and length of B, in terms of constants), we can prove that the probability that a slightly modified truncated path stops is at most $2^{-\Omega(\min\{k',\ell'\})}$. As the modified proof is very similar to that of Theorem A.0.1 and the original proof is lengthy, we just highlight the changes to the proof to get the following proposition.

Proposition A.0.1. Let X, A be two finite sets. Let $n = \log_2|X|$. Let $M : A \times X \to \{-1, 1\}$ be a matrix which is a $(k', \ell')$-$L_2$-extractor with error $2^{-r'}$, for sufficiently large $k', \ell'$ and $r'$, where $\ell' \le n$. Let
$$r := \frac{\min\{r', k', \ell'\}}{100}.$$
Let B be a branching program of length at most $2^r$ and width at most $2^{\frac{k'\cdot\ell'}{100}}$ for the learning problem that corresponds to the matrix M. Then, there exists an event G such that
$$\Pr[G] \ge 1 - 2^{-\frac{\min\{k',\ell'\}}{8}}$$
and for every $x' \in X$ and every leaf z of the branching program B (with starting vertex $z_0$),
$$\Pr\left[x = x' \mid G \wedge (z_0\to z)\right] \le 2^{2\ell'}\cdot 2^{-n},$$
whenever the event $G \wedge (z_0\to z)$ is non-empty, where $z_0\to z$ denotes the event that the computational path (as opposed to the truncated path) from $z_0$ reaches z.

Proof. The proof of Theorem 1 of [GRT18] defines the truncated-path, T , to be the same as the

computation-path of B, except that it sometimes stops before reaching a leaf. Roughly speaking,


T stops before reaching a leaf if certain “bad” events occur. Nevertheless, the proof shows that

the probability that T stops before reaching a leaf is negligible, so we can think of T as almost

identical to the computation-path.

For a vertex v of B, we denote by $E_v$ the event that T reaches the vertex v. We denote by $\Pr(v) = \Pr(E_v)$ the probability of $E_v$, and we denote by $P_{x|v} = P_{x|E_v}$ the distribution of the random variable x conditioned on the event $E_v$.

We first look at the definition of the truncated-path from the proof of Theorem 1 of [GRT18]. We modify the stopping rules for a path as follows:

Let $l = \ell'/6$.

Significant Vertices: A vertex v in layer-i of B is significant if $\|P_{x|v}\|_2 > 2^{l}\cdot 2^{-n}$.

Significant Values: Even if v is not significant, $P_{x|v}$ may have relatively large values. For a vertex v in layer-i of B, denote by $\mathrm{Sig}(v)$ the set of all $x' \in X$ such that $P_{x|v}(x') > 2^{3l}\cdot 2^{-n}$.

Bad Edges: For a vertex v in layer-i of B, denote by $\mathrm{Bad}(v)$ the set of all $\alpha \in A$ such that $\left|(M \cdot P_{x|v})(\alpha)\right| \ge 2^{-r'}$.

Recall that the truncated path is defined by induction on the layers of the branching program

B:

The Truncated-Path T

Assume that we already defined T until it reaches a vertex v in layer-i of B. The path T stops on

v if (at least) one of the following occurs:


1. v is significant.

2. x ∈ Sig(v).

3. $a_{i+1} \in \mathrm{Bad}(v)$.

4. v is a leaf.

Otherwise, T proceeds by following the edge labeled by $(a_{i+1}, b_{i+1})$ (same as the computational path).

The Event G

We define G to be the event that the truncated-path T didn't stop because of one of the first three stopping rules; that is, T didn't stop before reaching a leaf and didn't violate the significant vertices and significant values stopping rules (that is, the first two stopping rules) on the leaf that it reached.

We can upper bound the probability of the complement event $\bar{G}$ similarly to the way that it is done in [GRT18].

Lemma A.0.2. The probability that T reaches a significant vertex is at most $2^{-k'}$.

The proof of the above lemma is very similar to the analogous lemma in the proof of Theorem A.0.1. The only change is in the definition of a significant value: we define the significant values to be the set of all $x' \in X$ such that $P_{x|v}(x') > 2^{3l}\cdot 2^{-n}$, instead of the set of all $x' \in X$ such that $P_{x|v}(x') > 2^{2l+2r}\cdot 2^{-n}$. With the above (worse in terms of constants) bounds on the memory and the length of the branching program, the proof works in the same way.

Lemma A.0.2 shows that the probability that T stops on a vertex, because of the first reason

(i.e., that the vertex is significant), is small. The next two claims imply that the probabilities that

T stops on a vertex, because of the second and third reasons, are also small.

Claim A.0.1. If v is a non-significant vertex of B then
$$\Pr_x[x \in \mathrm{Sig}(v) \mid E_v] \le 2^{-l}.$$


Proof. Since v is not significant,
$$\mathop{\mathbb{E}}_{x' \sim P_{x|v}}\left[P_{x|v}(x')\right] = \sum_{x' \in X} P_{x|v}(x')^2 = 2^n \cdot \mathop{\mathbb{E}}_{x' \in_R X}\left[P_{x|v}(x')^2\right] \le 2^{2l}\cdot 2^{-n}.$$
Hence, by Markov's inequality,
$$\Pr_{x' \sim P_{x|v}}\left[P_{x|v}(x') > 2^{l}\cdot 2^{2l}\cdot 2^{-n}\right] \le 2^{-l}.$$
Since conditioned on $E_v$, the distribution of x is $P_{x|v}$, we obtain
$$\Pr_x\left[x \in \mathrm{Sig}(v) \mid E_v\right] = \Pr_x\left[P_{x|v}(x) > 2^{l}\cdot 2^{2l}\cdot 2^{-n} \mid E_v\right] \le 2^{-l}.$$

Claim A.0.2. If v is a non-significant vertex of B then
$$\Pr_{a_{i+1}}[a_{i+1} \in \mathrm{Bad}(v)] \le 2^{-k'}.$$

Proof. Since v is not significant, $\|P_{x|v}\|_2 \le 2^{l}\cdot 2^{-n}$. Since $P_{x|v}$ is a distribution, $\|P_{x|v}\|_1 = 2^{-n}$. Thus,
$$\frac{\|P_{x|v}\|_2}{\|P_{x|v}\|_1} \le 2^{l} \le 2^{\ell'}.$$
Since M is a $(k', \ell')$-$L_2$-extractor with error $2^{-r'}$, there are at most $2^{-k'}\cdot|A|$ elements $\alpha \in A$ with
$$\left|\langle M_\alpha, P_{x|v}\rangle\right| \ge 2^{-r'}\cdot\|P_{x|v}\|_1 = 2^{-r'}\cdot 2^{-n}.$$
The claim follows since $a_{i+1}$ is uniformly distributed over A.

We can now use Lemma A.0.2, Claim A.0.1 and Claim A.0.2 to prove that the probability that T stops because of one of the first three stopping rules is at most $2^{-\frac{\min\{k',\ell'\}}{8}}$. Lemma A.0.2 shows that the probability that T reaches a significant vertex, and hence stops because of the first stopping rule, is at most $2^{-k'}$. Assuming that T doesn't reach any significant vertex (in which case


it would have stopped because of the first stopping rule), Claim A.0.1 shows that in each step, the probability that T stops because of the second stopping rule is at most $2^{-\ell'/6}$. Taking a union bound over the $2^r$ steps, the total probability that T stops because of the second stopping rule is at most $2^{-\ell'/7}$ (for sufficiently large $\ell'$). In the same way, assuming that T doesn't reach any significant vertex (in which case it would have stopped because of the first stopping rule), Claim A.0.2 shows that in each step, the probability that T stops because of the third stopping rule is at most $2^{-k'}$. Again, taking a union bound over the $2^r$ steps, the total probability that T stops because of the third stopping rule is at most $2^{-k'/7}$. Thus, the total probability that T stops (for any reason) before reaching a leaf, or violates the significant vertices or significant values stopping rules (that is, the first two stopping rules) on the leaf that it reached, is at most $2^{-\frac{\min\{k',\ell'\}}{8}}$ (summing over the three probabilities and using the fact that $k', \ell'$ are sufficiently large).
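In symbols, with $l = \ell'/6$ and $r = \min\{r',k',\ell'\}/100$ as above, the bound just derived can be summarized as
$$\Pr[\bar{G}] \;\le\; 2^{-k'} \;+\; 2^{r}\cdot 2^{-l} \;+\; 2^{r}\cdot 2^{-k'} \;\le\; 2^{-k'} + 2^{-\ell'/7} + 2^{-k'/7} \;\le\; 2^{-\frac{\min\{k',\ell'\}}{8}},$$
for sufficiently large $k'$ and $\ell'$.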

Thus, $\Pr[\bar{G}] \le 2^{-\frac{\min\{k',\ell'\}}{8}}$, i.e., $\Pr[G] \ge 1 - 2^{-\frac{\min\{k',\ell'\}}{8}}$, as required.

Bounding $\Pr[x = x' \mid G \wedge (z_0\to z)]$:

It remains to prove that for every $x' \in X$ and every leaf z of the branching program B (with starting vertex $z_0$),
$$\Pr\left[x = x' \mid G \wedge (z_0\to z)\right] \le 2^{2\ell'}\cdot 2^{-n},$$
whenever the event $G \wedge (z_0\to z)$ is non-empty, where $z_0\to z$ denotes the event that the computational path (as opposed to the truncated path) from $z_0$ reaches z.

Recall that Ez is the event that T reaches the vertex z.

Claim A.0.3. The event $G \wedge (z_0\to z)$ is equivalent to $E_z \wedge (z \text{ is not significant}) \wedge (x \notin \mathrm{Sig}(z))$.

Proof. If $G \wedge (z_0\to z)$ occurs then the truncated-path T didn't stop before reaching a leaf (since G occurs) and the computational path from $z_0$ reaches z (since $(z_0\to z)$ occurs). Thus, $E_z$ occurs. Also, since the first stopping rule is not violated on z, we have that z is not significant, and since the second stopping rule is not violated on z, we have $x \notin \mathrm{Sig}(z)$.


In the other direction, if $E_z \wedge (z \text{ is not significant}) \wedge (x \notin \mathrm{Sig}(z))$ occurs, then $(z_0\to z)$ occurs (since $E_z$ occurs), the truncated-path T didn't stop before reaching a leaf (since $E_z$ occurs), and none of the first two stopping rules are violated on z, since z is not significant and $x \notin \mathrm{Sig}(z)$.

By Claim A.0.3, it remains to prove that for every leaf z and every $x' \in X$, if the event $E_z \wedge (z \text{ is not significant}) \wedge (x \notin \mathrm{Sig}(z))$ is non-empty then
$$\Pr\left[x = x' \mid E_z \wedge (z \text{ is not significant}) \wedge (x \notin \mathrm{Sig}(z))\right] \le 2^{2\ell'}\cdot 2^{-n}.$$
Equivalently, we need to prove that for every non-significant leaf z and every $x' \in X$, if the event $E_z \wedge (x \notin \mathrm{Sig}(z))$ is non-empty then
$$P_{x|E_z \wedge (x \notin \mathrm{Sig}(z))}(x') = \Pr\left[x = x' \mid E_z \wedge (x \notin \mathrm{Sig}(z))\right] \le 2^{2\ell'}\cdot 2^{-n}.$$

By the definition of conditional distribution,
$$P_{x|E_z \wedge (x \notin \mathrm{Sig}(z))}(x') = \begin{cases} 0 & \text{if } x' \in \mathrm{Sig}(z) \\ P_{x|E_z}(x')\cdot c^{-1} & \text{if } x' \notin \mathrm{Sig}(z) \end{cases}$$
where $c = \sum_{x' \notin \mathrm{Sig}(z)} P_{x|E_z}(x')$ is the normalization factor. As z is not significant, by Claim A.0.1,
$$\Pr_x[x \in \mathrm{Sig}(z) \mid E_z] \le 2^{-l}.$$
Therefore, $c \ge 1 - 2^{-l}$. Since by the definition of $\mathrm{Sig}(z)$, for $x' \notin \mathrm{Sig}(z)$ we have $P_{x|E_z}(x') \le 2^{3l}\cdot 2^{-n}$, we can bound
$$P_{x|E_z \wedge (x \notin \mathrm{Sig}(z))}(x') \le 2^{3l}\cdot 2^{-n}\cdot c^{-1} \le 2^{3l+1}\cdot 2^{-n} \le 2^{2\ell'}\cdot 2^{-n}.$$


References

[AB09] S. Arora and B. Barak. Computational Complexity - A Modern Approach. Cambridge University Press, 2009.

[ABR16] Benny Applebaum, Andrej Bogdanov, and AlonRosen. A dichotomy for local small-bias generators. J. Cryptology, 29(3):577–596, 2016.

[ABSRW04] Michael Alekhnovich, Eli Ben-Sasson, Alexander A. Razborov, and Avi Wigder-son. Pseudorandom generators in propositional proof complexity. SIAM J. Comput.,34(1):67–88, 2004.

[ABW10] Benny Applebaum, Boaz Barak, and Avi Wigderson. Public-key cryptography fromdifferent assumptions. In STOC’10—Proceedings of the 2010 ACM International Sympo-sium on Theory of Computing, pages 171–180. ACM, New York, 2010.

[ADR02] YonatanAumann, Yan ZongDing, andMichael O. Rabin. Everlasting security in thebounded storage model. IEEE Trans. Information Theory, 48(6):1668–1680, 2002.

[AJS15] Prabhanjan Ananth, Abhishek Jain, and Amit Sahai. Indistinguishability obfuscationfrom functional encryption for simple functions. Cryptology ePrint Archive, Report2015/730, 2015. https://eprint.iacr.org/2015/730.

[Ajt99] Miklos Ajtai. A non-linear time lower bound for boolean branching programs. In 40thAnnual Symposium on Foundations of Computer Science (Cat.No. 99CB37039), pages 60–70. IEEE, 1999.

[AKM+19] AmirMahdi Ahmadinejad, Jonathan Kelner, Jack Murtagh, John Peebles, AaronSidford, and Salil Vadhan. High-precision estimation of random walks in small space.arXiv preprint arXiv:1912.04524, 2019.

[AL16] Benny Applebaum and Shachar Lovett. Algebraic attacks against random local func-tions and their countermeasures. In STOC, pages 1087–1100. ACM, 2016.

[Alo95] NogaAlon. Tools fromhigher algebra.Handbook of combinatorics, 2:1749–1783, 1995.

[AOW15] Sarah R. Allen, RyanO’Donnell, and DavidWitmer. How to refute a randomCSP.In 2015 IEEE 56thAnnual Symposium on Foundations of Computer Science—FOCS 2015,pages 689–708. IEEE Computer Soc., Los Alamitos, CA, 2015.

[App13] BennyApplebaum. Pseudorandom generators with long stretch and low locality fromrandom local one-way functions. SIAM J. Comput., 42(5):2008–2037, 2013.


[App16] BennyApplebaum. Cryptographic hardness of random local functions - survey. Com-putational Complexity, 25(3):667–722, 2016.

[AR99] Yonatan Aumann and Michael O. Rabin. Information theoretically secure commu-nication in the limited storage space model. In Advances in Cryptology - CRYPTO ’99,19th Annual International Cryptology Conference, Santa Barbara, California, USA, Au-gust 15-19, 1999, Proceedings, pages 65–79, 1999.

[Arm98] Roy Armoni. On the derandomization of space-bounded computations. In Interna-tional Workshop on Randomization and Approximation Techniques in Computer Science,pages 47–59. Springer, 1998.

[ATSWZ00] Roy Armoni, Amnon Ta-Shma, Avi Wigderson, and Shiyu Zhou. AnO(log(n)4/3) space algorithm for (s, t) connectivity in undirected graphs. Journal of theACM, 47(2):294–311, 2000.

[Bar89] David A Barrington. Bounded-width polynomial-size branching programs recognizeexactly those languages in nc1. Journal of Computer and System Sciences, 38(1):150–164,1989.

[BBKK18] Boaz Barak, Zvika Brakerski, Ilan Komargodski, and Pravesh K. Kothari. Limits onlow-degree pseudorandom generators (or: Sum-of-squares meets program obfuscation).In Advances in Cryptology - EUROCRYPT 2018 - 37th Annual International Conferenceon theTheory andApplications ofCryptographicTechniques, TelAviv, Israel, April 29 -May3, 2018 Proceedings, Part II, pages 649–679, 2018.

[BCG03] Elwyn R. Berlekamp, John Horton Conway, and Richard K. Guy. Winning ways for your mathematical plays, volume 3. A K Peters, Natick, 2003.

[BCG18] MarkBraverman,GilCohen, andSumeghaGarg. Hitting setswithnear-optimal errorfor read-once branching programs. In Proceedings of the 50th Annual ACM SIGACTSymposium on Theory of Computing, pages 353–362, 2018.

[BCK15] Boaz Barak, Siu On Chan, and Pravesh K. Kothari. Sum of squares lower boundsfrom pairwise independence [extended abstract]. In STOC’15—Proceedings of the 2015ACM Symposium on Theory of Computing, pages 97–106. ACM, New York, 2015.

[BCP83] Allan Borodin, Stephen Cook, and Nicholas Pippenger. Parallel computation forwell-endowed rings and space-bounded probabilistic machines. Information and Con-trol, 58(1-3):113–136, 1983.

[BEHL12] Ido Ben-Eliezer, Rani Hod, and Shachar Lovett. Random low-degree polynomialsare hard to approximate. Computational Complexity, 21(1):63–81, 2012.

[BEHW89] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.


[Ber69] E. R. Berlekamp. On subsets with intersections of even cardinality. Canad. Math. Bull., 12(4):471–477, 1969.

[BFJ+94] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, andSteven Rudich. Weakly learning dnf and characterizing statistical query learning usingfourier analysis. In Proceedings of the twenty-sixth annual ACM symposium on Theory ofcomputing, pages 253–262, 1994.

[BGY18] Paul Beame, ShayanOveis Gharan, andXin Yang. Time-space tradeoffs for learning fi-nite functions from random evaluations, with applications to polynomials. InConferenceOn Learning Theory, pages 843–856, 2018.

[BL06] Yonatan Bilu and Nathan Linial. Lifts, discrepancy and nearly optimal spectral gap.Combinatorica, 26(5):495–519, 2006.

[BM16] Boaz Barak and Ankur Moitra. Noisy tensor completion via the sum-of-squares hier-archy. In COLT, volume 49 of JMLRWorkshop and Conference Proceedings, pages 417–445. JMLR.org, 2016.

[BPW11] A. Bogdanov, P. Papakonstaninou, and A. Wan. Pseudorandomness for read-onceformulas. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science(FOCS), pages 240–246. IEEE, 2011.

[BPW12] A. Bogdanov, P. Papakonstantinou, andA.Wan. Pseudorandomness for linear lengthbranching programs and stack machines. In APPROX-RANDOM, pages 447–458.Springer, 2012.

[BR94] M. Bellare and J. Rompel. Randomness-efficient oblivious sampling. In Proceedings ofthe 35th Annual Symposium on Foundations of Computer Science, 1994, pages 276–287.IEEE, 1994.

[BRRY14] M. Braverman, A. Rao, R. Raz, and A. Yehudayoff. Pseudorandom generators forregular branching programs. SIAM Journal on Computing, 43(3):973–986, 2014.

[BSSV03] Paul Beame, Michael Saks, Xiaodong Sun, and Erik Vee. Time-space trade-off lowerbounds for randomized computation of decision problems. Journal of the ACM (JACM),50(2):154–195, 2003.

[BV10] JoshuaBrody andEladVerbin. The coinproblemandpseudorandomness for branchingprograms. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science,pages 30–39. IEEE, 2010.

[CC17] AmitChakrabarti and YiningChen. Time-space tradeoffs for thememory game. arXivpreprint arXiv:1712.01330, 2017.

[CG88] BennyChor andOdedGoldreich. Unbiased bits from sources ofweak randomness andprobabilistic communication complexity. SIAM Journal on Computing, 17(2):230–261,1988.


[CH20] Kuan Cheng andWilliamHoza. Hitting sets give two-sided derandomization of smallspace. Electronic Colloquium on Computational Complexity (ECCC), 27:16, 2020.

[CL20] Eshan Chattopadhyay and Jyun-Jie Liao. Optimal error pseudodistributions for read-once branching programs. arXiv preprint arXiv:2002.07208, 2020.

[CM97] Christian Cachin and Ueli M. Maurer. Unconditional security against memory-bounded adversaries. In Advances in Cryptology - CRYPTO ’97, 17th Annual Interna-tional Cryptology Conference, Santa Barbara, California, USA, August 17-21, 1997, Pro-ceedings, pages 292–306, 1997.

[CM01] MaryCryan andPeter BroMiltersen. Onpseudorandomgenerators inNC. In 26th In-ternational Symposium onMathematical Foundations of Computer Science, MFCS, pages272–284, 2001.

[CMVW16] Michael S. Crouch, AndrewMcGregor, Gregory Valiant, and David P.Woodruff.Stochastic streams: Sample complexity vs. space complexity. InESA, volume 57 ofLIPIcs,pages 32:1–32:15. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016.

[Dan16] Amit Daniely. Complexity theoretic limitations on learning halfspaces. In Proceed-ings of the 48th Annual ACMSIGACT Symposium on Theory of Computing, STOC 2016,Cambridge, MA, USA, June 18-21, 2016, pages 105–117, 2016.

[De11] A. De. Pseudorandomness for permutation and regular branching programs. In2011 IEEE26thAnnualConference onComputationalComplexity (CCC), pages 221–231.IEEE, 2011.

[DGKR19] Ilias Diakonikolas, Themis Gouleakis, Daniel M Kane, and Sankeerth Rao. Com-munication and memory efficient testing of discrete distributions. In Conference onLearning Theory, pages 1070–1106, 2019.

[DLS13] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. More data speeds up trainingtime in learning halfspaces over sparse vectors. InNIPS, pages 145–153, 2013.

[DLS14a] AmitDaniely, Nati Linial, and Shai Shalev-Shwartz. The complexity of learning half-spaces using generalized linear methods. In COLT, volume 35 of JMLRWorkshop andConference Proceedings, pages 244–286. JMLR.org, 2014.

[DLS14b] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexityto improper learning complexity. In STOC, pages 441–448. ACM, 2014.

[DM04] Stefan Dziembowski and Ueli M. Maurer. On generating the initial key in thebounded-storage model. In Advances in Cryptology - EUROCRYPT 2004, Interna-tional Conference on the Theory and Applications of Cryptographic Techniques, Interlaken,Switzerland,May 2-6, 2004, Proceedings, pages 126–137, 2004.

[DS16] Amit Daniely and Shai Shalev-Shwartz. Complexity theoretic limitations on learningdnf’s. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York,USA, June 23-26, 2016, pages 815–830, 2016.


[DS18] Yuval Dagan andOhad Shamir. Detecting correlations with littlememory and commu-nication. In Conference On Learning Theory, pages 1145–1198, 2018.

[DSTS17] Dean Doron, Amir Sarid, and Amnon Ta-Shma. On approximating the eigenvaluesof stochastic matrices in probabilistic logspace. Computational Complexity, 26(2):393–420, 2017.

[Fei02] Uriel Feige. Relations between average case complexity and approximation complexity.In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing,pages 534–543. ACM, New York, 2002.

[Fei19] Uriel Feige. A randomized strategy in the mirror game. arXiv preprint arXiv:1901.07809, 2019.

[FK18] Michael A Forbes and Zander Kelley. Pseudorandom generators for read-once branch-ing programs, in any order. In 2018 IEEE 59th Annual Symposium on Foundations ofComputer Science (FOCS), pages 946–955. IEEE, 2018.

[FLVMV05] Lance Fortnow, Richard Lipton, Dieter Van Melkebeek, and Anastasios Viglas.Time-space lower bounds for satisfiability. Journal of the ACM (JACM), 52(6):835–865,2005.

[For00] Lance Fortnow. Time–space tradeoffs for satisfiability. Journal of Computer and SystemSciences, 60(2):337–353, 2000.

[GKR20] Sumegha Garg, Pravesh K Kothari, and Ran Raz. Time-space tradeoffs for distin-guishing distributions and applications to security of goldreich’s prg. arXiv preprintarXiv:2002.07235, 2020.

[GLM20] Alon Gonen, Shachar Lovett, and Michal Moshkovitz. Towards a combinatorial characterization of bounded memory learning. arXiv preprint arXiv:2002.03123, 2020.

[GMR+12] P. Gopalan, R. Meka, O. Reingold, L. Trevisan, and S. Vadhan. Better pseudo-random generators frommilder pseudorandom restrictions. In 2012 IEEE 53rd AnnualSymposium on Foundations of Computer Science (FOCS), pages 120–129. IEEE, 2012.

[GMRZ13] P. Gopalan, R.Meka, O. Reingold, and D. Zuckerman. Pseudorandom generatorsfor combinatorial shapes. SIAM Journal on Computing, 42(3):1051–1076, 2013.

[Gol00] Oded Goldreich. Candidate one-way functions based on expander graphs. ElectronicColloquium on Computational Complexity (ECCC), 7(90), 2000.

[Gol11] O. Goldreich. A sample of samplers: A computational perspective on sampling. InStudies in Complexity and Cryptography, volume 6650 of Lecture Notes in Computer Sci-ence. Springer, 2011.

[GRT18] Sumegha Garg, Ran Raz, and Avishay Tal. Extractor-based time-space lower bounds for learning. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 990–1002, 2018.


[GRT19] Sumegha Garg, Ran Raz, and Avishay Tal. Time-space lower bounds for two-passlearning. In 34th Computational Complexity Conference (CCC 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.

[GS71] Ronald L Graham and Joel H Spencer. A constructive solution to a tournament prob-lem. CanadianMathematical Bulletin, 14(1):45–48, 1971.

[GS18] Sumegha Garg and Jon Schneider. The space complexity of mirror games. In 10th Innovations in Theoretical Computer Science Conference (ITCS 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

[GV17] R. Gurjar and B. Volk. Pseudorandom bits for oblivious branching programs. arXivpreprint arXiv:1708.02054, 2017.

[GW97] O. Goldreich and A. Wigderson. Tiny families of functions with random properties:A quality-size trade-off for hashing. Random Struct. Algorithms, 11(4):315–343, 1997.

[GZ19] Jiaxin Guan and Mark Zhandry. Simple schemes in the bounded storage model. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 500–524. Springer, 2019.

[HZ18] William Hoza and David Zuckerman. Simple optimal hitting sets for small-success rl.In 2018 IEEE 59thAnnual Symposium on Foundations of Computer Science (FOCS), pages59–64. IEEE, 2018.

[IKO+11] Yuval Ishai, Eyal Kushilevitz, Rafail Ostrovsky,Manoj Prabhakaran, andAmit Sahai.Efficient non-interactive secure computation. InAdvances in Cryptology - EUROCRYPT2011 - 30th Annual International Conference on the Theory and Applications of Crypto-graphic Techniques, Tallinn, Estonia,May 15-19, 2011. Proceedings, pages 406–425, 2011.

[IKW02] R. Impagliazzo, V. Kabanets, and A. Wigderson. In search of an easy witness: Expo-nential time vs. probabilistic polynomial time. Journal of Computer and System Sciences,65(4):672–694, 2002.

[Imm88] Neil Immerman. Nondeterministic space is closed under complementation. SIAMJournal on computing, 17(5):935–938, 1988.

[IMZ12] R. Impagliazzo, R.Meka, andD. Zuckerman. Pseudorandomness from shrinkage. In2012 IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages111–119. IEEE, 2012.

[INW94] R. Impagliazzo, N. Nisan, and A. Wigderson. Pseudorandomness for network algo-rithms. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing,STOC 1994, pages 356–364. ACM, 1994.

[Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of theACM (JACM), 45(6):983–1006, 1998.


[KI04] V. Kabanets and R. Impagliazzo. Derandomizing polynomial identity tests means prov-ing circuit lower bounds. Computational Complexity, 13(1-2):1–46, 2004.

[KL18] Pravesh K. Kothari and Roi Livni. Improper learning by refuting. In 9th Innovationsin Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge,MA, USA, pages 55:1–55:10, 2018.

[KMOW17] Pravesh K. Kothari, Ryuhei Mori, Ryan O’Donnell, and David Witmer. Sum ofsquares lower bounds for refuting any CSP. In Proceedings of the 49th Annual ACMSIGACT Symposium on Theory of Computing, STOC 2017,Montreal, QC, Canada, June19-23, 2017, pages 132–145, 2017.

[KNP11] M. Kouckỳ, P. Nimbhorkar, and P. Pudlák. Pseudorandom generators for groupproducts. In Proceedings of the forty-third annual ACM symposium on Theory of comput-ing, pages 263–272. ACM, 2011.

[KR13] Gillat Kol and Ran Raz. Interactive channel capacity. In Proceedings of the forty-fifthannual ACM symposium on Theory of computing, pages 715–724, 2013.

[KRT17] Gillat Kol, RanRaz, and Avishay Tal. Time-space hardness of learning sparse parities.In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing,pages 1067–1080, 2017.

[Lin16a] Huijia Lin. Indistinguishability obfuscation from constant-degree graded encodingschemes. In Advances in Cryptology - EUROCRYPT 2016 - 35th Annual InternationalConference on the Theory and Applications of Cryptographic Techniques, Vienna, Austria,May 8-12, 2016, Proceedings, Part I, pages 28–57, 2016.

[Lin16b] Huijia Lin. Indistinguishability obfuscation from sxdhon5-linearmaps and locality-5prgs. Cryptology ePrint Archive, Report 2016/1096, 2016. https://eprint.iacr.org/2016/1096.

[LT17] Huijia Lin and Stefano Tessaro. Indistinguishability obfuscation from trilinear mapsand block-wise local prgs. In Advances in Cryptology - CRYPTO 2017 - 37th Annual In-ternationalCryptologyConference, SantaBarbara, CA,USA,August 20-24, 2017, Proceed-ings, Part I, pages 630–660, 2017.

[LV16] Huijia Lin and Vinod Vaikuntanathan. Indistinguishability obfuscation from ddh-likeassumptions on constant-degree graded encodings. In IEEE 57th Annual Symposium onFoundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, NewBrunswick, New Jersey, USA, pages 11–20, 2016.

[LV17a] Alex Lombardi and Vinod Vaikuntanathan. Limits on the locality of pseudorandomgenerators and applications to indistinguishability obfuscation. InTheory ofCryptography- 15th International Conference, TCC, volume 10677, pages 119–137. Springer, 2017.

[LV17b] AlexLombardi andVinodVaikuntanathan.Minimizing the complexity ofGoldreich’spseudorandom generator. IACR Cryptology ePrint Archive, page 277, 2017.


[Mau92] Ueli M. Maurer. Conditionally-perfect secrecy and a provably-secure randomized ci-pher. J. Cryptology, 5(1):53–66, 1992.

[MM17] Dana Moshkovitz and Michal Moshkovitz. Mixing implies lower bounds for spacebounded learning. In Conference on Learning Theory, pages 1516–1566, 2017.

[MM18] DanaMoshkovitz andMichalMoshkovitz. Entropy samplers and strong generic lowerbounds for space bounded learning. In 9th Innovations in Theoretical Computer ScienceConference (ITCS 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

[MRSV17] JackMurtagh,OmerReingold, Aaron Sidford, and Salil Vadhan. Derandomizationbeyond connectivity: Undirected laplacian systems in nearly logarithmic space. In 2017IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 801–812. IEEE, 2017.

[MRT19] RaghuMeka,OmerReingold, andAvishayTal. Pseudorandomgenerators forwidth-3 branching programs. In Proceedings of the 51st Annual ACM SIGACT Symposium onTheory of Computing, pages 626–637, 2019.

[MST03] ElchananMossel, Amir Shpilka, and Luca Trevisan. On e-biased generators in NC0.In FOCS, pages 136–145. IEEE Computer Society, 2003.

[MST06] Elchanan Mossel, Amir Shpilka, and Luca Trevisan. On ε-biased generators in NC0.Random Structures Algorithms, 29(1):56–81, 2006.

[MT17] MichalMoshkovitz andNaftali Tishby. Mixing complexity and its applications to neu-ral networks. arXiv preprint arXiv:1703.00729, 2017.

[Nis92] N.Nisan. Pseudorandom generators for space-bounded computation. Combinatorica,12(4):449–461, 1992.

[Nis94] N. Nisan. RL ⊆ SC. Computational Complexity, 4(1):1–11, 1994.

[NSW92] N. Nisan, E. Szemeredi, and A.Wigderson. Undirected connectivity in O(log(n)1.5)space. In Proceedings of the 33rd Annual Symposium on Foundations of Computer Science,1992, pages 24–29. IEEE, 1992.

[NW94] N.Nisan andA.Wigderson. Hardness vs randomness. J. Comput. Syst. Sci., 49(2):149–167, 1994.

[NZ96] N. Nisan and D. Zuckerman. Randomness is linear in space. Journal of Computer andSystem Sciences, 52(1):43–52, 1996.

[OW14] Ryan O’Donnell and David Witmer. Goldreich’s PRG: evidence for near-optimalpolynomial stretch. In IEEE 29th Conference on Computational Complexity—CCC 2014,pages 1–12. IEEE Computer Soc., Los Alamitos, CA, 2014.

[Raz05] Ran Raz. Extractors with weak random seeds. In Proceedings of the thirty-seventh an-nual ACM symposium on Theory of computing, pages 11–20, 2005.


[Raz16] Ran Raz. Fast learning requires good memory: A time-space lower bound for parity learning. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 266–275. IEEE, 2016.

[Raz17] Ran Raz. A time-space lower bound for a large class of learning problems. In 2017IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 732–742. IEEE, 2017.

[Rei08] O. Reingold. Undirected connectivity in log-space. Journal of the ACM (JACM),55(4):17, 2008.

[RR99] R. Raz andO.Reingold. On recycling the randomness of states in space bounded com-putation. In Proceedings of the thirty-first annual ACM symposium on Theory of comput-ing, pages 159–168. ACM, 1999.

[RRS17] Prasad Raghavendra, Satish Rao, and Tselil Schramm. Strongly refuting random cspsbelow the spectral threshold. In STOC, pages 121–131. ACM, 2017.

[RSV13] Omer Reingold, Thomas Steinke, and Salil Vadhan. Pseudorandomness for regu-lar branching programs via fourier analysis. In APPROX-RANDOM, pages 655–670.Springer, 2013.

[RTV06] Omer Reingold, Luca Trevisan, and Salil Vadhan. Pseudorandom walks on regulardigraphs and the RL vs. L problem. In STOC, volume 6, pages 457–466, 2006.

[RV05] Eyal Rozenman and Salil Vadhan. Derandomized squaring of graphs. In APPROX-RANDOM, pages 436–447. Springer, 2005.

[RVW01] O. Reingold, S. Vadhan, and A. Wigderson. Entropy waves, the zig-zag graph prod-uct, and new constant-degree expanders and extractors. InElectronic Colloquium onCom-putational Complexity (ECCC), page 18, 2001. https://eccc.weizmann.ac.il/report/2001/018/.

[Sav70] W. J. Savitch. Relationships between nondeterministic and deterministic tape complex-ities. Journal of computer and system sciences, 4(2):177–192, 1970.

[Sha14] Ohad Shamir. Fundamental limits of online and distributed algorithms for statisticallearning and estimation. In Advances in Neural Information Processing Systems, pages163–171, 2014.

[SSV19] Vatsal Sharan,AaronSidford, andGregoryValiant. Memory-sample tradeoffs for linearregression with small error. In Proceedings of the 51st Annual ACM SIGACT Symposiumon Theory of Computing, pages 890–901, 2019.

[Ste12] T. Steinke. Pseudorandomness for permutation branching programswithout the grouptheory. In Electronic Colloquium on Computational Complexity (ECCC), volume 19,page 6, 2012.


[SV84] M Santha and UVVazirani. Generating quasi-random sequences from slightly-randomsources. InProceedings of the 25thAnnual SymposiumonFoundations ofComputer Science,1984, pages 434–440, 1984.

[SVW14] Thomas Steinke, Salil Vadhan, and Andrew Wan. Pseudorandomness and fouriergrowthbounds forwidth-3 branching programs. InApproximation, Randomization, andCombinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014).Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014.

[SVW16] Jacob Steinhardt, Gregory Valiant, and StefanWager. Memory, communication, andstatistical queries. In Conference on Learning Theory, pages 1490–1516, 2016.

[SZ99] M. Saks and S. Zhou. BPHSPACE(s) ⊆ DSPACE(s3/2). Journal of computer andsystem sciences, 58(2):376–403, 1999.

[ŠŽ11] Jiří Šíma and Stanislav Žák. Almost k-wise independent sets establish hitting sets forwidth-3 1-branching programs. In International Computer Science Symposium in Russia,pages 120–133. Springer, 2011.

[Sza11] Tibor Szabó. http://discretemath.imp.fu-berlin.de/dmii-2011-12/linalgmethod.pdf, Winter 2011.

[Tri08] Vladimir Trifonov. An O(log(n) log log(n)) space algorithm for undirected st-connectivity. SIAM Journal on Computing, 38(2):449–483, 2008.

[TT18] Stefano Tessaro and Aishwarya Thiruvengadam. Provable time-memory trade-offs:symmetric cryptography against memory-bounded adversaries. In Theory of Cryptogra-phy Conference, pages 3–32. Springer, 2018.

[Vad03] Salil P. Vadhan. On constructing locally computable extractors and cryptosystems inthe bounded storagemodel. InAdvances in Cryptology - CRYPTO2003, 23rdAnnual In-ternational Cryptology Conference, Santa Barbara, California, USA, August 17-21, 2003,Proceedings, pages 61–77, 2003.

[Vad11] S. P. Vadhan. Pseudorandomness. Foundations and Trends in Theoretical ComputerScience, 2011.

[Vad17] Salil P. Vadhan. On learning vs. refutation. In Proceedings of the 30th Conference onLearningTheory, COLT2017, Amsterdam,TheNetherlands, 7-10 July 2017, pages 1835–1848, 2017.

[VV16] Gregory Valiant and Paul Valiant. Information theoretically secure databases. arXivpreprint arXiv:1605.02646, 2016.

[Wig19] AviWigderson. Mathematics and Computation: A Theory Revolutionizing Technologyand Science. Princeton University Press, 2019.

[Wil06] RyanWilliams. Inductive time-space lower bounds for sat and related problems. Com-putational Complexity, 15(4):433–470, 2006.


[Zuc97] D. Zuckerman. Randomness-optimal oblivious sampling. Random Structures andAlgorithms, 11(4):345–367, 1997.

[Zuc07] D. Zuckerman. Linear degree extractors and the inapproximability of max clique andchromatic number. Theory of Computing, 3:103–128, 2007.
