
Mathematical Statistics: Asymptotic Minimax Theory

Alexander Korostelev
Olga Korosteleva

Graduate Studies in Mathematics, Volume 119

American Mathematical Society, Providence, Rhode Island

EDITORIAL COMMITTEE
David Cox (Chair)
Rafe Mazzeo
Martin Scharlemann
Gigliola Staffilani

2010 Mathematics Subject Classification. Primary 62F12, 62G08; Secondary 62F10, 62G05, 62G10, 62G20.

For additional information and updates on this book, visit www.ams.org/bookpages/gsm-119

Library of Congress Cataloging-in-Publication Data

Korostelev, A. P. (Aleksandr Petrovich)
Mathematical statistics : asymptotic minimax theory / Alexander Korostelev, Olga Korosteleva.
p. cm. — (Graduate studies in mathematics ; v. 119)
Includes bibliographical references and index.
ISBN 978-0-8218-5283-5 (alk. paper)
1. Estimation theory. 2. Asymptotic efficiencies (Statistics) 3. Statistical hypothesis testing.
I. Korostelev, Olga. II. Title.
QA276.8.K667 2011
519.5–dc22
2010037408

Copying and reprinting. Individual readers of this publication, and nonprofit libraries acting for them, are permitted to make fair use of the material, such as to copy a chapter for use in teaching or research. Permission is granted to quote brief passages from this publication in reviews, provided the customary acknowledgment of the source is given.

Republication, systematic copying, or multiple reproduction of any material in this publication is permitted only under license from the American Mathematical Society. Requests for such permission should be addressed to the Acquisitions Department, American Mathematical Society, 201 Charles Street, Providence, Rhode Island 02904-2294 USA. Requests can also be made by e-mail to [email protected].

© 2011 by the American Mathematical Society. All rights reserved.
The American Mathematical Society retains all rights except those granted to the United States Government.
Printed in the United States of America.
The paper used in this book is acid-free and falls within the guidelines established to ensure permanence and durability.

Visit the AMS home page at http://www.ams.org/

Contents

Preface

Part 1. Parametric Models

Chapter 1. The Fisher Efficiency
§1.1. Statistical Experiment
§1.2. The Fisher Information
§1.3. The Cramer-Rao Lower Bound
§1.4. Efficiency of Estimators
Exercises

Chapter 2. The Bayes and Minimax Estimators
§2.1. Pitfalls of the Fisher Efficiency
§2.2. The Bayes Estimator
§2.3. Minimax Estimator. Connection Between Estimators
§2.4. Limit of the Bayes Estimator and Minimaxity
Exercises

Chapter 3. Asymptotic Minimaxity
§3.1. The Hodges Example
§3.2. Asymptotic Minimax Lower Bound
§3.3. Sharp Lower Bound. Normal Observations
§3.4. Local Asymptotic Normality (LAN)
§3.5. The Hellinger Distance
§3.6. Maximum Likelihood Estimator
§3.7. Proofs of Technical Lemmas
Exercises

Chapter 4. Some Irregular Statistical Experiments
§4.1. Irregular Models: Two Examples
§4.2. Criterion for Existence of the Fisher Information
§4.3. Asymptotically Exponential Statistical Experiment
§4.4. Minimax Rate of Convergence
§4.5. Sharp Lower Bound
Exercises

Chapter 5. Change-Point Problem
§5.1. Model of Normal Observations
§5.2. Maximum Likelihood Estimator of Change Point
§5.3. Minimax Limiting Constant
§5.4. Model of Non-Gaussian Observations
§5.5. Proofs of Lemmas
Exercises

Chapter 6. Sequential Estimators
§6.1. The Markov Stopping Time
§6.2. Change-Point Problem. Rate of Detection
§6.3. Minimax Limit in the Detection Problem
§6.4. Sequential Estimation in the Autoregressive Model
Exercises

Chapter 7. Linear Parametric Regression
§7.1. Definitions and Notations
§7.2. Least-Squares Estimator
§7.3. Properties of the Least-Squares Estimator
§7.4. Asymptotic Analysis of the Least-Squares Estimator
Exercises

Part 2. Nonparametric Regression

Chapter 8. Estimation in Nonparametric Regression
§8.1. Setup and Notations
§8.2. Asymptotically Minimax Rate of Convergence. Definition
§8.3. Linear Estimator
§8.4. Smoothing Kernel Estimator
Exercises

Chapter 9. Local Polynomial Approximation of the Regression Function
§9.1. Preliminary Results and Definition
§9.2. Polynomial Approximation and Regularity of Design
§9.3. Asymptotically Minimax Lower Bound
§9.4. Proofs of Auxiliary Results
Exercises

Chapter 10. Estimation of Regression in Global Norms
§10.1. Regressogram
§10.2. Integral L2-Norm Risk for the Regressogram
§10.3. Estimation in the Sup-Norm
§10.4. Projection on Span-Space and Discrete MISE
§10.5. Orthogonal Series Regression Estimator
Exercises

Chapter 11. Estimation by Splines
§11.1. In Search of Smooth Approximation
§11.2. Standard B-splines
§11.3. Shifted B-splines and Power Splines
§11.4. Estimation of Regression by Splines
§11.5. Proofs of Technical Lemmas
Exercises

Chapter 12. Asymptotic Optimality in Global Norms
§12.1. Lower Bound in the Sup-Norm
§12.2. Bound in L2-Norm. Assouad's Lemma
§12.3. General Lower Bound
§12.4. Examples and Extensions
Exercises

Part 3. Estimation in Nonparametric Models

Chapter 13. Estimation of Functionals
§13.1. Linear Integral Functionals
§13.2. Non-Linear Functionals
Exercises

Chapter 14. Dimension and Structure in Nonparametric Regression
§14.1. Multiple Regression Model
§14.2. Additive Regression
§14.3. Single-Index Model
§14.4. Proofs of Technical Results
Exercises

Chapter 15. Adaptive Estimation
§15.1. Adaptive Rate at a Point. Lower Bound
§15.2. Adaptive Estimator in the Sup-Norm
§15.3. Adaptation in the Sequence Space
§15.4. Proofs of Lemmas
Exercises

Chapter 16. Testing of Nonparametric Hypotheses
§16.1. Basic Definitions
§16.2. Separation Rate in the Sup-Norm
§16.3. Sequence Space. Separation Rate in the L2-Norm
Exercises

Bibliography

Index of Notation

Index

Preface

This book is based on the lecture notes written for the advanced Ph.D. level statistics courses delivered by the first author at Wayne State University over the last decade. It has been easy to observe how the gap between applied (computational) and theoretical statistics deepens, and how it has become more difficult to direct and mentor graduate students in the field of mathematical statistics. The research monographs in this field are extremely difficult to use as textbooks. Even the best published lecture notes typically include the intensive material of original studies. On the other hand, the classical courses in statistics that cover the traditional parametric point and interval estimation methods and hypotheses testing are hardly sufficient for the teaching goals in modern mathematical statistics.

In this book, we tried to give a general overview of the key statistical topics, parametric and nonparametric, as a set of very special optimization problems. As a criterion for optimality of estimators we chose minimax risks, and we focused on asymptotically minimax rates of convergence for large samples. The selection of models presented in this book definitely follows our preferences, and many very important problems and examples are not included. The simplest models were deliberately selected for presentation, and we consciously concentrated on the detailed proofs of all propositions. We believe that mathematics students should be trained in proof-writing to be better prepared for applications in statistics.

This textbook can form a reasonable basis for a two-semester course in mathematical statistics. Every chapter is followed by a collection of exercises consisting partly of verifications of technical results and partly of important illustrative examples. In our opinion, a sufficient prerequisite is a standard course in advanced probability supported by undergraduate statistics and real analysis. We hope that students who successfully pass this course are prepared for reading original papers and monographs in minimax estimation theory and can be easily introduced to research studies in this field.

This book is organized into three parts. Part 1 comprises Chapters 1-7, which contain fundamental topics of local asymptotic normality as well as irregular statistical models, the change-point problem, and sequential estimation. For convenience of reference we also included a chapter on classical parametric linear regression with a concentration on the asymptotic properties of least-squares estimators. Part 2 (Chapters 8-12) focuses on estimation of nonparametric regression functions. We restrict the presentation to estimation at a point and in the quadratic and uniform norms, and consider deterministic as well as random designs. The last part of the book, Chapters 13-16, is devoted to special, more modern topics such as the influence of higher dimension and structure in nonparametric regression models, problems of adaptive estimation, and testing of nonparametric hypotheses. We present the ideas through simple examples with the equidistant design.

Most chapters are only weakly related to each other and may be covered in any order. Our suggestion for a two-semester course would be to cover the parametric part during the first semester, and the nonparametric part and selected topics in the second half of the course.

We are grateful to O. Lepskii for his advice and help with the presentation of Part 3.

The authors, October 2010


Part 1

Parametric Models


Chapter 1

The Fisher Efficiency

1.1. Statistical Experiment

A classical statistical experiment (X1, . . . , Xn; p(x, θ); θ ∈ Θ) is composed of the following three elements: (i) a set of independent observations X1, . . . , Xn, where n is the sample size, (ii) a family of probability densities p(x, θ) defined by a parameter θ, and (iii) a parameter set Θ of all possible values of θ.

Unless otherwise stated, we always assume that θ is one-dimensional, that is, Θ ⊆ R. For discrete distributions, p(x, θ) is the probability mass function. In this chapter we formulate results only for continuous distributions. Analogous results hold for discrete distributions if integration is replaced by summation. Some discrete distributions are used in examples.

Example 1.1. (a) If n independent observations X1, . . . , Xn have a normal distribution with an unknown mean θ and a known variance σ^2, that is, Xi ∼ N(θ, σ^2), then the density is

p(x, θ) = (2πσ^2)^{-1/2} exp{ −(x − θ)^2/(2σ^2) },  −∞ < x, θ < ∞,

and the parameter set is the whole real line Θ = R.

(b) If n independent observations have a normal distribution with a known mean μ and an unknown variance θ, that is, Xi ∼ N(μ, θ), then the density is

p(x, θ) = (2πθ)^{-1/2} exp{ −(x − μ)^2/(2θ) },  −∞ < x < ∞,  θ > 0,

and the parameter set is the positive half-axis Θ = { θ ∈ R : θ > 0 }. □


Example 1.2. Suppose n independent observations X1, . . . , Xn come from a distribution with density

p(x, θ) = p0(x − θ),  −∞ < x, θ < ∞,

where p0 is a fixed probability density function. Here θ determines the shift of the distribution, and therefore is termed the location parameter. The location parameter model can be written as Xi = θ + εi, i = 1, . . . , n, where ε1, . . . , εn are independent random variables with a given density p0, and θ ∈ Θ = R. □

The independence of the observations implies that the joint density of the Xi's equals

p(x1, . . . , xn, θ) = ∏_{i=1}^n p(xi, θ).

We denote the respective expectation by Eθ[ · ] and variance by Varθ[ · ]. In a statistical experiment, all observations are obtained under the same value of an unknown parameter θ. The goal of parametric statistical estimation is to assess the true value of θ from the observations X1, . . . , Xn. An arbitrary function of the observations, θn = θn(X1, . . . , Xn), is called an estimator (or a point estimator) of θ.

A random variable

l(Xi, θ) = ln p(Xi, θ)

is referred to as the log-likelihood function related to the observation Xi. The joint log-likelihood function of a sample of size n (or, simply, the log-likelihood function) is the sum

Ln(θ) = Ln(θ | X1, . . . , Xn) = ∑_{i=1}^n l(Xi, θ) = ∑_{i=1}^n ln p(Xi, θ).

In this notation, we emphasize the dependence of the log-likelihood function on the parameter θ, keeping in mind that it is actually a random function that depends on the entire set of observations X1, . . . , Xn.

The parameter θ may be evaluated by the method of maximum likelihood estimation. An estimator θ∗n is called the maximum likelihood estimator (MLE) if for any θ ∈ Θ the following inequality holds:

Ln(θ∗n) ≥ Ln(θ).

If the log-likelihood function attains its unique maximum, then the MLE reduces to

θ∗n = argmax_{θ∈Θ} Ln(θ).


If the function Ln is differentiable at its attainable maximum, then θ∗n is a solution of the equation

∂Ln(θ)/∂θ = 0.

Note that if the maximum is not unique, this equation has multiple solutions.

The function

bn(θ) = bn(θ, θn) = Eθ[θn] − θ = Eθ[θn(X1, . . . , Xn)] − θ

is called the bias of θn. An estimator θn(X1, . . . , Xn) is called an unbiased estimator of θ if its bias equals zero, or equivalently, Eθ[θn] = θ for all θ ∈ Θ.

Example 1.3. Assume that the underlying distribution of the random sample X1, . . . , Xn is Poisson with mean θ. The probability mass function is given by

p(x, θ) = (θ^x / x!) e^{−θ},  θ > 0,  x ∈ {0, 1, 2, . . . }.

Then the log-likelihood function has the form

Ln(θ) = ∑_{i=1}^n Xi ln θ − nθ − ∑_{i=1}^n ln(Xi!).

Setting the derivative equal to zero yields the solution θ∗n = Xn, where

Xn = (X1 + · · · + Xn)/n

denotes the sample mean. In this example, the MLE is unbiased since Eθ[θ∗n] = Eθ[Xn] = Eθ[X1] = θ. □
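The closed-form answer is easy to check numerically. The following sketch (not part of the book; it assumes NumPy is available) simulates a Poisson sample, maximizes the log-likelihood Ln(θ) over a grid, and compares the maximizer with the sample mean Xn.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, n = 3.0, 500
x = rng.poisson(theta_true, size=n)

# L_n(theta) = sum(X_i)*ln(theta) - n*theta - sum(ln(X_i!));
# the last term does not depend on theta, so it is dropped here.
grid = np.linspace(0.1, 10.0, 10_000)
log_lik = x.sum() * np.log(grid) - n * grid

print("grid maximizer:", grid[np.argmax(log_lik)])  # close to the sample mean
print("sample mean   :", x.mean())                  # closed-form MLE
```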

Nonetheless, the unbiasedness of the MLE should not be taken for granted. Even for common densities, its expected value may not exist. Consider the next example.

Example 1.4. For the exponential distribution with the density

p(x, θ) = θ exp{ −θx },  x > 0,  θ > 0,

the MLE θ∗n = 1/Xn has the expected value Eθ[θ∗n] = nθ/(n − 1) (see Exercise 1.6). In particular, for n = 1, the expectation does not exist since ∫_0^∞ x^{−1} θ exp{ −θx } dx = ∞. □

In this example, however, an unbiased estimator may be found for n > 1. Indeed, the estimator (n − 1)θ∗n/n is unbiased. As the next example shows, an unbiased estimator may not exist at all.


Example 1.5. Let X be a Binomial(n, θ^2) observation, that is, a random number of successes in n independent Bernoulli trials with the probability of a success p = θ^2, 0 < θ < 1. An unbiased estimator of the parameter θ does not exist. In fact, if θ̂ = θ̂(X) were such an estimator, then its expectation would be an even polynomial of θ,

Eθ[θ̂(X)] = ∑_{k=0}^n θ̂(k) (n choose k) θ^{2k} (1 − θ^2)^{n−k},

which cannot be identically equal to θ. □

1.2. The Fisher Information

Introduce the Fisher score function as the derivative of the log-likelihood function with respect to θ,

l′(Xi, θ) = ∂ ln p(Xi, θ)/∂θ = (∂p(Xi, θ)/∂θ) / p(Xi, θ).

Note that the expected value of the Fisher score function is zero. Indeed,

Eθ[l′(Xi, θ)] = ∫_R ∂p(x, θ)/∂θ dx = ∂( ∫_R p(x, θ) dx )/∂θ = 0.

The total Fisher score function for a sample X1, . . . , Xn is defined as the sum of the score functions of the individual observations,

L′n(θ) = ∑_{i=1}^n l′(Xi, θ).

The Fisher information of one observation Xi is the variance of the Fisher score function l′(Xi, θ),

I(θ) = Varθ[l′(Xi, θ)] = Eθ[(l′(Xi, θ))^2] = Eθ[(∂ ln p(X, θ)/∂θ)^2]
     = ∫_R (∂ ln p(x, θ)/∂θ)^2 p(x, θ) dx = ∫_R (∂p(x, θ)/∂θ)^2 / p(x, θ) dx.

Remark 1.6. In the above definition of the Fisher information, the density appears in the denominator. Thus, it is problematic to calculate the Fisher information for distributions with densities that may be equal to zero for some values of x; even more so if the density vanishes, as a function of x, on sets that vary depending on the value of θ. A more general approach to the concept of information that overcomes this difficulty will be suggested in Section 4.2. □


The Fisher information for a statistical experiment of size n is the variance of the total Fisher score function,

In(θ) = Varθ[L′n(θ)] = Eθ[(L′n(θ))^2] = Eθ[(∂ ln p(X1, . . . , Xn, θ)/∂θ)^2]
      = ∫_{R^n} (∂p(x1, . . . , xn, θ)/∂θ)^2 / p(x1, . . . , xn, θ) dx1 . . . dxn.

Lemma 1.7. For independent observations, the Fisher information is additive. In particular, for any θ ∈ Θ, the equation In(θ) = n I(θ) holds.

Proof. As the variance of a sum of n independent random variables,

In(θ) = Varθ[L′n(θ)] = Varθ[l′(X1, θ) + · · · + l′(Xn, θ)] = n Varθ[l′(X1, θ)] = n I(θ). □

In view of this lemma, we use the following definition of the Fisher information for a random sample of size n:

In(θ) = n Eθ[(∂ ln p(X, θ)/∂θ)^2].

Another way of computing the Fisher information is presented in Exercise 1.1.
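As an illustration of this formula (a sketch outside the book, assuming NumPy), the Fisher information of a single Bernoulli(θ) observation, I(θ) = 1/[θ(1 − θ)] (see Exercise 1.3), can be recovered by averaging the squared score over simulated data; multiplying by n gives In(θ).

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.3, 50, 500_000

x = rng.binomial(1, theta, size=reps)
# Score of one observation: d/dtheta ln[ theta^x (1-theta)^(1-x) ]
score = x / theta - (1 - x) / (1 - theta)

I_one = np.mean(score**2)                 # Monte Carlo estimate of I(theta)
print("estimated I(theta)       :", I_one)
print("exact 1/[theta(1-theta)] :", 1.0 / (theta * (1 - theta)))
print("I_n(theta) = n * I(theta):", n * I_one)
```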

1.3. The Cramer-Rao Lower Bound

A statistical experiment is called regular if its Fisher information is continuous, strictly positive, and bounded for all θ ∈ Θ. Next we present an inequality for the variance of any estimator of θ in a regular experiment. This inequality is termed the Cramer-Rao inequality, and the lower bound is known as the Cramer-Rao lower bound.

Theorem 1.8. Consider an estimator θn = θn(X1, . . . , Xn) of the parameter θ in a regular experiment. Suppose its bias bn(θ) = Eθ[θn] − θ is continuously differentiable, and let b′n(θ) denote the derivative of the bias. Then the variance of θn satisfies the inequality

(1.1)  Varθ[θn] ≥ (1 + b′n(θ))^2 / In(θ),  θ ∈ Θ.

Proof. By the definition of the bias, we have that

θ + bn(θ) = Eθ[θn] = ∫_{R^n} θn(x1, . . . , xn) p(x1, . . . , xn, θ) dx1 . . . dxn.


In the regular case, differentiation and integration are interchangeable, hence differentiating in θ, we get the equation

1 + b′n(θ) = ∫_{R^n} θn(x1, . . . , xn) [∂p(x1, . . . , xn, θ)/∂θ] dx1 . . . dxn
           = ∫_{R^n} θn(x1, . . . , xn) ( (∂p(x1, . . . , xn, θ)/∂θ) / p(x1, . . . , xn, θ) ) p(x1, . . . , xn, θ) dx1 . . . dxn
           = Eθ[θn L′n(θ)] = Covθ[θn, L′n(θ)],

where we use the fact that Eθ[L′n(θ)] = 0. The correlation coefficient ρn of θn and L′n(θ) does not exceed 1 in absolute value, so that

1 ≥ ρn^2 = (Covθ[θn, L′n(θ)])^2 / ( Varθ[θn] Varθ[L′n(θ)] ) = (1 + b′n(θ))^2 / ( Varθ[θn] In(θ) ). □

1.4. Efficiency of Estimators

An immediate consequence of Theorem 1.8 is the following formula for unbiased estimators.

Corollary 1.9. For an unbiased estimator θn, the Cramer-Rao inequality (1.1) takes the form

(1.2)  Varθ[θn] ≥ 1/In(θ),  θ ∈ Θ. □

An unbiased estimator θ∗n = θ∗n(X1, . . . , Xn) in a regular statistical experiment is called Fisher efficient (or, simply, efficient) if, for any θ ∈ Θ, the variance of θ∗n reaches the Cramer-Rao lower bound, that is, the equality in (1.2) holds:

Varθ[θ∗n] = 1/In(θ),  θ ∈ Θ.

Example 1.10. Suppose, as in Example 1.1(a), the observations X1, . . . , Xn are independent N(θ, σ^2) where σ^2 is assumed known. We show that the sample mean Xn = (X1 + · · · + Xn)/n is an efficient estimator of θ. Indeed, Xn is unbiased and Varθ[Xn] = σ^2/n. On the other hand,

ln p(X, θ) = −(1/2) ln(2πσ^2) − (X − θ)^2/(2σ^2)

and

l′(X, θ) = ∂ ln p(X, θ)/∂θ = (X − θ)/σ^2.

Thus, the Fisher information for the statistical experiment is

In(θ) = n Eθ[(l′(X, θ))^2] = (n/σ^4) Eθ[(X − θ)^2] = nσ^2/σ^4 = n/σ^2.


Therefore, for any value of θ, the variance of Xn achieves the Cramer-Rao lower bound 1/In(θ) = σ^2/n. □
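The efficiency claim can be confirmed by a short simulation (a sketch, not from the book, assuming NumPy): the sampling variance of Xn should match the Cramer-Rao bound σ^2/n.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n, reps = 1.5, 2.0, 40, 200_000

xbar = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)

print("Var[Xbar_n]          ~", xbar.var())
print("Cramer-Rao sigma^2/n  =", sigma**2 / n)  # the two numbers should agree
```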

The concept of Fisher efficiency seems nice and powerful. Indeed, besides being unbiased, an efficient estimator has the minimum possible variance uniformly in θ ∈ Θ. Another feature is that it applies to any sample size n. Unfortunately, this concept is extremely restrictive: it works only in a limited number of models. The main pitfalls of the Fisher efficiency are discussed in the next chapter.

Exercises

Exercise 1.1. Show that the Fisher information can be computed by the formula

In(θ) = −n Eθ[ ∂^2 ln p(X, θ)/∂θ^2 ].

Hint: Make use of the representation (show!)

(∂ ln p(x, θ)/∂θ)^2 p(x, θ) = ∂^2 p(x, θ)/∂θ^2 − (∂^2 ln p(x, θ)/∂θ^2) p(x, θ).

Exercise 1.2. Let X1, . . . , Xn be independent observations with the N(μ, θ) distribution, where μ has a known value (refer to Example 1.1(b)). Prove that

θ∗n = (1/n) ∑_{i=1}^n (Xi − μ)^2

is an efficient estimator of θ. Hint: Use Exercise 1.1 to show that In(θ) = n/(2θ^2). When computing the variance of θ∗n, first notice that the variable ∑_{i=1}^n (Xi − μ)^2/θ has a chi-squared distribution with n degrees of freedom, and, thus, its variance equals 2n.

Exercise 1.3. Suppose that independent observations X1, . . . , Xn have a Bernoulli distribution with the probability mass function

p(x, θ) = θ^x (1 − θ)^{1−x},  x ∈ {0, 1},  0 < θ < 1.

Show that the Fisher information is of the form

In(θ) = n / ( θ(1 − θ) ),

and verify that the estimator θ∗n = Xn is efficient.


Exercise 1.4. Assume that X1, . . . , Xn are independent observations from a Poisson distribution with the probability mass function

p(x, θ) = (θ^x / x!) e^{−θ},  x ∈ {0, 1, . . . },  θ > 0.

Prove that the Fisher information in this case is In(θ) = n/θ, and show that Xn is an efficient estimator of θ.

Exercise 1.5. Let X1, . . . , Xn be a random sample from an exponential distribution with the density

p(x, θ) = (1/θ) e^{−x/θ},  x > 0,  θ > 0.

Verify that In(θ) = n/θ^2, and prove that Xn is efficient.

Exercise 1.6. Show that in the exponential model with the density p(x, θ) = θ exp{−θx}, x, θ > 0, the MLE θ∗n = 1/Xn has the expected value Eθ[θ∗n] = nθ/(n − 1). What is the variance of this estimator?

Exercise 1.7. Show that for the location parameter model with the density p(x, θ) = p0(x − θ), introduced in Example 1.2, the Fisher information is a constant if it exists.

Exercise 1.8. In Exercise 1.7, find the values of α for which the Fisher information exists if p0(x) = C cos^α x for −π/2 < x < π/2, and p0(x) = 0 otherwise, where C = C(α) is the normalizing constant. Note that p0 is a probability density if α > −1.


Chapter 2

The Bayes and Minimax Estimators

2.1. Pitfalls of the Fisher Efficiency

Fisher efficient estimators defined in the previous chapter possess two major unattractive properties, which prevent the Fisher efficiency from being widely used in statistical theory. First, Fisher efficient estimators rarely exist, and second, they need to be unbiased. In effect, the Fisher efficiency does not provide an answer to how to compare biased estimators with different bias functions. A lesser issue is that the comparison of estimators is based on their variances alone.

Before we proceed to an illustrative example, we need several notions defined below. A function w(u), u ∈ R, is called a loss function if: (i) w(0) = 0, (ii) it is symmetric, w(u) = w(−u), (iii) it is non-decreasing for u > 0, and (iv) it is not identically equal to zero. Besides, we require that w is bounded from above by a power function, that is, (v) w(u) ≤ k(1 + |u|^a) for all u with some constants k > 0 and a > 0.

The loss function w(θn − θ) measures the deviation of the estimator θn = θn(X1, . . . , Xn) from the true parameter θ. In this book, we do not go far beyond: (i) the quadratic loss function, w(u) = u^2, (ii) the absolute loss function, w(u) = |u|, or (iii) the bounded loss function, w(u) = I(|u| > c) with a given positive c, where I(·) denotes the indicator function.

The normalized risk function (or simply, the normalized risk) Rn(θ, θn, w) is the expected value of the loss function w evaluated at √In(θ) (θn − θ), that is,

Rn(θ, θn, w) = Eθ[ w(√In(θ) (θn − θ)) ]
             = ∫_{R^n} w(√In(θ) (θn(x1, . . . , xn) − θ)) p(x1, . . . , xn, θ) dx1 . . . dxn.

Example 2.1. For the quadratic loss function w(u) = u^2, the normalized risk (commonly termed the normalized quadratic risk) of an estimator θn can be found as

Rn(θ, θn, u^2) = Eθ[ In(θ)(θn − θ)^2 ] = In(θ) Eθ[ (θn − Eθ[θn] + Eθ[θn] − θ)^2 ]

(2.1)          = In(θ) [ Varθ[θn] + bn^2(θ, θn) ],

where bn(θ, θn) = Eθ[θn] − θ is the bias of θn. □

By (2.1), for any unbiased estimator θn, the normalized quadratic risk function has the representation Rn(θ, θn, u^2) = In(θ) Varθ[θn]. The Cramer-Rao inequality (1.2) can thus be written as

(2.2)  Rn(θ, θn, u^2) = Eθ[ In(θ)(θn − θ)^2 ] ≥ 1,  θ ∈ Θ,

with the equality attained for a Fisher efficient estimator θ∗n,

(2.3)  Rn(θ, θ∗n, u^2) = Eθ[ In(θ)(θ∗n − θ)^2 ] = 1,  θ ∈ Θ.

Next, we present an example of a biased estimator that in a certain interval performs more efficiently than the Fisher efficient unbiased estimator, if we define a more efficient estimator as the one with the smaller normalized quadratic risk.

Example 2.2. Let X1, . . . , Xn be independent observations from the N(θ, σ^2) distribution, where σ^2 is known. Consider two estimators: (i) θ∗n = Xn, which is efficient by Example 1.10, and (ii) a constant-value estimator θ̃ = θ0, where θ0 is a fixed point. The normalized quadratic risk of θ∗n equals unity by (2.3), while that of θ̃ is

Rn(θ, θ̃, u^2) = Eθ[ In(θ)(θ̃ − θ)^2 ] = (n/σ^2) (θ0 − θ)^2.

Note that θ̃ is a biased estimator with the bias bn(θ) = θ0 − θ.

It is impossible to determine which of the two normalized quadratic risks is smaller (refer to Figure 1). If θ is within θ0 ± σ/√n, then θ̃ is more efficient, whereas for all other values of θ, θ∗n is the more efficient estimator. □

Figure 1. The normalized quadratic risk functions in Example 2.2: the risk of Xn is constant and equal to 1, while the risk of θ̃, equal to n(θ0 − θ)^2/σ^2, lies below 1 only in the interval (θ0 − σ/√n, θ0 + σ/√n), where θ̃ is more efficient.

This example illustrates the difficulty in comparing the normalized risks of two estimators as functions of θ ∈ Θ. To overcome it, we could try to represent each risk function by a single positive number. In statistics, there are two major ways to implement this idea. One approach is to integrate the normalized risk over the parameter set Θ, whereas the other one is to take the maximum value of the normalized risk function over Θ. These are called the Bayes and the minimax approaches, respectively. They are explored in the next three sections.

2.2. The Bayes Estimator

In what follows, we study only regular statistical models, which by definition have a strictly positive, continuous Fisher information.

Assume that there is a probability density π(θ) defined on the parameter set Θ. The density π(θ) is called a prior density of θ. It reflects the judgement of how likely the values of θ are before the data are obtained. The Bayes risk of θn is the integrated value of the normalized risk function,

(2.4)  βn(θn, w, π) = ∫_Θ Rn(θ, θn, w) π(θ) dθ.

An estimator tn = tn(X1, . . . , Xn) is called the Bayes estimator of θ if for any other estimator θn, the following inequality holds:

βn(tn, w, π) ≤ βn(θn, w, π).


In other words, the Bayes estimator minimizes the Bayes risk. Loosely speaking, we can understand the Bayes estimator as a solution of the minimization problem

tn = argmin_{θn} βn(θn, w, π),

though we should keep in mind that the minimum may not exist or may be non-unique.

In the case of the quadratic loss w(u) = u^2, the Bayes estimator can be computed explicitly. Define the posterior density of θ as the conditional density, given the observations X1, . . . , Xn; that is,

f(θ | X1, . . . , Xn) = Cn p(X1, . . . , Xn, θ) π(θ),  θ ∈ Θ,

where Cn = Cn(X1, . . . , Xn) is the normalizing constant. Assuming that

∫_Θ In(θ) f(θ | X1, . . . , Xn) dθ < ∞,

we can introduce the weighted posterior density as

f̃(θ | X1, . . . , Xn) = C̃n In(θ) f(θ | X1, . . . , Xn),  θ ∈ Θ,

with the normalizing constant C̃n = [ ∫_Θ In(θ) f(θ | X1, . . . , Xn) dθ ]^{−1}, which is finite under our assumption.

Theorem 2.3. If w(u) = u^2, then the Bayes estimator tn is the weighted posterior mean

tn = tn(X1, . . . , Xn) = ∫_Θ θ f̃(θ | X1, . . . , Xn) dθ.

In particular, if the Fisher information is a constant independent of θ, then the Bayes estimator is the non-weighted posterior mean,

tn = tn(X1, . . . , Xn) = ∫_Θ θ f(θ | X1, . . . , Xn) dθ.

Proof. The Bayes risk of an estimator θn with respect to the quadratic loss can be written in the form

βn(θn, π) = ∫_Θ ∫_{R^n} In(θ) (θn − θ)^2 p(x1, . . . , xn, θ) π(θ) dx1 . . . dxn dθ
          = ∫_{R^n} [ ∫_Θ (θn − θ)^2 f̃(θ | x1, . . . , xn) dθ ] C̃n^{−1}(x1, . . . , xn) dx1 . . . dxn.

Thus, the minimization of the Bayes risk is tantamount to the minimization of the integral

∫_Θ (θn − θ)^2 f̃(θ | x1, . . . , xn) dθ


with respect to θn for any fixed values x1, . . . , xn. Equating to zero the derivative of this integral with respect to θn produces a linear equation satisfied by the Bayes estimator tn,

∫_Θ (tn − θ) f̃(θ | x1, . . . , xn) dθ = 0.

Recalling that ∫_Θ f̃(θ | x1, . . . , xn) dθ = 1, we obtain the result,

tn = ∫_Θ θ f̃(θ | x1, . . . , xn) dθ. □

In many examples, the weighted posterior mean tn is easily computable if we choose a prior density π(θ) from a conjugate family of distributions. A conjugate prior distribution π(θ) is such that the posterior distribution belongs to the same family of distributions for any sample X1, . . . , Xn. If the posterior distribution allows a closed-form expression for expectations, then tn can be found without integration. The following example illustrates the idea.

Example 2.4. Consider independent Bernoulli observations X1, . . . , Xn with the probability mass function

p(x, θ) = θ^x (1 − θ)^{1−x},  x ∈ {0, 1},  0 < θ < 1,

where θ is assumed to be a random variable. The joint distribution of the sample is

p(X1, . . . , Xn, θ) = θ^{∑Xi} (1 − θ)^{n − ∑Xi}.

As a function of θ, it has the algebraic form of a beta distribution. Thus, we select a beta density as a prior density,

π(θ) = C(α, β) θ^{α−1} (1 − θ)^{β−1},  0 < θ < 1,

where α and β are positive parameters, and C(α, β) is the normalizing constant. The posterior density is then also a beta density,

f(θ | X1, . . . , Xn) = C θ^{∑Xi + α − 1} (1 − θ)^{n − ∑Xi + β − 1},  0 < θ < 1,

with the appropriate normalizing constant C. By Exercise 1.3, the Fisher information is equal to In(θ) = n/[θ(1 − θ)]. Thus, the weighted posterior density is a beta density as well,

f̃(θ | X1, . . . , Xn) = C̃n θ^{∑Xi + α − 2} (1 − θ)^{n − ∑Xi + β − 2},  0 < θ < 1,

where α > 1 and β > 1. The weighted posterior mean is therefore equal to

tn = (∑Xi + α − 1) / ( ∑Xi + α − 1 + n − ∑Xi + β − 1 ) = (∑Xi + α − 1) / (n + α + β − 2).

More examples of conjugate families are given in the exercises.
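A small numerical check of Example 2.4 (a sketch, not from the book; NumPy assumed) computes the weighted posterior mean by direct integration of θ f̃(θ | X1, . . . , Xn) and compares it with the closed form (∑Xi + α − 1)/(n + α + β − 2).

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta, theta_true, n = 3.0, 2.0, 0.4, 30
x = rng.binomial(1, theta_true, size=n)
s = x.sum()

# Weighted posterior density is proportional to theta^(s+alpha-2)*(1-theta)^(n-s+beta-2).
grid = np.linspace(1e-6, 1 - 1e-6, 200_001)
w_post = grid ** (s + alpha - 2) * (1 - grid) ** (n - s + beta - 2)
w_post /= np.trapz(w_post, grid)                      # normalize numerically

print("numerical weighted posterior mean:", np.trapz(grid * w_post, grid))
print("closed form                      :", (s + alpha - 1) / (n + alpha + beta - 2))
```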


2.3. Minimax Estimator. Connection Between Estimators

Define the maximum normalized risk of an estimator θn = θn(X1, . . . , Xn) with respect to a loss function w by

rn(θn, w) = sup_{θ∈Θ} Rn(θ, θn, w) = sup_{θ∈Θ} Eθ[ w(√In(θ) (θn − θ)) ].

An estimator θ∗n = θ∗n(X1, . . . , Xn) is called minimax if its maximum normalized risk does not exceed that of any other estimator θn. That is, for any estimator θn,

rn(θ∗n, w) ≤ rn(θn, w).

The maximum normalized risk of a minimax estimator, rn(θ∗n, w), is called the minimax risk.

In contrast with the Bayes estimator, the minimax estimator represents a different concept of statistical optimality. The Bayes estimator is optimal in the averaged (integrated) sense, whereas the minimax one takes into account the “worst-case scenario”.

It follows from the above definition that a minimax estimator θ∗n solves the optimization problem

sup_{θ∈Θ} Eθ[ w(√In(θ) (θn − θ)) ] → inf_{θn}.

Finding the infimum over all possible estimators θn = θn(X1, . . . , Xn), that is, over all functions of the observations X1, . . . , Xn, is not an easily tackled task. Even for the most common distributions, such as the normal or binomial, direct minimization is a hopeless endeavor. This calls for an alternative route to finding minimax estimators.

In this section we establish a connection between the Bayes and minimax estimators that will lead to some advances in computing the latter. The following theorem shows that if the Bayes estimator has a constant risk, then it is also minimax.

Theorem 2.5. Let tn = tn(X1, . . . , Xn) be a Bayes estimator with respect to a loss function w. Suppose that the normalized risk function of the Bayes estimator is a constant for all θ ∈ Θ, that is,

Rn(θ, tn, w) = Eθ[ w(√In(θ) (tn − θ)) ] = c

for some c > 0. Then tn is also a minimax estimator.

Proof. Notice that since the risk function of tn is a constant, the Bayes and maximum normalized risks of tn are the same constant. Indeed, letting π(θ) denote the corresponding prior density, we write

βn(tn, w, π) = ∫_Θ Rn(θ, tn, w) π(θ) dθ = c ∫_Θ π(θ) dθ = c

and

rn(tn, w) = sup_{θ∈Θ} Rn(θ, tn, w) = sup_{θ∈Θ} c = c.

Further, for any estimator θn,

rn(θn, w) = sup_{θ∈Θ} Rn(θ, θn, w) ≥ ∫_Θ Rn(θ, θn, w) π(θ) dθ
          = βn(θn, w, π) ≥ βn(tn, w, π) = c = rn(tn, w). □

Unfortunately, Theorem 2.5 does not provide a recipe for choosing a prior density for which the normalized risk function is constant on Θ. Moreover, constant-risk priors rarely exist. Below we give two examples in which we try to explain why this happens.

Example 2.6. Consider independent Bernoulli observations X1, . . . , Xn with parameter θ. As shown in Example 2.4, the weighted posterior mean of θ is

tn = (∑Xi + α − 1) / (n + α + β − 2).

If we now select α = β = 1, then tn becomes the sample mean Xn. From Exercise 1.3 we know that Xn is an efficient estimator of θ, and therefore its normalized quadratic risk is equal to 1, a constant. However, α = β = 1 is not a legitimate choice in this instance, because the weighted posterior density

f̃(θ | X1, . . . , Xn) = C̃n θ^{∑Xi − 1} (1 − θ)^{n − ∑Xi − 1}

does not exist when ∑Xi = 0. Indeed, θ^{−1}(1 − θ)^{n−1} is not integrable at zero, and therefore the normalizing constant C̃n does not exist. □

Example 2.7. Let X1, . . . , Xn be independent observations from the N(θ, 1) distribution. If we choose the prior density of θ to be N(0, b^2) for some positive real b, then, by Exercise 2.10, the weighted posterior distribution is also normal,

N( n b^2 Xn / (n b^2 + 1),  b^2 / (n b^2 + 1) ).

Here the weighted posterior mean tn = n b^2 Xn / (n b^2 + 1) is the Bayes estimator with respect to the quadratic loss function. If we let b → ∞, then tn equals Xn, which is Fisher efficient (see Example 1.10) and thus has a constant normalized quadratic risk. The flaw in this argument is that no normal prior density exists with infinite b. □


2.4. Limit of the Bayes Estimator and Minimaxity

Assume that we can find a family of prior distributions with densities πb(θ) indexed by a positive real number b. If the Bayes risks of the respective Bayes estimators have a limit as b goes to infinity, then this limit guarantees a minimax lower bound. A rigorous statement is presented in the following theorem.

Theorem 2.8. Let πb(θ) be a family of prior densities on Θ that depend on a positive real parameter b, and let tn(b) = tn(X1, . . . , Xn, b) be the respective Bayes estimators for a loss function w. Suppose that the Bayes risk βn(tn(b), w, πb) has a limit,

lim_{b→∞} βn(tn(b), w, πb) = c > 0.

Then the minimax lower bound holds for any n,

inf_{θn} rn(θn, w) = inf_{θn} sup_{θ∈Θ} Eθ[ w(√In(θ) (θn − θ)) ] ≥ c.

Proof. As in the proof of Theorem 2.5, for any estimator θn we can write

rn(θn, w) = sup_{θ∈Θ} Rn(θ, θn, w) ≥ ∫_Θ Rn(θ, θn, w) πb(θ) dθ = βn(θn, w, πb) ≥ βn(tn(b), w, πb).

Now take the limit as b → ∞. Since the left-hand side is independent of b, the theorem follows. □

Example 2.9. Let X1, . . . , Xn be independent N(θ, 1) observations. We will show that the conditions of Theorem 2.8 are satisfied under the quadratic loss function w(u) = u^2, and therefore the lower bound for the corresponding minimax risk holds:

inf_{θn} rn(θn, w) = inf_{θn} sup_{θ∈R} Eθ[ (√n (θn − θ))^2 ] ≥ 1.

As shown in Example 2.7, for a N(0, b^2) prior density, the weighted posterior mean tn(b) = n b^2 Xn / (n b^2 + 1) is the Bayes estimator with respect to the quadratic loss function. Now we will compute its Bayes risk. This estimator has the variance

Varθ[tn(b)] = n^2 b^4 Varθ[Xn] / (n b^2 + 1)^2 = n b^4 / (n b^2 + 1)^2

and the bias

bn(θ, tn(b)) = Eθ[tn(b)] − θ = n b^2 θ / (n b^2 + 1) − θ = −θ / (n b^2 + 1).


Therefore, its normalized quadratic risk is expressed as

Rn(θ, tn(b), w) = Eθ[ (√n (tn(b) − θ))^2 ] = n [ Varθ[tn(b)] + bn^2(θ, tn(b)) ]
                = n^2 b^4 / (n b^2 + 1)^2 + n θ^2 / (n b^2 + 1)^2.

With the remark that ∫_R θ^2 πb(θ) dθ = b^2, the Bayes risk of tn(b) equals

βn(tn(b), w, πb) = ∫_R [ n^2 b^4 / (n b^2 + 1)^2 + n θ^2 / (n b^2 + 1)^2 ] πb(θ) dθ
                 = n^2 b^4 / (n b^2 + 1)^2 + n b^2 / (n b^2 + 1)^2 → 1 as b → ∞.

Applying Theorem 2.8, we obtain the result with c = 1. Taking a step further, note that the minimax lower bound is attained by the estimator Xn, which is thus minimax. Indeed, Eθ[ (√n (Xn − θ))^2 ] = 1. □
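The limit in this example can be traced numerically (a sketch, not from the book; NumPy assumed): the exact Bayes risk n^2 b^4/(n b^2 + 1)^2 + n b^2/(n b^2 + 1)^2 approaches 1 as b grows, while a Monte Carlo estimate of the normalized risk of Xn itself equals 1 for any θ.

```python
import numpy as np

n = 20
for b in [0.5, 1.0, 2.0, 5.0, 20.0]:
    bayes_risk = (n**2 * b**4 + n * b**2) / (n * b**2 + 1) ** 2
    print(f"b = {b:5.1f}   Bayes risk of t_n(b) = {bayes_risk:.4f}")   # -> 1

# Normalized quadratic risk of the sample mean, E_theta[(sqrt(n)(Xbar - theta))^2].
rng = np.random.default_rng(6)
theta, reps = 0.7, 200_000
xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
print("risk of Xbar ~", (n * (xbar - theta) ** 2).mean())              # ~ 1
```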

In subsequent chapters we present additional useful applications of Theorem 2.8.

Exercises

Exercise 2.9. Suppose the random observations X1, . . . , Xn come from a Poisson distribution with the probability mass function

p(x, θ) = (θ^x / x!) e^{−θ},  x ∈ {0, 1, . . . },

where θ is a random variable. Show that the conjugate prior density of θ is a gamma density, π(θ) = C(α, β) θ^{α−1} e^{−βθ}, θ > 0, for some positive parameters α and β and a normalizing constant C(α, β). Find the weighted posterior mean of θ.

Exercise 2.10. Consider a set of independent observations X1, . . . , Xn ∼ N(θ, σ^2), where θ is assumed random with the prior density N(μ, σθ^2). Show that the weighted posterior distribution of θ is also normal with mean (n σθ^2 Xn + μ σ^2)/(n σθ^2 + σ^2) and variance σ^2 σθ^2/(n σθ^2 + σ^2). Note that the family of normal distributions is self-conjugate.

Exercise 2.11. Find a conjugate distribution and the corresponding Bayes estimator for the parameter θ in the exponential model with p(x, θ) = θ exp{−θx}, x, θ > 0.


Exercise 2.12. Consider n independent Bernoulli observations X1, . . . , Xn with p(x, θ) = θ^x (1 − θ)^{1−x}, x ∈ {0, 1}, and Θ = (0, 1). Define the estimator

θ∗n = ( ∑Xi + √n/2 ) / ( n + √n ).

(i) Verify that θ∗n is the non-weighted posterior mean with respect to the conjugate prior density π(θ) = C [θ(1 − θ)]^{√n/2 − 1}, 0 < θ < 1.

(ii) Show that the non-normalized quadratic risk of θ∗n (with the factor √In(θ) omitted) is equal to

Eθ[ (θ∗n − θ)^2 ] = 1 / ( 4(1 + √n)^2 ).

(iii) Verify that Theorem 2.5 is valid for a non-normalized risk function, and argue that θ∗n is minimax in the appropriate sense.

Exercise 2.13. Refer to the Bernoulli model in Example 2.4. Show that the prior beta distribution with α = β = 1 + b^{−1} defines a weighted posterior mean tn(b) that is minimax in the limit as b → ∞.


Chapter 3

Asymptotic Minimaxity

In this chapter we study the asymptotic minimaxity of estimators as the sample size n increases.

3.1. The Hodges Example

An estimator θn is called asymptotically unbiased if it satisfies the limiting condition

lim_{n→∞} Eθ[θn] = θ,  θ ∈ Θ.

In many cases when an unbiased estimator of θ does not exist, an asymptotically unbiased estimator is easy to construct.

Example 3.1. In Example 1.4, the MLE θ∗n = 1/Xn, though biased for any n > 1, is asymptotically unbiased. Indeed,

lim_{n→∞} Eθ[θ∗n] = lim_{n→∞} nθ/(n − 1) = θ. □

Example 3.2. In Example 1.5, there is no unbiased estimator. The estimator θn = √(X/n), however, is asymptotically unbiased (see Exercise 3.14). □

In the previous chapter, we explained why the Fisher approach fails as a criterion for finding the most efficient estimators. Now we are planning to undertake another desperate, though futile, task of rescuing the concept of Fisher efficiency at least in an asymptotic form. The question is: Can we define a sequence of asymptotically Fisher efficient estimators θ∗n = θ∗n(X1, . . . , Xn) by requiring that they (i) are asymptotically unbiased and (ii) satisfy the equation (compare to (2.3))

(3.1)  lim_{n→∞} Eθ[ In(θ)(θ∗n − θ)^2 ] = 1,  θ ∈ Θ ?

The answer to this question would be positive if for any sequence of asymptotically unbiased estimators θn, the following analogue of the Cramer-Rao lower bound (2.2) were true:

(3.2)  lim_{n→∞} Eθ[ In(θ)(θn − θ)^2 ] ≥ 1,  θ ∈ Θ.

Indeed, if (3.2) held, then an estimator that satisfies (3.1) would be asymptotically the most efficient one. However, it turns out that this inequality is not valid even for N(θ, 1) observations. The famous Hodges example is presented below.

Example 3.3. Consider independent observations X1, . . . , Xn from the N(θ, 1) distribution, θ ∈ R. Define the sequence of estimators

(3.3)  θn = Xn if |Xn| ≥ n^{−1/4},  and  θn = 0 otherwise.

Note that in this example, In(θ) = n. It can be shown (see Exercise 3.15) that this sequence is asymptotically unbiased, and that the following equalities hold:

(3.4)  lim_{n→∞} Eθ[ n(θn − θ)^2 ] = 1 if θ ≠ 0,  and  lim_{n→∞} Eθ[ n θn^2 ] = 0 if θ = 0.

Thus, the sequence θn is asymptotically more efficient than any asymptotically Fisher efficient estimator defined by (3.1). In particular, it is better than the sample mean Xn. Sometimes the Hodges estimator is called superefficient, and the point at which the Cramer-Rao lower bound is violated, θ = 0, is termed the superefficient point. □
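The superefficiency in (3.4) shows up clearly in simulation. The following sketch (not from the book; NumPy assumed) estimates the normalized risk n Eθ[(θn − θ)^2] of the Hodges estimator for θ = 0 and for a fixed θ ≠ 0 as n grows.

```python
import numpy as np

rng = np.random.default_rng(7)
reps = 200_000

def hodges_risk(theta, n):
    xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
    est = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)   # estimator (3.3)
    return n * ((est - theta) ** 2).mean()

for n in [25, 100, 400, 1600]:
    print(f"n = {n:5d}   risk at theta=0: {hodges_risk(0.0, n):.4f}   "
          f"risk at theta=1: {hodges_risk(1.0, n):.4f}")
# The risk at theta = 0 tends to 0, while at theta = 1 it stays near 1.
```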

The above example explains why the asymptotic theory of parameter estimation should be based on methods other than pointwise asymptotic Fisher efficiency. We start introducing these methods in the next section.

3.2. Asymptotic Minimax Lower Bound

Recall from Section 2.3 that a minimax estimator corresponding to the quadratic loss function solves the minimization problem

sup_{θ∈Θ} [ In(θ) ∫_{R^n} (θn(x1, . . . , xn) − θ)^2 p(x1, . . . , xn, θ) dx1 . . . dxn ] → inf_{θn}.


The minimization is carried out over all arbitrary functions θn = θn(x1, . . . , xn). As discussed earlier, this problem is impenetrable from the point of view of the standard analytic methods of calculus. In this section we will learn a bypassing approach based on the asymptotic minimax lower bound. Consider the maximum normalized risk of an estimator θn with respect to the quadratic loss function

rn(θn, u^2) = sup_{θ∈Θ} Rn(θ, θn, u^2) = sup_{θ∈Θ} Eθ[ In(θ)(θn − θ)^2 ] = n sup_{θ∈Θ} I(θ) Eθ[ (θn − θ)^2 ].

Suppose we can show that for any estimator θn the inequality

(3.5)  lim inf_{n→∞} rn(θn, u^2) ≥ r_*

holds with a positive constant r_* independent of n. This inequality implies that for any estimator θn and for all large enough n, the maximum of the quadratic risk is bounded from below,

sup_{θ∈Θ} I(θ) Eθ[ (θn − θ)^2 ] ≥ (r_* − ε)/n

with arbitrarily small ε > 0. We call the inequality (3.5) the asymptotic minimax lower bound. If, in addition, we can find an estimator θ∗n which for all large n satisfies the upper bound

sup_{θ∈Θ} I(θ) Eθ[ (θ∗n − θ)^2 ] ≤ r^*/n

with a positive constant r^*, then for all large enough n, the minimax risk is sandwiched between two positive constants,

(3.6)  r_* ≤ inf_{θn} sup_{θ∈Θ} Eθ[ (√(n I(θ)) (θn − θ))^2 ] ≤ r^*.

In this special case of the quadratic loss function w(u) = u^2, we define the asymptotically minimax rate of convergence as 1/√n (or, equivalently, O(1/√n) as n → ∞). This is the fastest possible rate of decrease of θn − θ in the mean-squared sense as n → ∞. This rate cannot be improved by any estimator.

More generally, we call a deterministic sequence ψn the asymptotically minimax rate of convergence if for some positive constants r_* and r^*, and for all sufficiently large n, the following inequalities hold:

(3.7)  r_* ≤ inf_{θn} sup_{θ∈Θ} Eθ[ w( (θn − θ)/ψn ) ] ≤ r^* < ∞.

If r_* = r^*, these bounds are called asymptotically sharp.


In the following lemma we explain the idea of how the asymptotic minimax lower bound (3.5) may be proved. We consider only normally distributed observations, and leave some technical details out of the proof.

Lemma 3.4. Take independent observations X1, . . . , Xn ∼ N(θ, σ^2) where σ^2 is known. Let θ ∈ Θ where Θ is an open interval containing the origin θ = 0. Then for any estimator θn, the following inequality holds:

lim inf_{n→∞} rn(θn, u^2) = lim inf_{n→∞} (n/σ^2) sup_{θ∈Θ} Eθ[ (θn − θ)^2 ] ≥ r_* = 0.077.

Remark 3.5. Under the assumptions of Lemma 3.4, the maximum normalized risk rn(θn, u^2) admits the asymptotic upper bound r^* = 1, guaranteed by the sample mean estimator Xn. □

Proof of Lemma 3.4. Without loss of generality, we can assume that σ^2 = 1 (hence I(θ) = 1), and that Θ contains the points θ0 = 0 and θ1 = 1/√n. Introduce the log-likelihood ratio associated with these values of the parameter θ,

ΔLn = ΔLn(θ0, θ1) = Ln(θ1) − Ln(θ0) = ln [ p(X1, . . . , Xn, θ1) / p(X1, . . . , Xn, θ0) ]
    = ∑_{i=1}^n ln [ p(Xi, 1/√n) / p(Xi, 0) ]
    = ∑_{i=1}^n [ −(1/2)(Xi − 1/√n)^2 + (1/2) Xi^2 ]
    = (1/√n) ∑_{i=1}^n Xi − 1/2 = Z − 1/2,

where Z is a N(0, 1) random variable with respect to the distribution Pθ0.

where Z is a N (0, 1) random variable with respect to the distribution Pθ0 .

Further, by definition, for any random function f(X1, . . . , Xn) , and forany values θ0 and θ1, the basic likelihood ratio identity relating the twoexpectations holds:

Eθ1

[f(X1, . . . , Xn)

]= Eθ0

[f(X1, . . . , Xn)

p(X1, . . . , Xn , θ1)

p(X1, . . . , Xn , θ0)

]

(3.8) = Eθ0

[f(X1, . . . , Xn) exp

{ΔLn(θ0 , θ1)

}].

Next, for any fixed estimator θn, the supremum over R of the normalized risk function is not less than the average of the normalized risk over the two points θ0 and θ1. Thus, we obtain the chain of inequalities

n sup_{θ∈R} Eθ[ (θn − θ)^2 ] ≥ n max_{θ∈{θ0, θ1}} Eθ[ (θn − θ)^2 ]
  ≥ (n/2) { Eθ0[ (θn − θ0)^2 ] + Eθ1[ (θn − θ1)^2 ] }
  = (n/2) Eθ0[ (θn − θ0)^2 + (θn − θ1)^2 exp{ ΔLn(θ0, θ1) } ]   (by (3.8))
  ≥ (n/2) Eθ0[ ( (θn − θ0)^2 + (θn − θ1)^2 ) I( ΔLn(θ0, θ1) ≥ 0 ) ]
  ≥ (n/2) ( (θ1 − θ0)^2 / 2 ) Pθ0( ΔLn(θ0, θ1) ≥ 0 )
  = (n/4) (1/√n)^2 Pθ0( Z − 1/2 ≥ 0 ) = (1/4) Pθ0( Z ≥ 1/2 ).

In the above, if the log-likelihood ratio ΔLn(θ0, θ1) is non-negative, then its exponential exp{ΔLn(θ0, θ1)} is at least 1. At the last stage we used the elementary inequality

(x − θ0)^2 + (x − θ1)^2 ≥ (1/2)(θ1 − θ0)^2,  x ∈ R.

As shown previously, Z is a standard normal random variable with respect to the distribution Pθ0, therefore Pθ0( Z ≥ 1/2 ) = 0.3085. Finally, the maximum normalized risk is bounded from below by 0.3085/4 > 0.077. □

Remark 3.6. Note that computing the mean value of the normalized risk over two points is equivalent to finding the Bayes risk with respect to a prior distribution concentrated with equal probabilities at these points. Thus, in the above proof, we could have taken a Bayes prior concentrated not at two but at three or more points; then the lower bound constant r_* would be different from 0.077. □

The normal distribution of the observations in Lemma 3.4 is used only in the explicit formula for the log-likelihood ratio ΔLn(θ0, θ1). A generalization of this lemma to the case of a statistical experiment with an arbitrary distribution is stated in the theorem below. The proof of the theorem is analogous to that of the lemma, and therefore is left as an exercise (see Exercise 3.16).

Theorem 3.7. Assume that an experiment (X1, . . . , Xn; p(x, θ); Θ) is such that for some points θ0 and θ1 = θ0 + 1/√n in Θ, the log-likelihood ratio

ΔLn(θ0, θ1) = ∑_{i=1}^n ln [ p(Xi, θ1) / p(Xi, θ0) ]

satisfies the condition

Pθ0( ΔLn(θ0, θ1) ≥ z0 ) ≥ p0

with constants z0 and p0 independent of n. Assume that z0 ≤ 0. Then for any estimator θn, the following lower bound on the minimax risk holds:

lim inf_{n→∞} sup_{θ∈R} Eθ[ In(θ)(θn − θ)^2 ] ≥ (1/4) I_* p0 exp{z0},

where I_* = min[ I(θ0), I(θ1) ] > 0.


3.3. Sharp Lower Bound. Normal Observations

Lemma 3.4 leaves a significant gap between the lower and upper constants in (3.6). Indeed, r_* = 0.077, while r^* = 1 by Remark 3.5. It should not come as a surprise that in such a regular case as normal observations it can be shown that r_* = r^*. In this section, we prove the sharp lower bound with r_* = r^* = 1 for normal observations. To do this, we have to overcome the same technical difficulties, and we will need the same ideas, as in the case of more general observations discussed in the next section.

Theorem 3.8. Under the assumptions of Lemma 3.4, for any estimator θn, the following lower bound holds:

lim inf_{n→∞} (n/σ^2) sup_{θ∈Θ} Eθ[ (θn − θ)^2 ] ≥ 1.

Proof. As in the proof of Lemma 3.4, we can take σ^2 = 1. The idea of the proof is based on replacing the maximum normalized risk by the Bayes risk with the uniform prior distribution on an interval [−b/√n, b/√n], where b will be chosen later. Under the assumption on Θ, it contains this interval for all sufficiently large n. Proceeding as in the proof of Lemma 3.4, we obtain the inequalities

sup_{θ∈R} Eθ[ n(θn − θ)^2 ] ≥ (√n/(2b)) ∫_{−b/√n}^{b/√n} Eθ[ n(θn − θ)^2 ] dθ
  = (1/(2b)) ∫_{−b}^{b} E_{t/√n}[ (√n θn − t)^2 ] dt    (by the substitution t = √n θ)

(3.9)  = (1/(2b)) ∫_{−b}^{b} E0[ (√n θn − t)^2 exp{ ΔLn(0, t/√n) } ] dt.

Here the same trick is used as in the proof of Lemma 3.4, with the change of the distribution by means of the log-likelihood ratio, which in this case is equal to

ΔLn(0, t/√n) = Ln(t/√n) − Ln(0) = ∑_{i=1}^n [ −(1/2)(Xi − t/√n)^2 + (1/2) Xi^2 ]
             = (t/√n) ∑_{i=1}^n Xi − t^2/2 = t Z − t^2/2 = Z^2/2 − (t − Z)^2/2,

where Z ∼ N(0, 1) under P0. Thus, the latter expression can be written as


E0[ e^{Z^2/2} (1/(2b)) ∫_{−b}^{b} (√n θn − t)^2 e^{−(t−Z)^2/2} dt ]

(3.10)  ≥ E0[ e^{Z^2/2} I(|Z| ≤ a) (1/(2b)) ∫_{−b}^{b} (√n θn − t)^2 e^{−(t−Z)^2/2} dt ],

where a is a positive constant, a < b. The next step is to change the variable of integration to u = t − Z. The new limits of integration are [−b − Z, b − Z]. For any Z that satisfies |Z| ≤ a, this interval includes the interval [−(b − a), b − a], so that the integral over [−b, b] with respect to t can be estimated from below by the integral in u over [−(b − a), b − a]. Hence, for |Z| ≤ a,

∫_{−b}^{b} (√n θn − t)^2 e^{−(t−Z)^2/2} dt ≥ ∫_{−(b−a)}^{b−a} (√n θn − Z − u)^2 e^{−u^2/2} du

(3.11)  = ∫_{−(b−a)}^{b−a} [ (√n θn − Z)^2 + u^2 ] e^{−u^2/2} du ≥ ∫_{−(b−a)}^{b−a} u^2 e^{−u^2/2} du.

Here the cross term disappears because ∫_{−(b−a)}^{b−a} u exp{−u^2/2} du = 0.

Further, we compute the expected value

(3.12)  E0[ e^{Z^2/2} I(|Z| ≤ a) ] = ∫_{−a}^{a} e^{z^2/2} (1/√(2π)) e^{−z^2/2} dz = 2a/√(2π).

Putting together (3.11) and (3.12), and continuing from (3.10), we arrive at the lower bound

sup_{θ∈R} Eθ[ n(θn − θ)^2 ] ≥ (2a/√(2π)) (1/(2b)) ∫_{−(b−a)}^{b−a} u^2 e^{−u^2/2} du

(3.13)  = (a/b) E[ Z0^2 I(|Z0| ≤ b − a) ],

where Z0 is a standard normal random variable. Choose a and b such that a/b → 1 and b − a → ∞; for example, put a = b − √b and let b → ∞. Then the expression in (3.13) can be made however close to E[Z0^2] = 1. □

The quadratic loss function is not critical in Theorem 3.8. The next theorem generalizes the result to an arbitrary loss function.

Theorem 3.9. Under the assumptions of Theorem 3.8, for any loss function w and any estimator θn, the following lower bound holds:

lim inf_{n→∞} sup_{θ∈Θ} Eθ[ w( √(n/σ^2) (θn − θ) ) ] ≥ ∫_{−∞}^{∞} ( w(u)/√(2π) ) e^{−u^2/2} du.


Proof. In the proof of Theorem 3.8, the quadratic loss function was used only to demonstrate that for any √n θn − Z, the following inequality holds:

∫_{−(b−a)}^{b−a} (√n θn − Z − u)^2 e^{−u^2/2} du ≥ ∫_{−(b−a)}^{b−a} u^2 e^{−u^2/2} du.

We can generalize this inequality to any loss function as follows (see Exercise 3.18). The minimum value of the integral ∫_{−(b−a)}^{b−a} w(c − u) e^{−u^2/2} du over c ∈ R is attained at c = 0, that is,

(3.14)  ∫_{−(b−a)}^{b−a} w(c − u) e^{−u^2/2} du ≥ ∫_{−(b−a)}^{b−a} w(u) e^{−u^2/2} du. □

Remark 3.10. Note that in the proof of Theorem 3.8 (respectively, Theorem 3.9), we considered the values of θ not in the whole parameter set Θ, but only in the interval [−b/√n, b/√n] of however small a length. Therefore, it is possible to formulate a local version of Theorem 3.9, with the proof remaining the same. For any loss function w, the inequality holds

lim_{δ→0} lim inf_{n→∞} sup_{|θ−θ0|<δ} Eθ[ w( √(n/σ^2) (θn − θ) ) ] ≥ ∫_{−∞}^{∞} ( w(u)/√(2π) ) e^{−u^2/2} du. □

3.4. Local Asymptotic Normality (LAN)

The sharp lower bounds of the previous section were proved under the restrictive assumption of normal observations. How far can these results be extended? What must be required from a statistical experiment to ensure an asymptotically sharp lower bound of the minimax risk similar to the one in Theorem 3.9? The answers to these and related questions comprise an essential part of modern mathematical statistics. Below we present some ideas that stay within the scope of this book.

Assume that a statistical experiment is regular in the sense that the Fisher information I(θ) is positive, continuous, and bounded in Θ. Let θ and θ + t/√n belong to Θ for all t in some compact set. If the Taylor expansion below is legitimate, we obtain that

ΔLn(θ, θ + t/√In(θ)) = Ln(θ + t/√In(θ)) − Ln(θ) = ∑_{i=1}^n [ l(Xi, θ + t/√(nI(θ))) − l(Xi, θ) ]

(3.15)  = ( t/√(nI(θ)) ) ∑_{i=1}^n l′(Xi, θ) + ( t^2/(2nI(θ)) ) ∑_{i=1}^n l′′(Xi, θ) (1 + on(1)),

where on(1) → 0 as n → ∞.


By the computational formula for the Fisher information (see Exercise 1.1),

Eθ[ ∑_{i=1}^n l′′(Xi, θ) ] = −nI(θ),

and therefore the Law of Large Numbers ensures the convergence of the second term in (3.15),

( t^2/(2nI(θ)) ) ∑_{i=1}^n l′′(Xi, θ) (1 + on(1)) → −t^2/2  as n → ∞.

Thus, at a heuristic level of understanding, we can expect that for any t ∈ R, the log-likelihood ratio satisfies

(3.16)  ΔLn(θ, θ + t/√In(θ)) = zn(θ) t − t^2/2 + εn(θ, t),

where, by the Central Limit Theorem, the random variable

zn(θ) = ( 1/√(nI(θ)) ) ∑_{i=1}^n l′(Xi, θ)

converges in Pθ-distribution to a standard normal random variable z(θ), and for any δ > 0,

(3.17)  lim_{n→∞} Pθ( |εn(θ, t)| ≥ δ ) = 0.

A family of distributions for which the log-likelihood ratio has the representation (3.16) under the constraint (3.17) is said to satisfy the local asymptotic normality (LAN) condition. It can actually be derived under less restrictive assumptions. In particular, we do not need to require the existence of the second derivative l′′.
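A numerical illustration of the LAN representation (3.16) (a sketch outside the book; NumPy assumed): for the exponential model p(x, θ) = θe^{−θx} one has I(θ) = 1/θ^2, and for simulated data the exact log-likelihood ratio ΔLn(θ, θ + t/√(nI(θ))) is already close to zn(θ)t − t^2/2 for moderate n.

```python
import numpy as np

rng = np.random.default_rng(9)
theta, t, n = 2.0, 1.5, 2000

x = rng.exponential(scale=1.0 / theta, size=n)   # density theta*exp(-theta*x)

I = 1.0 / theta**2                                # Fisher information I(theta)
theta1 = theta + t / np.sqrt(n * I)               # shifted parameter value

# Exact log-likelihood ratio Delta L_n(theta, theta1)
delta_Ln = n * np.log(theta1 / theta) - (theta1 - theta) * x.sum()

# LAN approximation z_n(theta)*t - t^2/2 with the normalized score z_n
z_n = np.sum(1.0 / theta - x) / np.sqrt(n * I)
print("exact Delta L_n         :", delta_Ln)
print("LAN approx z_n*t - t^2/2:", z_n * t - t**2 / 2)
```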

To generalize Theorem 3.9 to the distributions satisfying the LAN con-dition, we need to justify that the remainder term εn(θ, t) may be ignoredin the expression for the likelihood ratio,

exp{ΔLn

(θ, θ + t/

√In(θ)

) }≈ exp

{zn(θ) t − t2/2

}.

To do this, we have to guarantee that the following approximation holds, asn → ∞,

(3.18) Eθ

[ ∣∣∣ exp

{zn(θ) t− t2/2 + εn(θ, t)

}− exp

{zn(θ) t − t2/2

}∣∣∣]→ 0.

Unfortunately, the condition (3.17) that warrants that εn(θ, t) vanishes inprobability does not imply (3.18). The remedy comes in the form of LeCam’stheorem stated below.

Page 41: Mathematical Statistics - 213.230.96.51:8090

30 3. Asymptotic Minimaxity

Theorem 3.11. Under the LAN conditions (3.16) and (3.17), there existsa sequence of random variables zn(θ) such that | zn(θ) − zn(θ) | → 0 inPθ-probability, as n → ∞, and for any c > 0,

limn→∞

sup−c≤t≤c

[ ∣∣∣ exp{zn(θ)t− t2/2 + εn(θ, t)

}− exp

{zn(θ)t− t2/2

}∣∣∣]= 0.

To ease the proof, we split it into lemmas proved as the technical resultsbelow.

Lemma 3.12. Under the LAN condition (3.16), there exists a truncationof zn(θ) defined by

zn(θ) = zn(θ) I(zn(θ) ≤ cn),

with the properly chosen sequence of constants cn , such that the followingequations hold:

(3.19) zn(θ) − zn(θ) → 0 as n → ∞and

(3.20) limn→∞

sup−c≤ t≤ c

[ ∣∣∣ exp{zn(θ) t − t2/2

}− 1

∣∣∣]= 0.

Introduce the notations

ξn(t) = exp{zn(θ) t − t2/2 + εn(θ, t)

},

ξn(t) = exp{zn(θ) t − t2/2

}, and ξ(t) = exp

{z(θ) t − t2/2

}

where zn(θ) is as defined in Lemma 3.12, and z(θ) is a standard normalrandom variable.

Lemma 3.13. Under the LAN condition (3.16), the tails of ξn(t) and ξn(t)are small, uniformly in n and t ∈ [−c, c], in the sense that

limA→∞

supn≥ 1

sup−c≤ t≤ c

[ξn(t) I(ξn(t) > A)

]

(3.21) = limA→∞

supn≥ 1

sup−c≤ t≤ c

[ξn(t) I(ξn(t) > A)

]= 0.

Now we are in the position to prove the LeCam theorem.

Proof of Theorem 3.11. We have to show that for any t ∈ [−c, c], theconvergence takes place:

(3.22) limn→∞

[ ∣∣ ξn(t) − ξn(t)

∣∣]= 0.

From the triangle inequality, we obtain that

[ ∣∣ ξn(t) − ξn(t)

∣∣]≤ Eθ

[ ∣∣∣ ξn(t) I

(ξn(t) ≤ A

)− ξ(t) I

(ξ(t) ≤ A

) ∣∣∣]

+Eθ

[ ∣∣∣ ξn(t) I

(ξn(t) ≤ A

)− ξ(t) I

(ξ(t) ≤ A

) ∣∣∣]

Page 42: Mathematical Statistics - 213.230.96.51:8090

3.5. The Hellinger Distance 31

(3.23) +Eθ

[ξn(t) I

(ξn(t) > A

) ]+ Eθ

[ξn(t) I

(ξn(t) > A

) ].

Due to Lemma 3.13, we can choose A so large that the last two terms donot exceed a however small positive δ. From Lemma 3.12, ξn(t)− ξ(t) → 0in Pθ-distribution, and by the LAN condition, ξn(t)−ξ(t) → 0, therefore, fora fixed A, the first two terms on the right-hand side of (3.23) are vanishinguniformly over t ∈ [−c, c] as n → ∞. �

Finally, we formulate the result analogous to Theorem 3.9 (for the proofsee Exercise 3.20).

Theorem 3.14. If a statistical model satisfies the LAN condition (3.16),then for any loss function w, the asymptotic lower bound of the minimaxrisk holds:

lim infn→∞

infθn

supθ∈Θ

[w(√

In(θ) (θn − θ)) ]

≥∫ ∞

−∞

w(u)√2π

e−u2/2 du.

3.5. The Hellinger Distance

Though this section may seem rather technical, it answers an importantstatistical question. Suppose that the statistical experiment with the familyof densities p(x, θ) is such that p(x, θ0) = p(x, θ1) for some θ0 = θ1, whereθ0, θ1 ∈ Θ, and all x ∈ R. Clearly, no statistical observations can distinguishbetween θ0 and θ1 in this case.

Thus, we have to require that the family of probability densities{p(x, θ)

}

is such that for θ0 = θ1, the densities p( · , θ0) and p( · , θ1) are essentiallydifferent in some sense. How can the difference between these two densitiesbe measured?

For any family of densities{p(x, θ)

}, the set

{√p( · , θ), θ ∈ Θ

}presents

a parametric curve on the surface of a unit sphere in L2-space, the space ofsquare integrable functions in x variable. Indeed, for any θ, the square ofthe L2-norm is

∥∥√

p ( · , θ)∥∥22=

R

(√p(x, θ)

)2dx =

R

p(x, θ) dx = 1.

The Hellinger distance H(θ0, θ1) between p( · , θ0) and p( · , θ1) is defined as

(3.24) H(θ0, θ1) =∥∥√

p( · , θ0) −√p( · , θ1)

∥∥22, θ0, θ1 ∈ Θ.

Lemma 3.15. For the Hellinger distance (3.24), the following identitieshold:

(i) H(θ0, θ1) = 2(1 −

R

√p(x, θ0) p(x, θ1) dx

)

and

(ii) Eθ0

[√Z1(θ0, θ1)

]= 1 − 1

2H(θ0, θ1)

Page 43: Mathematical Statistics - 213.230.96.51:8090

32 3. Asymptotic Minimaxity

where Z1(θ0, θ1) = p(X, θ1)/p(X, θ0) denotes the likelihood ratio for a singleobservation.

Proof. (i) We write by definition

H(θ0, θ1) =

R

(√p(x, θ0) −

√p(x, θ1)

)2dx

=

R

p(x, θ0) dx +

R

p(x, θ1) dx − 2

R

√p(x, θ0) p(x, θ1) dx

= 2(1 −

R

√p(x, θ0) p(x, θ1) dx

).

(ii) By definition of Z1(θ0, θ1), we have

Eθ0

[√Z1(θ0, θ1)

]=

R

√p(x, θ1)

p(x, θ0)p(x, θ0) dx

=

R

√p(x, θ0) p(x, θ1) dx = 1 − 1

2H(θ0, θ1)

where the result of part (i) is applied. �

Lemma 3.16. Let Zn(θ0, θ1) be the likelihood ratio for a sample of size n,

Zn(θ0, θ1) =n∏

i=1

p(Xi, θ1)

p (Xi, θ0), θ0, θ1 ∈ Θ.

Then the following identity is true:

Eθ0

[√Zn(θ0, θ1)

]=(1 − 1

2H(θ0, θ1)

)n.

Proof. In view of independence of observations and Lemma 3.15 (ii), wehave

Eθ0

[√Zn(θ0, θ1)

]=

n∏

i=1

Eθ0

[√

p(Xi, θ1)

p(Xi, θ0)

]

=(1 − 1

2H(θ0, θ1)

)n. �

Assumption 3.17. There exists a constant a > 0 such that for any θ0, θ1 ∈Θ ⊆ R , the inequality H(θ0, θ1) ≥ a (θ0 − θ1)

2 holds. �

Example 3.18. If Xi’s are independent N(0, σ2

)random variables, then

by Lemma 3.15 (i),

H(θ0, θ1) = 2(1 − 1√

2πσ2

∫ ∞

−∞exp

{− (x− θ0)

2

4σ2− (x− θ1)

2

4σ2

}dx)

= 2(1 − exp

{− (θ0 − θ1)

2

8σ2

} 1√2πσ2

∫ ∞

−∞exp

{− (x− θ)2

2σ2

}dx)

Page 44: Mathematical Statistics - 213.230.96.51:8090

3.6. Maximum Likelihood Estimator 33

where θ = (θ0 + θ1)/2. As the integral of the probability density, the latterone equals 1. Therefore,

H(θ0, θ1) = 2(1 − exp

{− (θ0 − θ1)

2

8σ2

}).

If Θ is a bounded interval, then (θ0 − θ1)2/(8σ2) ≤ C with some constant

C > 0. In this case,

H(θ0, θ1) ≥ a (θ0 − θ1)2 , a = (1− e−C)/(4C σ2),

where we used the inequality (1− e−x) ≥ (1− e−C)x/C if 0 ≤ x ≤ C. �

3.6. Maximum Likelihood Estimator

In this section we study regular statistical experiments, which have contin-uous, bounded, and strictly positive Fisher information I(θ).

We call θ∗n an asymptotically minimax estimator, if for any loss functionw, and all sufficiently large n, the following inequality holds:

supθ∈Θ

[w( θ∗n − θ

ψn

) ]≤ r∗ < ∞

where ψn and r∗ are as in (3.7).

Recall from Section 1.1 that an estimator θ∗n is called the maximumlikelihood estimator (MLE), if for any θ ∈ Θ,

Ln(θ∗n) ≥ Ln(θ).

It turns out that Assumption 3.17 guarantees the asymptotic minimaxproperty of the MLE with ψn = 1/

√nI(θ). This result is proved in Theorem

3.20. We start with a lemma, the proof of which is postponed until the nextsection.

Lemma 3.19. Under Assumption 3.17, for any θ ∈ Θ and any c > 0, theMLE θ ∗

n satisfies the inequality

(√n |θ ∗

n − θ| ≥ c)≤ C exp

{− a c2/4

}

where the constant C = 2 + 3√

πI∗/a with I∗ = supθ∈Θ I(θ) < ∞.

At this point we are ready to prove the asymptotic minimaxity of theMLE.

Theorem 3.20. Under Assumption 3.17, the MLE is asymptotically min-imax. That is, for any loss function w and for any θ ∈ Θ, the normalizedrisk function of the MLE is finite,

lim supn→∞

[w(√

nI(θ) (θ ∗n − θ)

) ]= r∗ < ∞.

Page 45: Mathematical Statistics - 213.230.96.51:8090

34 3. Asymptotic Minimaxity

Proof. Since w(u) is an increasing function for u ≥ 0, we have

[w(√

nI(θ) (θ ∗n − θ)

) ]

≤∞∑

m=0

w(m+ 1)Pθ

(m ≤

√nI(θ) |θ ∗

n − θ| ≤ m+ 1)

≤∞∑

m=0

w(m+ 1)Pθ

(√n |θ ∗

n − θ| ≥ m/√

I(θ)).

By definition, the loss w is bounded from above by a power function, whilethe probabilities decrease exponentially fast by Lemma 3.19. Therefore, thelatter sum is finite. �

To find the sharp upper bound for the MLE, we make an additionalassumption that allows us to prove a relatively simple result. As shown inthe next theorem, the normalized deviation of the MLE from the true valueof the parameter,

√nI(θ) (θ∗n − θ ), converges in distribution to a standard

normal random variable. Note that this result is sufficient to claim theasymptotically sharp minimax property for all bounded loss functions.

Theorem 3.21. Let Assumption 3.17 and the LAN condition (3.16) hold.Moreover, suppose that for any δ > 0 and any c > 0, the remainder term in(3.16) satisfies the equality:

(3.25) limn→∞

supθ∈Θ

(sup

−c≤ t≤ c| εn(θ, t) | ≥ δ

)= 0.

Then for any x ∈ R, uniformly in θ ∈ Θ, the MLE satisfies the limitingequation:

limn→∞

(√n I(θ) (θ∗n − θ ) ≤ x

)= Φ(x)

where Φ denotes the standard normal cumulative distribution function.

Proof. Fix a large c such that c > |x |, and a small δ > 0 . Put

t∗n =√n I(θ) (θ∗n − θ ).

Define two random events

An = An(c, δ) ={

sup−2c≤ t≤ 2c

| εn(θ, t) | ≥ δ}

and

Bn = Bn(c) ={| t∗n | ≥ c

}.

Note that under the condition (3.25), we have that Pθ(An) → 0 asn → ∞. Besides, as follows from the Markov inequality and Theorem 3.20with w(u) = |u|,

Pθ(Bn) ≤ Eθ

[| t∗n |

]/c ≤ r∗/c.

Page 46: Mathematical Statistics - 213.230.96.51:8090

3.7. Proofs of Technical Lemmas 35

Let An and Bn denote the complements of the events A and B, respectively.We will use the following inclusion (for the proof see Exercise 3.21)

(3.26) An ∩ Bn ⊆{| t∗n − zn(θ) | ≤ 2

√δ}

or, equivalently,

(3.27){| t∗n − zn(θ) | > 2

√δ}⊆ An ∪ Bn

where zn(θ) is defined in (3.16). Elementary inequalities and (3.26) implythat

(t∗n ≤ x

)≤ Pθ

({t∗n ≤ x} ∩ An ∩ Bn

)+ Pθ

(An

)+ Pθ

(Bn

)

≤ Pθ

(zn(θ) ≤ x+ 2

√δ)+ Pθ

(An

)+ Pθ

(Bn

).

Taking the limit as n → ∞, we obtain that

(3.28) lim supn→∞

(t∗n ≤ x

)≤ Φ(x+ 2

√δ) + r∗/c

where we use the fact that zn(θ) is asymptotically standard normal. Next,

(t∗n ≤ x

)≥ Pθ

( {t∗n ≤ x

}∩{| t∗n − zn(θ) | ≤ 2

√δ} )

≥ Pθ

( {zn(θ) ≤ x − 2

√δ}∩{| t∗n − zn(θ) | ≤ 2

√δ} )

≥ Pθ

(zn(θ) ≤ x − 2

√δ)− Pθ

(| t∗n − zn(θ) | > 2

√δ)

≥ Pθ

(zn(θ) ≤ x − 2

√δ)− Pθ

(An

)− Pθ

(Bn

)

where at the last stage we have applied (3.27). Again, taking n → ∞, wehave that

(3.29) lim infn→∞

(t∗n ≤ x

)≥ Φ(x − 2

√δ) − r∗/c.

Now we combine (3.28) and (3.29) and take into account that c is howeverlarge and δ is arbitrarily small. Thus, the theorem follows. �

3.7. Proofs of Technical Lemmas

Proof of Lemma 3.12. Define zn(θ, A) = zn(θ) I(zn(θ) ≤ A) where A isa large positive constant. Note that zn(θ, A) converges in distribution as nincreases to z(θ, A) = z(θ) I(z(θ) ≤ A) with a standard normal z(θ). Thus,for any k , k = 1, 2, . . . , we can find a constant Ak and an integer nk so largethat for all n ≥ nk,

sup−c≤ t≤ c

[ ∣∣∣ exp

{zn(θ, Ak) t − t2/2

}− 1

∣∣∣]≤ 1/k.

Without loss of generality, we can assume that nk is an increasing sequence,nk → ∞ as k → ∞. Finally, put cn = Ak if nk ≤ n < nk+1. From thisdefinition, (3.19) and (3.20) follow. �

Page 47: Mathematical Statistics - 213.230.96.51:8090

36 3. Asymptotic Minimaxity

Proof of Lemma 3.13. First we will prove (3.21) for ξn(t). Note that

ξn(t), n = 1, . . . , are positive random variables. By Lemma 3.12, for anyt ∈ [−c, c], the convergence takes place

(3.30) ξn(t) → ξ(t) as n → ∞.

Hence, the expected value of ξn(t) converges as n → ∞,

(3.31) sup−c≤ t≤ c

∣∣Eθ

[ξn(t)

]− Eθ

[ξ(t)

] ∣∣ = sup−c≤ t≤ c

∣∣Eθ

[ξn(t)

]− 1

∣∣ → 0.

Choose an arbitrarily small δ > 0. There exists A(δ) such that uniformlyover t ∈ [−c, c] , the following inequality holds:

(3.32) Eθ

[ξ(t) I

(ξ(t) > A(δ)

) ]≤ δ.

Next, we can choose n = n(δ) so large that for any n ≥ n(δ) and allt ∈ [−c, c], the following inequalities are satisfied:

(3.33)∣∣Eθ

[ξn(t)

]− Eθ

[ξ(t)

] ∣∣ ≤ δ

and

(3.34)∣∣∣Eθ

[ξn(t) I

(ξn(t) ≤ A(δ)

) ]− Eθ

[ξ(t) I

(ξ(t) ≤ A(δ)

) ] ∣∣∣ ≤ δ.

To see that the latter inequality holds, use the fact that A(δ) is fixed and

ξn(t) → ξ(t) as n → ∞.

The triangle inequality and the inequalities (3.32)-(3.34) imply that forany A ≥ A(δ),

[ξn(t) I

(ξn(t) > A

) ]≤ Eθ

[ξn(t) I

(ξn(t) > A(δ)

) ]

= Eθ

[ξn(t) − ξn(t) I

(ξn(t) ≤ A(δ)

) ]

−Eθ

[ξ(t) − ξ(t) I

(ξ(t) ≤ A(δ)

)− ξ(t) I

(ξ(t) > A(δ)

) ]

≤∣∣Eθ

[ξn(t)

]− Eθ

[ξ(t)

] ∣∣

+∣∣∣Eθ

[ξn(t) I

(ξn(t) ≤ A(δ)

) ]− Eθ

[ξ(t) I

(ξ(t) ≤ A(δ)

) ] ∣∣∣

(3.35) +Eθ

[ξ(t) I

(ξ(t) > A(δ)

) ]≤ 3δ.

There are finitely many n such that n ≤ n(δ). For each n ≤ n(δ), we canfind An so large that for all A ≥ An, the following expected value is bounded:

[ξn(t) I

(ξn(t) > A )

]≤ 3δ. Put A0 = max

(A1, . . . , An(δ), A(δ)

). By

definition, for any A ≥ A0, and all t ∈ [−c, c], we have that

(3.36) supn≥ 1

[ξn(t) I

(ξn(t) > A

) ]≤ 3δ.

Page 48: Mathematical Statistics - 213.230.96.51:8090

3.7. Proofs of Technical Lemmas 37

Thus, the lemma follows for ξn(t).

The proof for ξn(t) is simpler. Similarly to ξn(t), the random vari-ables ξn(t), n = 1, . . . , are positive, and the convergence analogous to(3.30) is valid from the LAN condition, ξn(t) → ξ(t) as n → ∞ . But sinceEθ

[ξn(t)

]= 1, the convergence (3.31) of the expected values is replaced by

exact equality,∣∣Eθ

[ξn(t)

]− Eθ

[ξ(t)

] ∣∣ = 0. Therefore, (3.35) and (3.36)hold for ξn(t) with the upper bound replaced by 2δ, and the result of thelemma follows. �Proof of Lemma 3.19. The proof of this lemma (and Theorem 3.20) isdue to A.I. Sakhanenko (cf. Borovkov [Bor99], Chapter 2, §23). The proofis based on two results which we state and prove below.

Introduce the likelihood ratio

Zn(θ, θ + t) =n∏

i=1

p(Xi, θ + t)

p(Xi, θ).

Result 1. Put zn(t) =[Zn(θ, θ + t)

]3/4. Under Assumption 3.17, for any

θ, θ + t ∈ Θ, the following inequalities hold:

(3.37) Eθ

[√Zn(θ, θ + t)

]≤ exp

{− an t2/2

},

(3.38) Eθ

[zn(t)

]≤ exp

{− an t2/4

},

and

(3.39) Eθ

[z′n(t)

]≤ 3

4

√In(θ + t) exp

{− an t2/4

}

where z′n(t) = dzn(t)/dt.

Proof. From Lemma 3.16 and Assumption 3.17, we obtain (3.37),

[√Zn(θ, θ + t)

]=(1 − 1

2H(θ0, θ1)

)n

≤ exp{− n

2H(θ0, θ1)

}≤ exp

{− an t2/2

}.

To prove (3.38), we use the Cauchy-Schwarz inequality and (3.37),

[zn(t)

]= Eθ

[ (Zn(θ, θ + t)

)3/4 ]

= Eθ

[ (Zn(θ, θ + t)

)1/2 (Zn(θ, θ + t)

)1/4 ]

≤(Eθ

[Zn(θ, θ + t)

] )1/2 (Eθ

[ (Zn(θ, θ + t)

)1/2 ] )1/2

=(Eθ

[ (Zn(θ, θ + t)

)1/2 ] )1/2 ≤ exp{− an t2/4

}.

Here we used the identity (show!) Eθ

[Zn(θ, θ + t)

]= 1.

Page 49: Mathematical Statistics - 213.230.96.51:8090

38 3. Asymptotic Minimaxity

The proof of (3.39) requires more calculations. We write

[z′n(t)

]= Eθ

[ d

dtexp

{ 3

4

(Ln(θ + t) − Ln(θ)

) } ]

= Eθ

[ 34L′n(θ + t)

(Zn(θ, θ + t)

)3/4 ]

=3

4Eθ

[ (L′n(θ + t)

√Zn(θ, θ + t)

) (Zn(θ, θ + t)

)1/4 ]

≤ 3

4

(Eθ

[ (L′n(θ + t)

)2Zn(θ, θ + t)

]1/2 (Eθ

[ (Zn(θ, θ + t)

)1/2 ] )1/2

=3

4

(Eθ+t

[ (L′n(θ+ t)

)2 ] )1/2 (Eθ

[ (Zn(θ, θ+ t)

)1/2 ] )1/2 (by (3.8)

)

≤ 3

4

√In(θ + t) exp

{− an t2/4

}.

The last inequality sign is justified by the definition of the Fisher informa-tion, and (3.37). �

Result 2. Let Assumption 3.17 be true. Then for any positive constants γand c, the following inequality holds:

(sup

| t | ≥ c/√n

Zn(θ, θ + t) ≥ eγ)

≤ C e− 3γ/4 exp{−a c2/4}

where C = 2 + 3√

πI∗/a with I∗ = supθ∈Θ I(θ) < ∞.

Proof. Consider the case t > 0. Note that

supt≥ c/

√n

zn(t) = zn(c/√n) + sup

t> c/√n

∫ t

c/√nz′n(u) du

≤ zn(c/√n) + sup

t > c/√n

∫ t

c/√n|z′n(u)| du

≤ zn(c/√n) +

∫ ∞

c/√n|z′n(u)| du.

Applying Result 1, we find that

[sup

t≥ c/√n

zn(t)]≤ exp

{−a c2/4

}+

3

4

√n I∗

∫ ∞

c/√nexp

{−anu2/4

}du

= exp{− a c2/4

}+

3

4

√n I∗

√4π

an

∫ ∞

c√

a/2

1√2π

e−u2/2 du

= exp{− a c2/4

}+

3

2

√πI∗

a

[1 − Φ(c

√a/2)

]

Page 50: Mathematical Statistics - 213.230.96.51:8090

3.7. Proofs of Technical Lemmas 39

where Φ denotes the cumulative distribution function of the standard normalrandom variable. The inequality 1 − Φ(x) ≤ e−x2/2 yields

exp{− a c2/4

}+

3

2

√πI∗

a

[1 − Φ(c

√a/2)

]

≤(1 +

3

2

√πI∗

a

)exp

{− a c2/4

}=

1

2C exp

{− a c2/4

}.

The same inequality is true for t < 0 (show!),

[sup

t≤−c/√n

zn(t)]≤ 1

2C exp

{− a c2/4

}.

Further,

(sup

| t | ≥ c/√n

Zn(θ, θ + t) ≥ eγ)

≤ Pθ

(sup

t≥ c/√n

Zn(θ, θ + t) ≥ eγ)+ Pθ

(sup

t≤−c/√n

Zn(θ, θ + t) ≥ eγ)

= Pθ

(sup

t≥ c/√n

zn(t) ≥ e3 γ/4)+ Pθ

(sup

t≤−c/√n

zn(t) ≥ e3 γ/4),

and the Markov inequality P(X > x) ≤ E[X ] / x completes the proof,

≤ 1

2C e−3 γ/4 exp

{− a c2/4

}+

1

2C e−3 γ/4 exp

{− a c2/4

}

= C e−3 γ/4 exp{− a c2/4

}. �

Now we are in the position to prove Lemma 3.19. Applying the inclusion{√

n |θ ∗n − θ| ≥ c

}={

sup|t| ≥ c/

√n

Zn(θ, θ + t) ≥ sup|t|<c/

√n

Zn(θ, θ + t)}

⊆{

sup|t| ≥ c/

√n

Zn(θ, θ + t) ≥ Zn(θ, θ) = 1},

and using the Result 2 with γ = 0, we obtain

(√n |θ ∗

n − θ| ≥ c)

≤ Pθ

(sup

| t | ≥ c/√n

Zn(θ, θ + t) ≥ 1)

≤ C exp{− a c2/4

}. �

Page 51: Mathematical Statistics - 213.230.96.51:8090

40 3. Asymptotic Minimaxity

Exercises

Exercise 3.14. Verify that in Example 1.5, the estimator θn =√

X/n is

an asymptotically unbiased estimator of θ. Hint: Note that |√X/n − θ| =

|X/n− θ2| / |√X/n+ θ|, and thus, Eθ

[|√X/n− θ|

]≤ θ−1

[|X/n− θ2|

].

Now use the Cauchy-Schwarz inequality to finish off the proof.

Exercise 3.15. Show that the Hodges estimator defined by (3.3) is asymp-totically unbiased and satisfies the identities (3.4).

Exercise 3.16. Prove Theorem 3.7.

Exercise 3.17. Suppose the conditions of Theorem 3.7 hold, and a lossfunction w is such that w(1/2) > 0. Show that for any estimator θn thefollowing lower bound holds:

supθ∈Θ

[w(√

n (θn − θ)) ]

≥ 1

2w(1/2) p0 exp{z0}.

Hint: Use Theorem 3.7 and the inequality (show!)

w(√

n (θn − θ))+ w

(√n (θn − θ)− 1

)≥ w(1/2), for any θ ∈ Θ.

Exercise 3.18. Prove (3.14). Hint: First show this result for bounded lossfunctions.

Exercise 3.19. Prove the local asymptotic normality (LAN) for

(i) exponential model with the density

p(x , θ) = θ exp{− θ x} , x , θ > 0;

(ii) Poisson model with the probability mass function

p(x , θ) =θ x

x!exp{−θ} , θ > 0 , x ∈ {0 , 1 , . . . } .

Exercise 3.20. Prove Theorem 3.14. Hint: Start with a truncated lossfunction wC(u) = min(w(u), C) for some C > 0. Applying Theorem 3.11,obtain an analogue of (3.9) of the form

supθ∈R

[wC

(√nI(θ) (θn − θ)

) ]

≥ 1

2b

∫ b

−bE0

[wC

(√an θn − t

)exp

{zn(0) t− t2/2

} ]dt + on(1)

Page 52: Mathematical Statistics - 213.230.96.51:8090

Exercises 41

where an = nI(t/√nI(0)

), zn(0) is an asymptotically standard normal

random variable, and on(1) → 0 as n → ∞ . Then follow the lines ofTheorems 3.8 and 3.9, and, finally, let C → ∞ .

Exercise 3.21. Consider a distorted parabola zt − t2/2 + ε(t) where z hasa fixed value and −2c ≤ t ≤ 2c. Assume that the maximum of this functionis attained at a point t∗ that lies within the interval [−c, c]. Suppose that the

remainder term satisfies sup−2c≤ t≤ 2c | ε(t) | ≤ δ . Show that | t∗−z | ≤ 2√δ .

Page 53: Mathematical Statistics - 213.230.96.51:8090
Page 54: Mathematical Statistics - 213.230.96.51:8090

Chapter 4

Some IrregularStatistical Experiments

4.1. Irregular Models: Two Examples

As shown in the previous chapters, in regular models, for any estimator θn,the normalized deviation

√nI(θ) (θn − θ) either grows or stays bounded in

the minimax sense, as n increases. In particular, we have shown that the

quadratic risk Eθ

[ (θn − θ

)2]decreases not faster than at the rate O(n−1)

as n → ∞. This result has been obtained under some regularity conditions.The easiest way to understand their importance is to look at some irregularexperiments commonly used in statistics, for which the regularity conditionsare violated and the quadratic risk converges faster than O(n−1). We presenttwo examples below.

Example 4.1. Suppose the observations X1, . . . , Xn come from the uni-form distribution on [ 0 , θ ]. The family of probability densities can be de-fined as p(x , θ) = θ−1

I(0 ≤ x ≤ θ

). In this case, the MLE of θ is the

maximum of all observations (see Exercise 4.22), that is, θn = X(n) =

max(X1, . . . , Xn

). The estimator

θ ∗n =

n+ 1

nX(n)

is an unbiased estimator of θ with the variance

Varθ[θ ∗n

]=

θ2

n(n+ 2)= O

(n−2

)as n → ∞. �

43

Page 55: Mathematical Statistics - 213.230.96.51:8090

44 4. Some Irregular Statistical Experiments

Example 4.2. Consider a model with observations X1, . . . , Xn which havea shifted exponential distribution with the density

p(x , θ) = e− (x−θ)I(x ≥ θ), θ ∈ R.

It can be shown (see Exercise 4.23) that the MLE of θ is θn = X(1) =

min(X1, . . . , Xn), and that θ ∗n = X(1) − n−1 is an unbiased estimator of θ

with the variance Varθ[θ ∗n

]= n−2. �

The unbiased estimators in the above examples violate the Cramer-Raolower bound (1.2) since their variances decrease faster than O

(n−1

). Why

does it happen? In the next section we explain that in these examples theFisher information does not exist, and therefore, the Cramer-Rao inequalityis not applicable.

4.2. Criterion for Existence of the Fisher Information

For any probability density p(x , θ), consider the set{√

p( · , θ), θ ∈ Θ}.

It has been shown in Section 3.5 that for any fixed θ,√

p( · , θ) has a unitL2-norm, that is,

∥∥√

p ( · , θ)∥∥22=

R

(√p(x , θ)

)2dx = 1.

The existence of the Fisher information is equivalent to the smoothnessof this curve as a function of θ. We show that the Fisher information existsif this curve is differentiable with respect to θ in the L2-space.

Theorem 4.3. The Fisher information is finite if and only if the L2-normof the derivative

∥∥ ∂√

p ( · , θ) / ∂θ∥∥2is finite. The Fisher information is

computed according to the formula

I(θ) = 4∥∥ ∂

∂θ

√p( · , θ)

∥∥22.

Proof. The proof is straightforward:

∥∥ ∂

∂θ

√p( · , θ)

∥∥22=

R

( ∂

∂θ

√p(x, θ)

) 2dx

=

R

(∂p(x, θ)/∂θ

2√

p(x, θ)

) 2dx =

1

4

R

( ∂p(x, θ)/∂θ

p(x, θ)

) 2p(x, θ) dx

=1

4

R

(∂ ln p(x, θ)

∂θ

) 2p(x, θ) dx =

1

4I(θ). �

Example 4.4. The family of the uniform densities in Example 4.1 is notdifferentiable in the sense of Theorem 4.3. By definition,

∥∥ ∂

∂θ

√p( · , θ)

∥∥22= lim

Δθ→0(Δθ)−2

∥∥√p( · , θ +Δθ) −

√p( · , θ)

∥∥22

Page 56: Mathematical Statistics - 213.230.96.51:8090

4.3. Asymptotically Exponential Statistical Experiment 45

= limΔθ→0

(Δθ)−2∥∥ 1√

θ +ΔθI[ 0, θ+Δθ ]( · ) − 1√

θI[ 0, θ ]( · )

∥∥22.

A finite limit exists if and only if

∥∥ 1√θ +Δθ

I[ 0, θ+Δθ ]( · ) − 1√θI[ 0, θ ]( · )

∥∥22= O

((Δθ) 2

)as Δθ → 0.

However, the L2-norm decreases at a lower rate. To see this, assume Δθ ispositive and write

∥∥ 1√θ +Δθ

I[ 0, θ+Δθ ]( · ) − 1√θI[ 0, θ ]( · )

∥∥22

=

R

[ 1√θ +Δθ

I(0 ≤ x ≤ θ +Δθ

)− 1√

θI(0 ≤ x ≤ θ

) ] 2dx

=

∫ θ

0

( 1√θ +Δθ

− 1√θ

) 2dx +

∫ θ+Δθ

θ

( 1√θ +Δθ

) 2dx

=

(√θ −

√θ +Δθ

) 2

θ +Δθ+

Δθ

θ +Δθ

= 2(1 −

(1+Δθ/θ

)−1/2)

= Δθ/θ + o(Δθ/θ) � O((Δθ)2

)as Δθ → 0.

Hence, in this example,√

p( · , θ) is not differentiable as a function of θ, andthe finite Fisher information does not exist. �

A similar result is true for the shifted exponential model introduced inExample 4.2 (see Exercise 4.24).

If we formally write the Fisher information as I(θ) = ∞, then the right-hand side of the Cramer-Rao inequality (1.2) becomes zero, and there is nocontradiction with the faster rate of convergence.

4.3. Asymptotically Exponential Statistical Experiment

What do the two irregular models considered in the previous sections have incommon? First of all,

√p( · , θ) is not differentiable in the sense of Theorem

4.3, and∥∥√

p( · , θ +Δθ) −√

p( · , θ)∥∥22

= O(Δθ) as Δθ → 0. For theuniform model, this fact is verified in Example 4.4, while for the shiftedexponential distribution, it is assigned as Exercise 4.24.

Another feature that these models share is the limiting structure of thelikelihood ratio

Zn

(θ0, θ1

)= exp

{Ln

(θ1)− Ln(θ0)

}

=n∏

i=1

p(Xi, θ1)

p(Xi, θ0), θ0, θ1 ∈ Θ.

Page 57: Mathematical Statistics - 213.230.96.51:8090

46 4. Some Irregular Statistical Experiments

A statistical experiment is called asymptotically exponential if for anyθ ∈ Θ, there exists an asymptotically exponential random variable Tn suchthat

limn→∞

P(Tn ≥ τ ) = exp{− λ(θ) τ

}, τ > 0,

and either

(i) Zn

(θ, θ + t/n

)= exp

{− λ(θ) t

}I(t ≥ −Tn

)+ on(1)

or

(ii) Zn

(θ, θ + t/n

)= exp

{λ(θ) t

}I(t ≤ Tn

)+ on(1)

where λ(θ) is a continuous positive function of θ, θ ∈ Θ, and on(1) → 0in Pθ-probability as n → ∞.

Both uniform and shifted exponential models are special cases of theasymptotically exponential statistical experiment, as stated in Propositions4.5 and 4.6 below.

Proposition 4.5. The uniform statistical experiment defined in Example4.1 is asymptotically exponential with λ(θ) = 1/θ.

Proof. The likelihood ratio for the uniform distribution is

Zn

(θ , θ + t/n

)=( θ

θ + t/n

)n ∏

i=1

I(Xi ≤ θ + t/n

)

I(Xi ≤ θ

)

=( θ

θ + t/n

)n I(X(n) ≤ θ + t/n

)

I(X(n) ≤ θ

) .

Note that the event{X(n) ≤ θ

}holds with probability 1. Also,

( θ

θ + t/n

)n=(1 +

t/θ

n

)−n= exp

{− t/θ

}+ on(1) as n → ∞

and

I(X(n) ≤ θ + t/n

)= I(t ≥ −Tn

)where Tn = n (θ − X(n)).

It remains to show that Tn has a limiting exponential distribution. Indeed,

limn→∞

(Tn ≥ τ

)= lim

n→∞Pθ

(n (θ − X(n)) ≥ τ

)

= limn→∞

(X(n) ≤ θ − τ/n

)= lim

n→∞

( θ − τ/n

θ

)n= e−τ/θ. �

A similar argument proves the next proposition (see Exercise 4.26).

Proposition 4.6. The shifted exponential statistical experiment defined inExample 4.2 is asymptotically exponential with λ(θ) = 1.

Page 58: Mathematical Statistics - 213.230.96.51:8090

4.5. Sharp Lower Bound 47

4.4. Minimax Rate of Convergence

In accordance with the definition (3.7), the estimators in Examples 4.1 and4.2 have guaranteed rate of convergence ψn = O(n−1). Can this rate beimproved? That is, are there estimators that converge with faster rates?The answer is negative, and the proof is relatively easy.

Lemma 4.7. In an asymptotically exponential statistical experiment, thereexists a constant r∗ > 0 not depending on n such that for any estimator θn,the following lower bound holds:

lim infn→∞

supθ∈Θ

[ (n (θn − θ)

) 2 ] ≥ r∗.

Proof. Take θ0 ∈ Θ and θ1 = θ0 + 1/n ∈ Θ. Assume that property (ii)in the definition of an asymptotically exponential model holds. Then, as inthe proof of Lemma 3.4, we have

supθ∈Θ

[ (n (θn − θ)

) 2 ] ≥ maxθ∈{θ0, θ1}

[ (n (θn − θ)

) 2 ]

≥ n2

2Eθ0

[(θn − θ0)

2 + (θn − θ1)2 eλ(θ0)+ on(1) I

(1 ≤ Tn

) ]

≥ n2

2Eθ0

[ ((θn − θ0)

2 + (θn − θ1)2)I(Tn ≥ 1

) ],

since λ(θ0) + on(1) ≥ 0,

≥ n2

2

(θ1 − θ0)2

2Pθ0

(Tn ≥ 1

)

=1

4Pθ0

(Tn ≥ 1

)→ 1

4exp{−λ(θ0) } as n → ∞. �

Remark 4.8. The rate of convergence may be different from O(n−1) forsome other irregular statistical experiments, but those models are not asymp-totically exponential. For instance, the model described in Exercise 1.8 isnot regular (the Fisher information does not exist) if −1 < α ≤ 1. Therate of convergence in this model depends on α and is, generally speaking,different from O(n−1). �

4.5. Sharp Lower Bound

The constant r∗ = 14 exp{−λ(θ0) } in the proof of Lemma 4.7 is far from

being sharp. In the theorem that follows, we state a local version of the lowerbound with an exact constant for an asymptotically exponential experiment.

Page 59: Mathematical Statistics - 213.230.96.51:8090

48 4. Some Irregular Statistical Experiments

Theorem 4.9. Consider an asymptotically exponential statistical exper-iment. Assume that it satisfies property (ii) of the definition, and putλ0 = λ(θ0). Then for any θ0 ∈ Θ, any loss function w, and any estimator

θn, the following lower bound holds:

limδ→ 0

lim infn→∞

supθ : |θ−θ0|<δ

[w(n (θn − θ)

) ]≥ λ 0 min

y∈R

∫ ∞

0w(u−y) e−λ0 u du.

Proof. Choose a large positive number b and assume that n is so largethat b < δ n. Put wC(u) = min(w(u), C) where C is an arbitrarily large

constant. For any θn, we estimate the supremum over θ of the normalizedrisk by the integral

supθ : |θ−θ0|<δ

[wC

(n(θn − θ)

)]≥ 1

b

∫ b

0Eθ0+u/n

[wC

(n(θn − θ0 − u/n)

)]du

(4.1) =1

bEθ0

[ ∫ b

0wC

(n (θn − θ0) − u

) (eλ0uI

(u ≤ Tn

)+ on(1)

)du].

Here we applied the change of measure formula. Now, since wC is a boundedfunction,

1

bEθ0

[ ∫ b

0wC

(n (θn − θ0) − u

)on(1)

)du]= on(1),

and, continuing from (4.1), we obtain

=1

bEθ0

[ ∫ b

0wC

(n (θn − θ0) − u

)eλ0uI

(u ≤ Tn

)du]+ on(1)

≥ 1

bEθ0

[I(√

b ≤ Tn ≤ b) ∫ Tn

0wC

(n (θn − θ0) − u

)eλ0u du

]+ on(1),

which, after a substitution u = t+ Tn, takes the form

=1

bEθ0

[eλ0TnI

(√b ≤ Tn ≤ b

) ∫ 0

−Tn

wC

(n(θn − θ0)− Tn − t

)eλ0tdt

]+ on(1).

Put y = n (θn − θ0) − Tn, and continue:(4.2)

≥ 1

bEθ0

[eλ0Tn I

(√b ≤ Tn ≤ b

) ]miny∈R

∫ 0

−√bwC

(y − t

)eλ0t dt + on(1).

Note that

limn→∞

1

bEθ0

[eλ0Tn I

(√b ≤ Tn ≤ b

) ]=

1

b

∫ b

√bλ0 e

−λ0t+λ0t dt

=λ0

(b −

√b)

b= λ0

(1 − 1√

b

).

Page 60: Mathematical Statistics - 213.230.96.51:8090

Exercises 49

This provides the asymptotic lower bound for (4.2)

λ0

(1 − 1√

b

)miny∈R

∫ 0

−√bwC

(y − u

)eλ0u du

= λ0

(1 − 1√

b

)miny∈R

∫ √b

0wC

(u− y

)e−λ0 u du

where b and C can be taken however large, which proves the theorem. �

For the quadratic risk function, the lower bound in Theorem 4.9 can befound explicitly.

Example 4.10. If w(u) = u2, then

λ0 miny∈R

∫ ∞

0(u− y)2 e−λ0 u du = min

y∈R

[y2 − 2y/λ0 + 2/λ2

0

]

= miny∈R

[(y − 1/λ0)

2 + 1/λ20

]= 1/λ2

0.

By Proposition 4.5, for the uniform model, λ 0 = 1/θ0, hence, the exactlower bound equals θ 2

0 . For the shifted exponential experiment, accordingto Proposition 4.6, λ0 = 1 and thus, the lower bound is 1. �

Remark 4.11. In Exercise 4.27 we ask the reader to show that the lowerbound limiting constant of Theorem 4.9 is attainable in the uniform andshifted exponential models under the quadratic risk. The sharpness of thebound holds in general, for all asymptotically exponential models, but undersome additional conditions. �

Exercises

Exercise 4.22. Show that if X1, . . . , Xn are independent uniform(0, θ) ran-

dom variables, then (i) the MLE of θ is θn = X(n) = max(X1, . . . , Xn),

(ii) θ ∗n = (n+1)X(n)/n is an unbiased estimator of θ, and (iii) Varθ

[θ ∗n

]=

θ2/[n(n+ 2)].

Exercise 4.23. Consider independent observationsX1, . . . , Xn from a shiftedexponential distribution with the density

p (x , θ) = exp{−(x − θ) }, x ≥ θ, θ ∈ R .

Verify that (i) θn = X(1) = min(X1, . . . , Xn) is the MLE of θ, (ii) θ ∗n =

X(1) − 1/n is an unbiased estimator of θ, and (iii) Varθ[θ ∗n

]= 1/n2.

Exercise 4.24. Refer to Exercise 4.23. Prove that in the shifted exponentialmodel the Fisher information does not exist.

Page 61: Mathematical Statistics - 213.230.96.51:8090

50 4. Some Irregular Statistical Experiments

Exercise 4.25. Let p(x, θ) = c− if −1 < x < 0 and p(x, θ) = c+ if0 < x < 1. Assume that p(x, θ) = 0 if x is outside of the interval (−1, 1),and that the jump of the density at the origin equals θ , i.e., c+ − c− =θ , θ ∈ Θ ⊂ (0, 1). Use the formula in Theorem 4.3 to compute the Fisherinformation.

Exercise 4.26. Prove Proposition 4.6.

Exercise 4.27. Show that (i) in the uniform model (see Exercise 4.22),

limn→∞

Eθ0

[ (n(θ ∗

n − θ0))2 ]

= θ20.

(ii) in the shifted exponential model (see Exercise 4.23),

Eθ0

[ (n(θ ∗

n − θ0))2 ]

= 1.

Exercise 4.28. Compute explicitly the lower bound in Theorem 4.9 for theabsolute loss function w(u) = |u|.

Exercise 4.29. Suppose n independent observations have the shifted ex-ponential distribution with the location parameter θ. Using an argumentinvolving the Bayes risk, show that for any estimator θn, the quadratic min-imax risk is bounded from below (cf. Example 4.10),

infθn

supθ∈R

[ (n (θn − θ)

) 2 ] ≥ 1.

(i) Take a uniform(0, b) prior density and let Y = min(X(1), b). Verify thatthe posterior density is defined only if X(1) > 0, and is given by the formula

fb(θ |X1 , . . . , Xn) =n exp{n θ }

exp{nY } − 1, 0 ≤ θ ≤ Y.

(ii) Check that the posterior mean is equal to

θ ∗n (b) = Y − 1

n+

Y

exp{nY } − 1.

(iii) Argue that for any θ,√b ≤ θ ≤ b −

√b, the normalized quadratic risk

of the estimator θ ∗n (b) has the limit

limb→∞

[ (n (θ ∗

n (b) − θ))2 ]

= 1.

(iv) Show that

supθ∈R

[ (n (θn − θ)

)2 ] ≥ b− 2√b

binf√

b≤ θ≤ b−√bEθ

[ (n (θ ∗

n (b) − θ))2 ]

where the right-hand side is arbitrarily close to 1 for sufficiently large b.

Page 62: Mathematical Statistics - 213.230.96.51:8090

Chapter 5

Change-Point Problem

5.1. Model of Normal Observations

Consider a statistical model with normal observations X1 , . . . , Xn whereXi ∼ N (0 , σ2) if i = 1 , . . . , θ, and Xi ∼ N (μ , σ2) if i = θ + 1 , . . . , n. Aninteger parameter θ belongs to a subset Θ of all positive integers Z+,

Θ = Θα ={θ : αn ≤ θ ≤ (1− α)n , θ ∈ Z

+}

where α is a given number, 0 < α < 1/2. We assume that the standarddeviation σ and the expectation μ are known. Put c = μ/σ. This ratio iscalled a signal-to-noise ratio.

The objective is to estimate θ from observations X1 , . . . , Xn . The pa-rameter θ is called the change point, and the problem of its estimation istermed the change-point problem. Note that it is assumed that there are atleast αn observations obtained before and after the change point θ, that is,the number of observations of both kinds are of the same order O(n).

In the context of the change-point problem, the index imay be associatedwith the time at which the observationXi becomes available. This statisticalmodel differs from the models of the previous chapters in the respect thatit deals with non-homogeneous observations since the expected value of theobservations suffers a jump at θ.

The joint probability density of the observations has the form

p(x1, . . . , xn, θ

)=

θ∏

i=1

exp{− x2i /(2σ

2)}

√2πσ2

n∏

i=θ+1

exp{− (xi − μ)2/(2σ2)

}

√2πσ2

.

51

Page 63: Mathematical Statistics - 213.230.96.51:8090

52 5. Change-Point Problem

Denote by θ0 the true value of the parameter θ. We want to study thelog-likelihood ratio

Ln(θ) − Ln(θ0) = lnp (X1 , . . . , Xn , θ)

p (X1 , . . . , Xn , θ0)=

n∑

i=1

lnp(Xi, θ)

p(Xi, θ0).

Introduce a set of random variables

(5.1) εi =

{Xi/σ if 1 ≤ i ≤ θ0,

−(Xi − μ)/σ if θ0 < i ≤ n .

These random variables are independent and have N (0, 1) distribution withrespect to Pθ0-probability.

Define a stochastic process W (j) for integer-valued j ’s by

(5.2) W (j) =

⎧⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎩

θ0 <i≤ θ0 + j

εi if j > 0,

θ0 + j < i≤ θ0

εi if j < 0,

0 if j = 0,

where the εi’s are as in (5.1) for 1 ≤ i ≤ n; and for i outside of theinterval [1, n] , the εi’s are understood as supplemental independent standardnormal random variables. The processW (j) is called the two-sided Gaussianrandom walk (see Figure 2).

�W (j )

j01− θ0 n− θ0

Figure 2. A sample path of the two-sided Gaussian random walk W .

The next theorem suggests an explicit form of the likelihood ratio interms of the process W and the signal-to-noise ratio c.

Theorem 5.1. For any θ, 1 ≤ θ ≤ n, and any θ0 ∈ Θα, the log-likelihoodratio has the representation

Ln(θ) − Ln(θ0) = cW (θ − θ0) − c 2

2| θ − θ0 |

Page 64: Mathematical Statistics - 213.230.96.51:8090

5.1. Model of Normal Observations 53

where W (θ − θ0) is the two-sided Gaussian random walk in Pθ0-probabilitydefined by (5.2).

Proof. By definition, for any θ > θ0, the log-likelihood ratio is expressedas

Ln(θ) − Ln(θ0) = −θ∑

i=1

X2i

2σ2−

n∑

i= θ+1

(Xi − μ)2

2σ2

+

θ0∑

i=1

X2i

2σ2+

n∑

i= θ0 +1

(Xi − μ)2

2σ2= −

θ∑

i= θ0 +1

X2i

2σ2+

θ∑

i= θ0 +1

(Xi − μ)2

2σ2

σ

θ∑

i= θ0 +1

(− Xi

σ+

μ

)=

μ

σ

θ∑

i= θ0 +1

[−(Xi − μ

σ

)− μ

]

σ

θ∑

i= θ0 +1

εi −μ2

2σ2(θ − θ0) = cW (θ − θ0) − c2

2(θ − θ0)

with c = μ/σ. For θ < θ0, we get a similar formula,

Ln(θ) − Ln(θ0) =μ

σ

θ0∑

i= θ+1

( Xi

σ− μ

)

= c

θ0∑

i= θ+1

εi −c2

2(θ0 − θ) = cW (θ − θ0) − c2

2|θ − θ0|. �

Remark 5.2. The two-sided Gaussian random walk W plays a similar roleas a standard normal random variable Z in the regular statistical modelsunder the LAN condition (see Section 3.4.) The essential difference is thatthe dimension of W grows as n → ∞. �

The next result establishes a rough minimax lower bound.

Lemma 5.3. There exists a positive constant r∗ independent of n such that

infθn

maxθ∈Θα

[(θn − θ)2

]≥ r∗.

Proof. Take θ0, θ1 ∈ Θα such that θ1 − θ0 = 1. From Theorem 5.1, wehave that

Ln(θ1) − Ln(θ0) = cW (1) − c2

2where W (1) is a standard normal random variable in Pθ0-probability. Thus,

Pθ0

(Ln(θ1) − Ln(θ0) ≥ 0

)= Pθ0

(W (1) ≥ c/2

)= p0

with a positive constant p0 independent of n . Taking the same steps as inthe proof of Lemma 3.4, we obtain the result with r∗ = p0/4. �

Page 65: Mathematical Statistics - 213.230.96.51:8090

54 5. Change-Point Problem

Remark 5.4. Lemma 5.3 is very intuitive. Any estimator θn misses thetrue change point θ0 by at least 1 with a positive probability, which is not asurprise due to the stochastic nature of observations. Thus, the anticipatedminimax rate of convergence in the change-point problem should be O(1) asn → ∞. �

Remark 5.5. We can define a change-point problem on the interval [0, 1]by Xi ∼ N (0, σ2) if i/n ≤ θ, and Xi ∼ N (μ, σ2) if i/n > θ. On this scale,the anticipated minimax rate of convergence is O(n−1) for n → ∞. Notethat the convergence in this model is faster than that in regular models, and,though unrelated, is on the order of that in the asymptotically exponentialexperiments. �

5.2. Maximum Likelihood Estimator of Change Point

The log-likelihood function Ln(θ) in the change-point problem can be writ-ten as

Ln(θ) =

n∑

i=1

[− 1

2σ2

(Xi − μ I

(i > θ

) )2− 1

2ln (2πσ2)

], 1 ≤ θ ≤ n.

The maximum likelihood estimator (MLE) of θ exists and is unique withprobability 1 for any true value θ0. We denote the MLE of θ by

θn = argmax1≤ θ≤n

Ln(θ).

The goal of this section is to describe the exact large sample performanceof the MLE.

Introduce a stochastic process

L∞(j) = cW (j) − c 2

2| j | , j ∈ Z,

where the subscript “∞” indicates that j is unbounded in both directions.Define j ∗

∞ as the point of maximum of the process L∞(j),

j ∗∞ = argmax

j ∈Z

L∞(j),

and put

pj = Pθ0

(j ∗∞ = j

), j ∈ Z.

Note that the distribution of j ∗∞ is independent of θ0.

Theorem 5.6. For any θ0 ∈ Θα, and any loss function w, the risk of theMLE θn has a limit as n → ∞, independent of θ0,

limn→∞

Eθ0

[w(θn − θ0

) ]=∑

j ∈Z

w(j) pj .

Page 66: Mathematical Statistics - 213.230.96.51:8090

5.2. Maximum Likelihood Estimator of Change Point 55

Before we turn to proving the theorem, we state two technical lemmas,the proofs of which are postponed until the final section of this chapter.

Lemma 5.7. Let pj = Pθ0

(θn − θ0 = j

). There exist positive constants

b1 and b2 such that for any θ0 ∈ Θα, and any j satisfying 1 ≤ θ0 + j ≤ n,the following bounds hold:

pj ≤ pj ≤ pj + b1 e− b2 n.

Lemma 5.8. There exist positive constants b3 and b4 such that

pj ≤ b3 e− b4 | j | , j ∈ Z.

Proof of Theorem 5.6. Applying Lemma 5.7, we find that

Eθ0

[w(θn − θ0)

]=

j : 1≤ θ0 + j≤n

w(j) pj ≥∑

j : 1≤ θ0 + j≤n

w(j) pj

=∑

j ∈Z

w(j) pj −∑

j ∈Z \ [1−θ0 , n−θ0]

w(j) pj

where the latter sum is taken over integers j that do not belong to theset 1 ≤ θ0 + j ≤ n. As a loss function, w(j) does not increase fasterthan a polynomial function in | j |, while pj , in accordance with Lemma5.8, decreases exponentially fast. Besides, the absolute value of any j ∈Z \ [1− θ0 , n− θ0] is no less than αn. Thus,

limn→∞

j ∈Z \ [1−θ0 , n−θ0]

w(j) pj = 0

and

lim infn→∞

Eθ0

[w(θn − θ0)

]≥∑

j ∈Z

w(j) pj .

Similarly, we get the upper bound

Eθ0

[w(θn − θ0)

]=

j:1≤θ0+j≤n

w(j)pj ≤∑

j:1≤θ0+j≤n

w(j)[pj + b1e

−b2n]

≤∑

j ∈Z

w(j) pj + b1 e− b2 n

[max

j : 1≤ θ0 + j≤nw(j)

].

The maximum of w(j) does not grow faster than a polynomial function inn. That is why the latter term is vanishing as n → ∞, and

lim supn→∞

Eθ0

[w(θn − θ0)

]≤∑

j ∈Z

w(j) pj . �

Page 67: Mathematical Statistics - 213.230.96.51:8090

56 5. Change-Point Problem

5.3. Minimax Limiting Constant

In this section we will find the minimax limiting constant for the quadraticrisk function.

Lemma 5.9. For any θ0, the Bayes estimator θ ∗n with respect to the uniform

prior πn(θ) = 1/n, 1 ≤ θ ≤ n, satisfies the formula

θ ∗n = θ0 +

∑j : 1≤j+θ0≤n j exp

{cW (j) − c 2 | j | / 2

}

∑j : 1≤j+θ0≤n exp

{cW (j) − c 2 | j | / 2

} .

Proof. The proof is left as an exercise (see Exercise 5.30). �

Introduce a new random variable,

(5.3) ξ =

∑j ∈Z

j exp{cW (j) − c 2 | j | / 2

}

∑j ∈Z

exp{cW (j) − c 2 | j | / 2

} .

The next lemma asserts that θ ∗n − θ0 converges to ξ in the quadratic

sense. The proof of the lemma is deferred until Section 5.5.

Lemma 5.10. There exists a finite second moment r∗ = Eθ0

[ξ 2]and,

uniformly in θ0 ∈ Θα, the Bayes estimator θ ∗n satisfies the identity

limn→∞

Eθ0

[ (θ ∗n − θ0

) 2 ]= r∗.

Now we can show that the minimax quadratic risk of any estimator ofθ0 is bounded from below by r∗.

Theorem 5.11. Let r∗ be the constant defined in Lemma 5.10. For all largeenough n, and for any estimator θn, the following inequality takes place:

lim infn→∞

maxθ0∈Θα

Eθ0

[ (θn − θ0

) 2 ] ≥ r∗.

Proof. Assume, without loss of generality, that αn is an integer, and so isN = n − 2αn where N is the number of points in the parameter set Θα.As we typically deal with lower bounds, we estimate the maximum over Θα

from below by the mean value over the same set,

maxθ0 ∈Θα

Eθ0

[(θn − θ0)

2]≥ N−1

θ0 ∈Θα

Eθ0

[(θn − θ0)

2]

(5.4) ≥ N−1∑

θ0 ∈Θα

Eθ0

[(θN − θ0)

2]

where θN is the Bayes estimator with respect to the uniform prior distribu-tion πN (θ) = 1/N, αn ≤ θ ≤ (1− α)n.

Page 68: Mathematical Statistics - 213.230.96.51:8090

5.4. Model of Non-Gaussian Observations 57

Take an arbitrarily small positive β and define a set of integers Θα, β by

Θα, β ={θ : αn + βN ≤ θ ≤ n − (αn + βN )

}.

The set Θα, β contains no less than n− 2 (αn + βN ) = N − 2βN points.This set plays the same role of the “inner points” for Θα as Θα does for theoriginal set

{θ : 1 ≤ θ ≤ n

}. Applying Lemma 5.10 to θN , we have that,

uniformly in θ0 ∈ Θα, β,

limn→∞

Eθ0

[ (θN − θ0

)2 ]= r∗.

Substituting this limit into (5.4), we obtain that for any estimator θn,

lim infn→∞

maxθ0∈Θα

Eθ0

[(θn − θ0

)2] ≥ lim infn→∞

N−1∑

θ0∈Θα

Eθ0

[(θN − θ0

)2]

≥ lim infn→∞

N−1∑

θ0∈Θα,β

Eθ0

[(θN − θ0

)2]

≥ lim infn→∞

{N − 2βN

Nmin

θ0∈Θα,β

Eθ0

[(θN − θ0

)2]}= (1− 2β)r∗.

Since β is arbitrarily small, the result follows. �

Remark 5.12. Combining the results of Lemma 5.10 and Theorem 5.11, weobtain the sharp limit of the quadratic minimax risk. The Bayes estimatorcorresponding to the uniform prior distribution is asymptotically minimax.To the best of our knowledge, the exact value of r∗ is unknown. Nevertheless,it is possible to show that the MLE estimator discussed in Section 5.2 is notasymptotically minimax, since the limit of its quadratic risk determined inTheorem 5.6 is strictly larger than r∗. �

5.4. Model of Non-Gaussian Observations

Let p0(x) be a probability density which is positive on the whole real line,x ∈ R, and let μ = 0, be a fixed number. Assume that observations Xi ’shave a distribution with the density p0(x) if 1 ≤ i ≤ θ , and p0(x − μ) ifθ < i ≤ n. As in the previous sections, an integer θ is an unknown changepoint which belongs to Θα. In this section, we will describe the structureof the likelihood ratio to understand the difference from the case of normalobservations.

Denote by li the log-likelihood ratio associated with a single observationXi,

(5.5) li = lnp0(Xi − μ)

p0(Xi), 1 ≤ i ≤ n.

Page 69: Mathematical Statistics - 213.230.96.51:8090

58 5. Change-Point Problem

The two quantities

K± = −∫ ∞

−∞

[ln

p0(x ± μ)

p0(x)

]p0(x) dx

are called the Kullback-Leibler information numbers.

Let θ0 denote the true value of the parameter θ. Consider the randomvariables

(5.6) εi =

{li + K− if 1 ≤ i ≤ θ0,

− li +K+ if θ0 < i ≤ n.

Lemma 5.13. For any integer i, 1 ≤ i ≤ n, the random variables εi definedin (5.6) have an expected value of zero under Pθ0-distribution.

Proof. If 1 ≤ i ≤ θ0 , then by definition of K− ,

Eθ0 [ εi ] = Eθ0 [ li ] + K− = (−K−) +K− = 0 .

If θ0 < i ≤ n , then

Eθ0 [ εi ] = −∫ ∞

−∞

[ln

p0(x − μ)

p0(x)

]p0(x− μ) dx + K+

= −∫ ∞

−∞

[ln

p0(x)

p0(x+ μ)

]p0(x) dx + K+

=

∫ ∞

−∞

[ln

p0(x + μ)

p0(x)

]p0(x) d x + K+ = −K+ + K+ = 0. �

For any j ∈ Z, analogously to the Gaussian case (cf. (5.2)), introduce astochastic process W (j) by W (0) = 0,

W (j) =∑

θ0 <i≤ θ0 + j

εi if j > 0, and W (j) =∑

θ0 + j < i≤ θ0

εi if j < 0

where for 1 ≤ i ≤ n, εi ’s are the random variables from Lemma 5.13 thathave mean zero with respect to the Pθ0-distribution. For all other values ofi, the random variables εi’s are assumed independent with the zero expectedvalue. Note that W (j) is a two-sided random walk, which in general is notsymmetric. Indeed, the distributions of εi ’s may be different for i ≤ θ0 andi > θ0.

Define a constant Ksgn(i) as K+ for i > 0, and K− for i < 0.

Theorem 5.14. For any integer θ, 1 ≤ θ ≤ n, and any true change pointθ0 ∈ Θα, the log-likelihood ratio has the form

Ln(θ) − Ln(θ0) = lnp(X1 , . . . , Xn , θ)

p(X1 , . . . , Xn , θ0)= W (θ−θ0) −Ksgn(θ−θ0) | θ−θ0 |.

Page 70: Mathematical Statistics - 213.230.96.51:8090

5.5. Proofs of Lemmas 59

Proof. The joint density is computed according to the formula

p(X1 , . . . , Xn , θ) =∏

1≤ i≤ θ

p0(Xi)∏

θ < i≤n

p0(Xi − μ).

Therefore, if θ > θ0 , the likelihood ratio satisfies

Ln(θ) − Ln(θ0) = ln∏

θ0 <i≤ θ

p0(Xi)

p0(Xi − μ)=

θ0 <i≤ θ

(−li)

=∑

θ0 < i≤ θ

(εi − K+

)= W (θ − θ0) − K+ (θ − θ0).

In the case θ < θ0, we write

Ln(θ) − Ln(θ0) = ln∏

θ < i≤ θ0

p0(Xi − μ)

p0(Xi)=

θ < i≤ θ0

li

=∑

θ < i≤ θ0

(εi − K−

)= W (θ − θ0) − K− |θ − θ0|. �

From Theorem 5.14 we can expect that the MLE of θ0 in the non-Gaussian case possesses properties similar to that in the normal case. Thisis true under some restrictions on p0 (see Exercise 5.33).

5.5. Proofs of Lemmas

Proof of Lemma 5.7. Note that if j ∗∞ = j, and j is such that 1 ≤ θ0 +

j ≤ n, then θn − θ0 = j . Therefore, for any j satisfying 1 ≤ θ0 + j ≤ n,we have

pj = Pθ0

(j ∗∞ = j

)≤ Pθ0

(θn − θ0 = j

)= pj

= Pθ0

(θn − θ0 = j, 1 ≤ j ∗

∞ + θ0 ≤ n)+ Pθ0

(θn − θ0 = j, j ∗

∞ + θ0 ∈ [1, n])

≤ pj + Pθ0

(j ∗∞ + θ0 ∈ [1, n]

)= pj + Pθ0

(j ∗∞ ≤ −θ0

)+ Pθ0

(j ∗∞ ≥ n+ 1− θ0

)

≤ pj + Pθ0

(j ∗∞ ≤ −αn

)+ Pθ0

(j ∗∞ ≥ αn

)≤ pj +

k≤−αn

pk +∑

k≥αn

pk

= pj + 2∑

k≥αn

pk ≤ pj + 2b3∑

k≥αn

e−b4k = pj + b1e−b2n

where we have applied Lemma 5.8. The constants b1 and b2 are independentof n, b1 = 2 b3/(1− exp{−b4}) and b2 = b4 α. �Proof of Lemma 5.8. For a positive integer j we can write

pj = Pθ0

(j ∗∞ = j

)≤ Pθ0

(j ∗∞ ≥ j

)≤ Pθ0

(maxk≥ j

[cW (k)− c2k/2

]≥ 0

)

≤∑

k≥ j

Pθ0

(cW (k) − c2k/2 ≥ 0

)=∑

k≥ j

Pθ0

(W (k)/

√k ≥ c

√k/2

)

Page 71: Mathematical Statistics - 213.230.96.51:8090

60 5. Change-Point Problem

=∑

k≥ j

[1 − Φ

(c√k/2) ]

≤∑

k≥ j

exp{− c2 k/8 } ≤ b3 exp{− b4 j}

with the positive constants b3 = 1/(1− exp{−c 2/8}) and b4 = c 2/8.

In the above, we have estimated the probability Pθ0

(j ∗∞ ≥ j

)using the

fact that cW (j) − c 2 j/2 = 0 at j = 0, which implies that at the pointof the maximum cW (j ∗

∞) − c 2 j ∗∞/2 ≥ 0. Next we applied the fact that

W (k)/√k has the standard normal distribution with the c.d.f. Φ(x). In the

last step, the inequality 1− Φ(x) ≤ exp{−x2/2} is used.

A similar upper bound for the probabilities pj holds for negative j, sothat

pj ≤ b3 exp{− b4 | j | }, j ∈ Z . �Proof of Lemma 5.10. First we show that uniformly in θ0 ∈ Θα, thereexists a limit in Pθ0-probability

limn→∞

(θ ∗n − θ0

)= ξ.

To see that ξ is well-defined, it suffices to show that the denominatorin the definition (5.3) of ξ is separated away from zero. Indeed, at j = 0the contributing term equals 1. We want to demonstrate that the infinitesums in the numerator and denominator converge in probability. That is,for however small the positive ε, uniformly in Pθ0-probability, the followinglimits exist:

(5.7) limn→∞

Pθ0

( ∑

j : j+ θ0 ∈ [1, n]

| j | exp{cW (j) − c 2 | j | / 2

}> ε

)= 0

and

(5.8) limn→∞

Pθ0

( ∑

j : j+ θ0 ∈ [1, n]

exp{cW (j) − c 2 | j | / 2

}> ε

)= 0.

Introduce a random variable:

(5.9) ζ = min{k : W (|j |) ≤ c | j |/ 4 for all j , | j | ≥ k

}.

Starting from ζ, the random walk W (j) does not exceed c | j | / 4. First ofall, note that the tail probabilities for ζ are decreasing exponentially fast.In fact,

Pθ0

(ζ ≥ k

)≤ Pθ0

( ⋃

j≥k

{W (j) ≥ cj/4

})+ Pθ0

( ⋃

j≤−k

{W (j) ≥ c|j|/4

})

≤ 2∑

j≥k

Pθ0

(W (j) ≥ cj/4

)= 2

j≥k

Pθ0

(W (j)/

√j ≥ c

√j/4)

≤ 2∑

j≥k

exp{− c2j/32

}≤ a1 exp

{− c2k/32

}= a1 exp

{− a2k

}

Page 72: Mathematical Statistics - 213.230.96.51:8090

5.5. Proofs of Lemmas 61

with a1 = 2/(1− exp{−c2/32}) and a2 = c2/32.

Next we verify (5.7). The limit (5.8) is shown analogously. We have

Pθ0

( ∑

j+θ0 ∈[1,n]|j| exp

{cW (j)− c2|j|/2

}> ε)

≤ Pθ0

( ∑

|j|≥αn

|j| exp{cW (j)− c2|j|/2

}> ε)

≤ Pθ0

( ∑

|j|≥αn

|j| exp{cW (j)− c2|j|/2

}> ε; ζ ≤ αn

)+ Pθ0

(ζ > αn

)

≤ Pθ0

( ∑

|j|≥αn

|j| exp{− c2|j|/4

}> ε; ζ ≤ αn

)+ Pθ0

(ζ > αn

)

≤ Pθ0

( ∑

|j|≥αn

|j| exp{− c2|j|/4

}> ε)+ Pθ0

(ζ > αn

)

where we have applied the definition of the random variable ζ. In the lattersum, the first probability is zero for all sufficiently large n as the probabilityof a non-random event. The second probability is decreasing exponentiallyfast as n → ∞, which proves (5.7).

Further, we check that the second moment of ξ is finite despite the factthat neither the numerator nor denominator is integrable (see Exercise 5.31).Thus, we prove that there exists a finite second moment

r ∗ = Eθ0 [ ξ2 ] < ∞.

Introduce the notation for the denominator in the formula (5.3) for therandom variable ξ,

D =∑

j∈Zexp

{cW (j) − c 2 | j |/2

}.

Involving the random variable ζ defined in (5.9), we write

| ξ | =∑

| j | ≤ ζ

| j |D−1 exp{cW (j) − c2 | j |/2

}

+∑

| j |>ζ

| j |D−1 exp{cW (j) − c 2 | j |/2

}.

Note that for any j, D−1 exp{ cW (j) − c 2 | j |/2 } ≤ 1. We substitute thisinequality into the first sum. In the second sum, we use the obvious factthat D > exp{ cW (0) } = 1. Hence, we arrive at the following inequalities:

| ξ | ≤ ζ 2 +∑

| j |>ζ

| j | exp{ cW (j) − c 2 | j |/2 }.

Page 73: Mathematical Statistics - 213.230.96.51:8090

62 5. Change-Point Problem

If | j | is larger than ζ, then we can bound W (j) from above by c | j | / 4 andfind that

| ξ | ≤ ζ 2 + 2∑

j > ζ

j exp{− c 2 j / 4

}

(5.10) ≤ ζ 2 + 2∑

j≥ 1

j exp{− c 2 j / 4

}= ζ 2 + a3

with a3 =∑

j≥ 1 j exp{−c2 j/4}. Because the tail probabilities of ζ de-crease exponentially fast, any power moment of ξ is finite, in particular,r∗ = Eθ0

[ξ2]< ∞.

Finally, we verify that θ ∗n − θ0 converges to ξ in the L2 sense, that is,

uniformly in θ0 ∈ Θα,

limn→∞

Eθ0

[ (θ ∗n − θ0 − ξ

) 2 ]= 0.

Apply the representation for the difference θ ∗n − θ0 from Lemma 5.9. Simi-

larly to the argument used to derive (5.10), we obtain that∣∣ θ ∗

n − θ0∣∣ ≤ ζ 2 + a3

with the same definitions of the entries on the right-hand side. Thus, thedifference

∣∣θ ∗

n − θn − ξ∣∣ ≤

∣∣θ ∗

n − θ0∣∣ +

∣∣ξ∣∣ ≤ 2

(ζ 2 + a3

)

where the random variable ζ 2 + a3 is square integrable and independentof n . As shown above, the difference θ ∗

n − θn − ξ converges to 0 in Pθ0-probability as n → ∞. By the dominated convergence theorem, this differ-ence converges to zero in the quadratic sense. �

Exercises

Exercise 5.30. Prove Lemma 5.9.

Exercise 5.31. Show that Eθ0

[exp

{cW (j) − c 2 | j | / 2

} ]= 1, for any

integer j. Deduce from here that the numerator and denominator in (5.3)have infinite expected values.

Exercise 5.32. Show that the Kullback-Leibler information numbers K±are positive. Hint: Check that −K± < 0. Use the inequality ln(1 + x) < x,for any x = 0.

Page 74: Mathematical Statistics - 213.230.96.51:8090

Exercises 63

Exercise 5.33. Suppose that Eθ0

[| li |5+δ

]< ∞ for a small δ > 0, where

li’s are the log-likelihood ratios defined in (5.5), i = 1, . . . , n, and θ0 denotesthe true value of the change point. Show that uniformly in θ0 ∈ Θα, thequadratic risk of the MLE θn of θ0 is finite for any n , that is,

Eθ0

[ (θn − θ0

)2 ]< ∞ .

Hint: Argue that if θn > θ0,

Pθ0

(θn − θ0 = m

)≤

∞∑

l=m

Pθ0

( l∑

i=1

εi ≥ K+ l)

where εi’s are as in (5.6). Now use the fact that if Eθ0

[| εi |5+δ

]< ∞, then

there exists a positive constant C with the property

Pθ0

( l∑

i=1

εi ≥ K+ l)

≤ C l−(4+δ).

This fact can be found, for example, in Petrov [Pet75], Chapter IX, Theo-rem 28.

Exercise 5.34. Consider the Bernoulli model with 30 independent obser-vations Xi that take on values 1 or 0 with the respective probabilities p i and1 − p i, i = 1 , . . . , 30. Suppose that p i = 0.4 if 1 ≤ i ≤ θ, and p i = 0.7if θ < i ≤ 30, where θ is an integer change point, 1 ≤ θ ≤ 30. Estimate θfrom the following set of data:

i Xi i Xi i Xi

1 0 11 0 21 12 0 12 1 22 13 1 13 1 23 14 0 14 0 24 05 1 15 1 25 16 0 16 0 26 07 0 17 0 27 18 1 18 1 28 19 1 19 1 29 110 0 20 0 30 0

Exercise 5.35. Suppose that in the change-point problem, observationsXi have a known c.d.f. F1(x), x ∈ R, for 1 ≤ i ≤ θ, and another knownc.d.f. F2(x), x ∈ R, for θ < i ≤ n. Assume that the two c.d.f.’s are notidentically equal. Suggest an estimator of the true change point θ0 ∈ Θα.Hint: Consider a set X such that

∫XdF1(x) =

∫XdF2(x), and introduce

indicators I(Xi ∈ X).

Page 75: Mathematical Statistics - 213.230.96.51:8090
Page 76: Mathematical Statistics - 213.230.96.51:8090

Chapter 6

Sequential Estimators

6.1. The Markov Stopping Time

Sequential estimation is a method in which the size of the sample is notpredetermined, but instead parameters are estimated as new observationsbecome available. The data collection is terminated in accordance with apredefined stopping rule.

In this chapter we consider only the model of Gaussian observations. Weaddress two statistical problems, using the sequential estimation approach.First, we revisit the change-point problem discussed in Chapter 5, and,second, we study the parameter estimation problem from a sample of arandom size in an autoregressive model. Solutions to both problems arebased on the concept of the Markov stopping time.

Let Xi, i = 1, 2, . . . , be a sequence of real-valued random variables. Forany integer t ≥ 1, and any real numbers ai and bi such that ai ≤ bi, considerthe random events

{Xi ∈ [ai , bi]

}, i = 1, . . . , t.

All countable intersections, unions and complements of these randomevents form a σ-algebra generated by the random variables X1, . . . , Xt. De-note this σ-algebra by Ft, that is,

Ft = σ{Xi ∈ [ai , bi] , i = 1, . . . , t

}.

All the random events that belong to Ft are called Ft-measurable. Weinterpret the integer t as time, and we call Ft a σ-algebra generated by theobservations Xi up to time t.

65

Page 77: Mathematical Statistics - 213.230.96.51:8090

66 6. Sequential Estimators

It is easily seen that these inclusions are true:

F1 ⊆ F2 ⊆ · · · ⊆ F

where F denotes the σ-algebra that contains all the σ-algebras Ft. The setof the ordered σ-algebras {Ft , t ≥ 1 } is called a filter.

Example 6.1. The random event {X21 + X2

2 < 1 } is F2-measurable.Indeed, this random event can be presented as the union of intersections:

∞⋃

m=1

i2+j2 <m2

({ |X1| < i/m } ∩ { |X2| < j/m }

). �

An integer-valued random variable τ , τ ∈ {1, 2, . . . }, is called a Markovstopping time (or, simply, stopping time) with respect to the filter {Ft, t ≥ 1 }if for any integer t, the random event { τ = t } is Ft-measurable.

Example 6.2. The following are examples of the Markov stopping times(for the proof see Exercise 6.37):(i) A non-random variable τ = T where T is a given positive integer number.(ii) The first time when the sequence Xi hits a given interval [a, b], that is,τ = min{ i : Xi ∈ [a, b] }.(iii) The minimum or maximum of two given Markov stopping times τ1 andτ2, τ = min(τ1, τ2) or τ = max(τ1, τ2).(iv) The time τ = τ1 + s for any positive integer s, where τ1 is a givenMarkov stopping time. �

Example 6.3. Some random times are not examples of Markov stoppingtimes (for the proof see Exercise 6.38):(i) The last time when the sequence Xi, 1 ≤ i ≤ n, hits a given interval[a, b], that is, τ = max{ i : Xi ∈ [a, b], 1 ≤ i ≤ n}.(ii) The time τ = τ1 − s for any positive integer s, where τ1 is a givenstopping time. �

Lemma 6.4. If τ is a stopping time, then the random events: (i) {τ ≤ t}is Ft-measurable, and (ii) {τ ≥ t} is Ft−1-measurable.

Proof. (i) We write {τ ≤ t} =⋃t

s=1 {τ = s}, where each event {τ = s}is Fs-measurable, and since Fs ⊆ Ft, it is Ft-measurable as well. Thus,{τ ≤ t} is Ft-measurable as the union of Ft-measurable events.(ii) The random event {τ ≥ t} is Ft−1-measurable as the complement of{τ < t} = {τ ≤ t− 1}, an Ft−1-measurable event. �

The next important result is known as Wald’s first identity.

Page 78: Mathematical Statistics - 213.230.96.51:8090

6.1. The Markov Stopping Time 67

Theorem 6.5. Let X1, X2, . . . , be a sequence of independent identically dis-tributed random variables with E[X1 ] < ∞. Then for any Markov stoppingtime τ such that E[ τ ] < ∞, the following identity holds:

E[X1 + · · ·+Xτ

]= E[X1 ]E[ τ ].

Proof. By definition,

E[X1 + · · ·+Xτ

]= E

[ ∞∑

t=1

(X1 + · · ·+Xt) I(τ = t)]

= E[X1 I(τ ≥ 1) +X2 I(τ ≥ 2) + · · ·+Xt I(τ ≥ t) + . . .

].

For a Markov stopping time τ, the random event {τ ≥ t} is Ft−1-measurableby Lemma 6.4, that is, it is predictable from the observations up to time t−1,X1, . . . , Xt−1, and is independent of the future observations Xt, Xt+1, . . . . Inparticular, Xt and I(τ ≥ t) are independent, and hence, E

[Xt I(τ ≥ t)

]=

E[X1 ]P( τ ≥ t ). Consequently,

E[X1 + · · ·+Xτ

]= E[X1 ]

∞∑

t=1

P(τ ≥ t) = E[X1 ]E[ τ ].

Here we used the straightforward fact that∞∑

t=1

P( τ ≥ t ) = P(τ = 1) + 2P(τ = 2) + 3P(τ = 3) + . . .

=∞∑

t=1

tP(τ = t) = E[ τ ]. �

Let τ be a Markov stopping time. Introduce a set of random events:

Fτ ={A ∈ F : A ∩ {τ = t} ∈ Ft for all t, t ≥ 1

}.

Lemma 6.6. The set Fτ is a σ-algebra, that is, this set is closed undercountable intersections, unions, and complements.

Proof. Suppose events A1 and A2 belong to Fτ . To show that Fτ is a σ-algebra, it suffices to show that the intersection A1∩A2, union A1∪A2, andcomplement A1 belong to Fτ . The same proof extends to countably manyrandom events.

Denote by B1 = A1 ∩ {τ = t} ∈ Ft and B2 = A2 ∩ {τ = t} ∈ Ft. Theintersection A1 ∩A2 satisfies

(A1 ∩A2) ∩ {τ = t} = B1 ∩ B2 ∈ Ft,

as the intersection of two Ft-measurable events. Also, the union A1 ∪A2 issuch that

(A1 ∪A2) ∩ {τ = t} = B1 ∪ B2 ∈ Ft,

Page 79: Mathematical Statistics - 213.230.96.51:8090

68 6. Sequential Estimators

as the union of two Ft-measurable events. As for the complement A1, notefirst that both events {τ = t} and A1 ∩ {τ = t} belong to Ft, therefore,

A1 ∩ {τ = t} = {τ = t} \ (A1 ∩ {τ = t})

= {τ = t} ∩A1 ∩ {τ = t} ∈ Ft,

as an intersection of two Ft-measurable events. �

The σ-algebra Fτ is referred to as a σ-algebra of random events measur-able up to the random time τ.

Lemma 6.7. The Markov stopping time τ is Fτ -measurable.

Proof. For any positive integer s, put A = {τ = s}. We need to show thatA ∈ Fτ . For all t we find that

A ∩ {τ = t} = {τ = s} ∩ {τ = t} = {τ = t} if s = t,

and is the empty set, otherwise. The set {τ = t} belongs to Ft by thedefinition of a stopping time. The empty set is Ft - measurable as well(refer to Exercise 6.36). Thus, by the definition of Fτ , the event A belongsto Fτ . �

Recall that we defined Ft as a σ-algebra generated by the random vari-ables Xi up to time t.

Lemma 6.8. The random variable Xτ is Fτ -measurable.

Proof. Take any interval [a, b] and define A = {Xτ ∈ [a, b] }. Note that

A =

∞⋃

s=1

({Xs ∈ [a, b] } ∩ {τ = s}

).

Then for all t, we have that

A ∩ {τ = t} =∞⋃

s=1

({Xs ∈ [a, b] } ∩ {τ = s}

)⋂{τ = t}

= {Xt ∈ [a, b] } ∩ {τ = t}.The latter intersection belongs to Ft because both random events belong toFt. Hence A is Fτ -measurable. �

Remark 6.9. The concept of the σ-algebra Fτ is essential in the sequentialanalysis. All parameter estimators constructed from sequential observationsare Fτ -measurable, that is, are based on observations X1, . . . , Xτ obtainedup to a random stopping time τ. �

Page 80: Mathematical Statistics - 213.230.96.51:8090

6.2. Change-Point Problem. Rate of Detection 69

6.2. Change-Point Problem. Rate of Detection

In this section we return to the change-point problem studied in Chapter 5,and look at it from the sequential estimation point of view. The statisticalsetting of the problem is modified. If previously all n observations wereavailable for estimation of the true change point θ0 ∈ Θα, in this section weassume that observations Xi’s arrive sequentially one at a time at momentsti = i where i = 1, . . . , n.

Define a filter{Ft, 1 ≤ t ≤ n

}of σ-algebras Ft generated by the

observations X1, . . . , Xt up to time t, 1 ≤ t ≤ n. Introduce T as a set of allMarkov stopping times with respect to this filter.

If a sequential estimator τn of the change point θ0 belongs to T , thatis, if τn is a Markov stopping time, then we call this estimator an on-linedetector (or just detector) and the estimation problem itself, the on-linedetection problem (or, simply, detection).

In the on-line detection problem, we use the same loss functions as in theregular estimation problems studied so far. For example, for the quadraticloss, the minimax risk of detection is defined as

rDn = infτn∈T

maxθ0∈Θα

Eθ0

[(τn − θ0)

2].

The crucial difference between the minimax risk rn in the previous chap-ters and rDn consists of restrictions on the set of admissible estimators. Inthe on-line detection, we cannot use an arbitrary function of observations.

Remark 6.10. In this section we focus our attention on the quadraticloss, even though, sometimes in practice, other loss functions are used. Forinstance, we can restrict the class of admissible detectors to a class Tγ definedby

(6.1) Tγ ={τn : τn ∈ T and max

θ0∈Θα

Pθ0

(τn ≤ θ0

)≤ γ

}

where γ is a given small positive number. The probability Pθ0

(τn ≤ θ0

)is

called the false alarm probability. The name is inherited from the military airdefense problems where θ is associated with the time of a target appearance,so that any detection of the target before it actually appears on the radarscreen is, indeed, a false alarm. The condition on detectors in Tγ requiresthat the false alarm probability is small, uniformly in θ0 ∈ Θα . Anothernatural criterion in detection, also rooted in the military concerns, is a so-called expected detection delay,

(6.2) Eθ0

[(τn − θ0)+

]= Eθ0

[(τn − θ0) I( τn > θ0 )

].

Page 81: Mathematical Statistics - 213.230.96.51:8090

70 6. Sequential Estimators

The expected detection delay generates a minimax risk

infτn∈Tγ

maxθ0∈Θα

Eθ0

[( τn − θ0 )+

].

Clearly, any additional constraints on the admissible detectors increasethe value of the minimax risk. And every additional restriction makes theproblem more difficult. �

Below we find the rate of convergence for the minimax quadratic risk ofdetection rDn for the Gaussian model, and define the rate-optimal detectors.

Assume that Xi ∼ N (0, σ2) if 1 ≤ i ≤ θ0, and Xi ∼ N (μ, σ2) if θ0 < i ≤n, where μ > 0 is known. Our goal is to show that there exists a Markovstopping time τ∗n such that its deviation away from the true value of θ0 hasthe magnitude O(lnn) as n → ∞. It indicates a slower rate of convergencefor the on-line detection as opposed to the estimation based on the entiresample. Recall that in the latter case, the rate is O(1).

Remark 6.11. Note that on the integer scale, the convergence with therate O(lnn) is not a convergence at all. This should not be surprising sincethe convergence rate of O(1) means no convergence as well. If we compressthe scale and consider the on-line detection problem on the unit interval[0, 1] with the frequency of observations n (see Remark 5.5), then the rateof convergence guaranteed by the Markov stopping time detectors becomes(lnn)/n. �

Theorem 6.12. In the on-line detection problem with n Gaussian observa-tions, there exists a Markov stopping time τ∗n and a constant r∗ independentof n such that the following upper bound holds:

maxθ0∈Θα

Eθ0

[ (τ∗n − θ0lnn

)2 ]≤ r∗.

Proof. The construction of the stopping time τ∗n is based on the idea ofaveraging. Roughly speaking, we partition the interval [1, n] and computethe sample means in each of the subintervals. At the lower end of theinterval, the averages are close to zero. At the upper end, they are close tothe known number μ, while in the subinterval that captures the true changepoint, the sample mean is something in-between.

Put N = b lnn where b is a positive constant independent of n that willbe chosen later. Define M = n/N . Without loss of generality, we assumethat N and M are integer numbers. Introduce the normalized mean valuesof observations in subintervals of length N by

Xm =1

μN

N∑

i=1

X(m−1)N+i , m = 1, . . . ,M.

Page 82: Mathematical Statistics - 213.230.96.51:8090

6.2. Change-Point Problem. Rate of Detection 71

Let m0 be an integer such that (m0 − 1)N + 1 ≤ θ0 ≤ m0N . Note that

Xm =1

N

N∑

i=1

I((m− 1)N + i > θ0

)+

1

cN

N∑

i=1

ε(m−1)N+i

where εi’s are independent standard normal random variables and c is thesignal-to-noise ratio, c = μ/σ. For any m < m0 the first sum equals zero,and for m > m0 this sum is 1. The value of the first sum at m = m0 is anumber between 0 and 1, and it depends on the specific location of the truechange point θ0. The second sum can be shown to be

1

cN

N∑

i=1

ε(m−1)N+i =1

c√N

Zm, m = 1, . . . ,M,

where the Zm’s are independent standard normal random variables underPθ0-probability.

Next we show that for sufficiently large n,

Pθ0

(max

1≤m≤M|Zm | ≥

√10 lnM

)≤ n−3.

Put y =√10 lnM > 1. Notice that the probability that the maximum is not

less than y equals the probability that at least one of the random variablesis not less than y, therefore, we estimate

P(

max1≤m≤M

|Zm | ≥ y)= P

( M⋃

m=1

{ |Zm | ≥ y })≤

M∑

m=1

P(|Zm | ≥ y

)

= 2M ( 1− Φ(y) ) ≤ 2√

2πy2M exp{−y2/2} ≤ M exp{−y2/2}

where Φ(y) denotes the cumulative distribution function of a N (0, 1) ran-dom variable. In the above we used the standard inequality 1 − Φ(y) ≤exp{−y2/2}/

√2πy2 if y > 1. Thus, we have

Pθ0

(max

1≤m≤M|Zm | ≥

√10 lnM

)≤ M exp{−10 lnM/2}

= M−4 = (n/(b lnn))−4 ≤ n−3.

Consider the random event

A ={

max1≤m≤M

|Zm | <√10 lnM

}.

We have just shown that, uniformly in θ0 ∈ Θα, the probability of A, thecomplement of A, is bounded from above,

(6.3) Pθ0

(A)≤ n−3.

Page 83: Mathematical Statistics - 213.230.96.51:8090

72 6. Sequential Estimators

Choose b = 103 c−2 . If the event A occurs, we have the inequalities

max1≤m≤M

∣∣∣1

c√N

Zm

∣∣∣ <1

c

√10 lnM/N

=1

c

√10 (lnn − ln lnn − ln b) / (b lnn) ≤

√10/(b c2) = 0.1.

Now we can finalize the description of the averaged observations Xm =Bm + ξm where Bm’s are deterministic with the property that Bm = 0if m < m0 , and Bm = 1 if m > m0. The random variables | ξm | =

|Zm / (c√N) | do not exceed 0.1 if the random event A holds.

We are ready to define the Markov stopping time that estimates thechange point θ0. Define an integer-valued random variable m∗ by

m∗ = min{m : Xm ≥ 0.9 , 1 ≤ m ≤ M

},

and formally put m∗ = M if Xm < 0.9 for all m. Under the random eventA, the minimal m∗ exists and is equal to either m0 or m0 + 1.

Introduce a random variable

(6.4) τ∗n = m∗N.

If t is an integer divisible by N , then the random event { τ∗n = t } is definedin terms of X1 , . . . , Xt/N , that is, in terms of X1 , . . . , Xt, which means thatτ∗n is Ft-measurable. Thus, τ∗n is a stopping time. We take τ∗n as the on-linedetector. The next step is to estimate its quadratic risk.

As shown above, the inclusion A ⊆ { 0 ≤ m∗ − m0 ≤ 1 } is true. Thedefinition of m0 implies the inequalities 0 ≤ τ∗n − θ0 ≤ 2N. We write

maxθ0∈Θα

Eθ0

[ ((τ∗n − θ0)/ lnn

)2 ]

= maxθ0∈Θα

(Eθ0

[ ((τ∗n − θ0)/ lnn

)2I(A)

]+ Eθ0

[ ((τ∗n − θ0)/ lnn

)2I(A)

] )

≤ maxθ0∈Θα

(Eθ0

[( 2N/ lnn )2 I(A)

]+ Eθ0

[(n/ lnn

)2I(A)

] )

≤(2N/ lnn

)2+(n/ lnn

)2n−3 ≤ 4 b2 + 2

where at the final stage we have applied (6.3) and the trivial inequality1/(n ln2 n) < 2 , n ≥ 2. Thus, the statement of the theorem follows withr∗ = 4 b2 + 2. �

Page 84: Mathematical Statistics - 213.230.96.51:8090

6.3. Minimax Limit in the Detection Problem. 73

6.3. Minimax Limit in the Detection Problem.

The rate lnn in the on-line detection which is guaranteed by Theorem 6.12is the minimax rate. We show in this section that it cannot be improved byany other detector.

Recall that T denotes the class of all Markov stopping times with respectto the filter generated by the observations.

Theorem 6.13. In the on-line detection problem with n Gaussian observa-tions, there exists a positive constant r∗ independent of n such that

lim infn→∞

infτn ∈T

maxθ0 ∈Θα

Eθ0

[ ( τn − θ0lnn

)2 ]≥ r∗.

Proof. Choose points t0, . . . , tM in the parameter set Θα such that tj −tj−1 = 3 b lnn, j = 1, . . . ,M, with a positive constant b independent of n.The exact value of b will be selected later. Here the number of points M isequal to M = n (1 − 2α)/(3 b lnn). We assume, without loss of generality,that M and b lnn are integers.

We proceed by contradiction and assume that the claim of the theoremis false. Then there exists a detector τn such that

limn→∞

max0≤ j≤M

Etj

[ ( τn − tjlnn

)2 ]= 0,

which implies that

limn→∞

max0≤ j≤M

Ptj

(| τn − tj | > b lnn

)= 0.

Indeed, by the Markov inequality,

Ptj

(| τn − tj | > b lnn

)≤ b−2

Etj

[ ( τn − tjlnn

)2 ].

Hence for all large enough n, the following inequalities hold:

(6.5) Ptj

(| τn − tj | ≤ b lnn

)≥ 3/4 , j = 0 , . . . ,M.

Consider the inequality for j = M. Then

1

4≥ PtM

(| τn − tM | > b lnn

)≥ PtM

( M−1⋃

j=0

{| τn − tj | ≤ b lnn

} )

=

M−1∑

j=0

PtM

(| τn − tj | ≤ b lnn

)

(6.6) =M−1∑

j=0

Etj

[ dPtM

dPtj

I(| τn − tj | ≤ b lnn

) ].

Page 85: Mathematical Statistics - 213.230.96.51:8090

74 6. Sequential Estimators

Indeed, if τn is close to one of tj , j = 0, . . . ,M − 1, then τn is distant fromtM , and the random events

{| τn − tj | ≤ b lnn

}are mutually exclusive.

The likelihood ratio has the form

dPtM

dPtj

= exp{ μ

σ

tM∑

i=tj+1

[−(Xi − μ

σ

)− μ

]}

= exp{c

tM∑

i=tj+1

εi −c2

2(tM − tj)

}

where c = μ/σ is the signal-to-noise ratio, and εi = −(Xi − μ)/σ have thestandard normal distribution with respect to the Ptj -probability. Note thatthe number of terms in the sum from tj + 1 to tM can be as large as O(n).Further, let

Bj ={| τn − tj | ≤ b lnn

}.

Thus, each expectation in (6.6) can be written as

Etj

[ dPtM

dPtj

I(| τn − tj | ≤ b lnn

) ]

= Etj

[exp

{c

tM∑

i=tj+1

εi −c2

2(tM − tj)

}I(Bj )

].

Put uj = tj + b lnn. The event Bj is Fuj -measurable because τn isa Markov stopping time. Hence Bj is independent of the observationsXuj+1, . . . , XtM . Equivalently, I(Bj) is independent of εi for i = uj+1, . . . , tM .Note also that

Etj

[exp

{c

tM∑

i=uj+1

εi−c2

2(tM −uj)

}]= exp

{ tM∑

i=uj+1

c2

2− c2

2(tM −uj)

}= 1.

We write

Etj

[ dPtM

dPtj

I(| τn − tj | ≤ b lnn

) ]

= Etj

[exp

{c

uj∑

i=tj+1

εi −c2

2(uj − tj)

}I(Bj )

]

= Etj

[exp

{c√b lnn Zj − c2

2b lnn

}I(Bj )

]

where Zj =∑uj

i=tj+1 εi/√b lnn is a standard normal random variable with

respect to the Ptj -probability,

≥ Etj

[exp

{c√b lnn Zj − c2

2b lnn

}I(Bj ) I(Zj ≥ 0 )

]

Page 86: Mathematical Statistics - 213.230.96.51:8090

6.4. Sequential Estimation in the Autoregressive Model 75

≥ exp{

− c2

2b lnn

}Ptj

(Bj ∩ {Zj ≥ 0}

).

Further, the probability of the intersection

Ptj

(Bj ∩ {Zj ≥ 0}

)= Ptj (Bj) + Ptj

(Zj ≥ 0

)− Ptj

(Bj ∪ {Zj ≥ 0}

)

≥ Ptj (Bj) + Ptj

(Zj ≥ 0

)− 1 ≥ 3

4+

1

2− 1 =

1

4.

In the last step we used the inequality (6.5) and the fact that Ptj

(Zj ≥ 0

)=

1/2.

Thus, if we choose b = c−2, then the following lower bound holds:

Etj

[ dPtM

dPtj

I(| τn − tj | ≤ b lnn

) ]

≥ 1

4exp

{− c2

2b lnn

}=

1

4√n.

Substituting this inequality into (6.6), we arrive at a contradiction,

1

4≥

M−1∑

j=0

1

4√n

=M

4√n

=n(1− 2α)

3b lnn 4√n

=1− 2α

12b

√n

lnn→ ∞ as n → ∞.

This implies that the statement of the theorem is true. �

6.4. Sequential Estimation in the Autoregressive Model

In the previous two sections we applied the sequential estimation method tothe on-line detection problem. In this section, we demonstrate this techniquewith another example, the first-order autoregressive model (also, termedautoregression). Assume that the observations Xi satisfy the equation

(6.7) Xi = θ Xi−1 + εi, i = 1, 2, . . .

with the zero initial condition, X0 = 0. Here εi’s are independent normalrandom variables with mean zero and variance σ2. The autoregression co-efficient θ is assumed bounded, −1 < θ < 1. Moreover, the true value ofthis parameter is strictly less than 1, θ0 ∈ Θα = {θ : | θ0 | ≤ 1− α } with agiven small positive number α.

The following lemma describes the asymptotic behavior of autoregres-sion. The proof of the lemma is moved to Exercise 6.42.

Lemma 6.14. (i) The autoregressive model admits the representation

Xi = εi + θ εi−1 + θ2 εi−2 + . . .+ θ i−2ε2 + θ i−1ε1 , i = 1, 2, . . . .

Page 87: Mathematical Statistics - 213.230.96.51:8090

76 6. Sequential Estimators

(ii) The random variable Xi is normal with the zero mean and variance

σ2i = Var[Xi ] = σ2 1− θ2i

1− θ2.

(iii) The variance of Xi has the limit

limi→∞

σ2i = σ2

∞ =σ2

1− θ2.

(iv) The covariance between Xi and Xi+j, j ≥ 0, is equal to

Cov[Xi, Xi+j ] = σ2 θj1− θ2i

1− θ2.

Our objective is to find an on-line estimator of the parameter θ. Beforewe do this, we first study the maximum likelihood estimator (MLE).

6.4.1. Heuristic Remarks on MLE. Assume that only n observationsare available, X1, . . . , Xn. Then the log-likelihood function has the form

Ln(θ) =

n∑

i=1

[− (Xi − θ Xi−1)

2

2σ2− 1

2ln(2πσ2)

].

Differentiating with respect to θ, we find the classical MLE θ ∗n of the au-

toregression coefficient θ:

θ ∗n =

∑ni=1 Xi−1Xi∑ni=1 X2

i−1

.

The MLE does not have a normal distribution, which is easy to show forn = 2,

θ∗2 =X0X1 + X1X2

X20 + X2

1

=X1X2

X21

=ε1 (θ0 ε1 + ε2)

ε21= θ0 +

ε2ε1

where θ0 is the true value of θ. The ratio ε2/ε1 has the Cauchy distribution(show!). Therefore, the expectation of the difference θ∗2 − θ0 does not exist.

For n > 2, the expectation of θ ∗n − θ0 exists but is not zero, so that

the MLE is biased. We skip the proofs of these technical and less importantfacts. What is more important is that θ ∗

n is asymptotically normal as n →∞. We will try to explain this fact at the intuitive level. Note that

θ ∗n =

∑ni=1 Xi−1Xi∑ni=1 X2

i−1

=X0(θ0X0 + ε1) +X1(θ0X1 + ε2) + · · ·+Xn−1(θ0Xn−1 + εn)

X20 +X2

1 + · · ·+X2n−1

(6.8) = θ0 +X1ε2 + · · ·+Xn−1εnX2

1 + · · ·+X2n−1

.

Page 88: Mathematical Statistics - 213.230.96.51:8090

6.4. Sequential Estimation in the Autoregressive Model 77

By Lemma 6.14 (iv), since |θ| < 1, the covariance between two remoteterms Xi and Xi+j decays exponentially fast as j → ∞. It can be shownthat the Law of Large Numbers (LLN) applies to this process exactly as inthe case of independent random variables. By the LLN, for all large n, wecan substitute the denominator in the latter formula by its expectation

E[X21 + · · ·+X2

n−1 ] =n−1∑

i=1

Var[Xi ] ∼ nσ2∞ = n

σ2

1 − θ20.

Thus, on a heuristic level, we may say that

√n (θ ∗

n − θ0) ∼√nX1 ε2 + · · ·+Xn−1 εn

nσ2/(1− θ20)

=1− θ20σ2

X1 ε2 + · · ·+Xn−1 εn√n

.

If the Xi’s were independent, then(X1ε2 + · · · + Xn−1εn

)/√n would

satisfy the Central Limit Theorem (CLT). It turns out, and it is far frombeing trivial, that we can work with the Xi’s as if they were independent,and the CLT still applies. Thus, the limiting distribution of this quotient isnormal with mean zero and the limiting variance

limn→∞

Var[ X1 ε2 + · · ·+Xn−1 εn√

n

]= lim

n→∞1

n

n−1∑

i=1

E[ (

Xiεi+1

)2 ]

= limn→∞

1

n

n−1∑

i=1

E[X2

i

]E[ε2i+1

]=

σ4

n(1− θ20)limn→∞

n−1∑

i=1

(1− θ2i0 )

=σ4

n(1− θ20)limn→∞

(n− 1− θ2n0

1− θ20

)=

σ4

1− θ20.

It partially explains why the difference√n (θ ∗

n−θ0) is asymptotically normalwith mean zero and variance

( 1− θ20σ2

)2 σ4

1− θ20= 1− θ20,

that is,√n (θ ∗

n − θ0) → N(0, 1− θ20

)as n → ∞.

Note that the limiting variance is independent of σ2, the variance of thenoise.

Page 89: Mathematical Statistics - 213.230.96.51:8090

78 6. Sequential Estimators

6.4.2. On-Line Estimator. After obtaining a general idea about the MLEand its asymptotic performance, we are ready to try a sequential estimationprocedure, termed an on-line estimation.

Note that from (6.8) the difference θ ∗n − θ0 can be presented in the form

θ ∗n − θ0 =

∑ni=2 υn,i εi with the weights υn,i = Xi−1/(X

21 + · · ·+X2

n−1 ).If the υn,i’s were deterministic, then the variance of the difference θ ∗

n − θ0would be

σ2n∑

i=2

υ2n,i = σ2/(X21 + · · ·+X2

n−1).

In a sense, the sum X21+· · ·+X2

n−1 plays the role of the information number:the larger it is, the smaller the variance.

The above argument brings us to an understanding of how to construct asequential estimator of θ, called an on-line estimator. Let us stop collectingdata at a random time τ when the sum X2

1 + · · ·+X2τ reaches a prescribed

level H > 0, that is, define the Markov stopping time τ by (see Exercise6.39)

τ = min{t : X2

1 + · · ·+X2t > H

}.

In the discrete case with normal noise, the overshoot X21 + · · ·+X2

t −His positive with probability 1. The stopping time τ is a random samplesize, and the level H controls the magnitude of its expected value, Eθ0 [ τ ]increases as H grows (see Exercise 6.39). Put

ΔH = H − (X21 + · · ·+X2

τ−1 ) and η =ΔH

Xτ.

The definition of η makes sense because the random variable Xτ differs fromzero with probability 1.

Define an on-line estimator of θ0 by

(6.9) θτ =1

H

( τ∑

i=1

Xi−1Xi + η Xτ+1

).

This is a sequential version of the MLE (6.8). Apparently, if ΔH (and,

respectively, η) were negligible, then θτ would be the MLE with n substituted

by τ. Note that θτ is not Fτ -measurable because it depends on one extraobservation, Xτ+1. This is the tribute to the discrete nature of the model.As shown below, due to this extra term, the estimator (6.9) is unbiased.

Lemma 6.15. The estimator θτ given by (6.9) is an unbiased estimator ofθ0, and uniformly over θ0 ∈ Θα, its variance does not exceed σ2/H.

Page 90: Mathematical Statistics - 213.230.96.51:8090

6.4. Sequential Estimation in the Autoregressive Model 79

Proof. First, we show that the estimator is unbiased. Note that

θτ =1

H

[ τ∑

i=1

Xi−1 ( θ0Xi−1 + εi ) + η ( θ0Xτ + ετ+1 )]

=1

H

[θ0

τ∑

i=1

X2i−1 +

τ∑

i=1

Xi−1 εi + θ0 η Xτ + η ετ+1

].

By definition, η Xτ = ΔH and ΔH +∑τ

i=1 X2i−1 = H, hence,

(6.10) θτ = θ0 +1

H

[ τ∑

i=1

Xi−1 εi + η ετ+1

].

Therefore, the bias of θτ is equal to

(6.11) Eθ0

[θτ − θ0

]=

1

H

(Eθ0

[ τ∑

i=1

Xi−1 εi

]+ Eθ0

[η ετ+1

] ),

and it suffices to show that both expectations are equal to zero. Start withthe first one:

Eθ0

[ τ∑

i=1

Xi−1 εi

]= Eθ0

[X1 ε2 I(τ = 2) + (X1 ε2 +X2 ε3) I(τ = 3) + . . .

]

= Eθ0

[ ∞∑

i=1

Xi−1 εi I(τ ≥ i)]=

∞∑

i=1

Eθ0

[Xi−1 εi I(τ ≥ i)

].

We already know that the random variable I(τ ≥ i) is Fi−1-measurable, andso is Xi−1. The random variable εi is independent of Fi−1, which yields thateach term in this infinite sum is equal to zero,

Eθ0

[Xi−1 εi I(τ ≥ i)

]= Eθ0

[Xi−1 I(τ ≥ i)

]Eθ0

[εi]= 0.

The second expectation Eθ0

[η ετ+1

]requires more attention. Note that η is

Fτ -measurable. Indeed, for any integer t and for any a ≤ b, the intersectionof the random events{a ≤ η ≤ b

}∩{τ = t

}={aXt ≤ H−(X2

1 + · · · + X2t−1) ≤ bXt

}∩{τ = t

}

is Ft-measurable, because both random events on the right-hand side areFt-measurable. Hence for any t, the random variable η I(τ = t) is Ft-measurable. The variable εt+1, on the other hand, is independent of Ft.Thus,

Eθ0

[η ετ+1

]=

∞∑

t=1

Eθ0

[η εt+1 I(τ = t)

]

=

∞∑

t=1

Eθ0

[η I(τ = t)

]Eθ0

[εt+1

]= 0.

Page 91: Mathematical Statistics - 213.230.96.51:8090

80 6. Sequential Estimators

It follows that either sum in (6.11) is equal to zero, which means that the

estimator θτ is unbiased.

Next, we want to estimate the variance of θτ . Using the representation(6.10) of θτ , we need to verify that

Eθ0

[ ( τ∑

i=1

Xi−1 εi + η ετ+1

)2 ] ≤ σ2H.

The left-hand side of this inequality is equal to

(6.12) Eθ0

[ ( τ∑

i=1

Xi−1 εi)2

+ 2( τ∑

i=1

Xi−1 εi)η ετ+1 + η2 ε2τ+1

].

Consider the last term. We know that η is Fτ -measurable. Hence

Eθ0

[η2 ε2τ+1

]=

∞∑

t=1

Eθ0

[η2 ε2t+1 I(τ = t)

]

=

∞∑

t=1

Eθ0

[η2 I(τ = t)

]Eθ0

[ε2t+1

]

= σ2∞∑

t=1

Eθ0

[η2 I(τ = t)

]= σ2

Eθ0

[η2].

In a similar way, we can show that the expectation of the cross-term in(6.12) is zero. The analysis of the first term, however, takes more steps. Itcan be written as

Eθ0

[ ( τ∑

i=1

Xi−1 εi)2 ]

= Eθ0

[(X1 ε2)

2I(τ = 2) + (X1 ε2 +X2 ε3)

2I(τ = 3)

+ (X1 ε2 +X2 ε3 +X3 ε4)2I(τ = 4) + . . .

]= Eθ0

[X2

1 ε22 I(τ = 2)

+(X2

1 ε22 + X2

2 ε23

)I(τ = 3) +

(X2

1ε22 + X2

2 ε23 + X2

3 ε24

)I(τ = 4) + · · ·

]

+2Eθ0

[(X1 ε2) (X2 ε3) I(τ ≥ 3) + (X1 ε2 +X2 ε3) (X3 ε4) I(τ ≥ 4) + · · ·

]

= E1 + 2 E2where

E1 = Eθ0

[X2

1 ε22 I(τ = 2) +

(X2

1 ε22 + X2

2 ε23

)I(τ = 3)

+(X2

1ε22 + X2

2 ε23 + X2

3 ε24

)I(τ = 4) + · · ·

]

= σ2Eθ0

[X2

1 I(τ = 2)+ (X21+X2

2 ) I(τ = 3)+ (X21+X2

2+X23 ) I(τ = 4)+ · · ·

]

= σ2Eθ0

[ τ∑

i=1

X2i−1

]

Page 92: Mathematical Statistics - 213.230.96.51:8090

6.4. Sequential Estimation in the Autoregressive Model 81

and

E2 = Eθ0

[(X1 ε2) (X2 ε3) I(τ ≥ 3) + (X1 ε2 +X2 ε3) (X3 ε4) I(τ ≥ 4) + · · ·

]

= Eθ0

[(X1 ε2)(X2) I(τ ≥ 3)

]Eθ0

[ε3]

+Eθ0

[(X1 ε2 +X2 ε3)(X3) I(τ ≥ 3)

]Eθ0

[ε4]+ · · · = 0.

Combining all these estimates, we find that the expectation in (6.12) is equalto

Eθ0

[ ( τ∑

i=1

Xi−1 εi)2

+ 2( τ∑

i=1

Xi−1 εi)η ετ+1 + η2 ε2τ+1

]

= σ2Eθ0

[ τ∑

i=1

X2i−1

]+ σ2

Eθ0

[η2].

From the definition of ΔH,∑τ

i=1X2i−1 = H − ΔH. Also, recall that η =

H/Xτ . Thus, we continue

= σ2Eθ0

[H −ΔH + η2

]= σ2

(H − Eθ0

[ΔH − η2

] )

= σ2(H − Eθ0

[ΔH − (ΔH/Xτ )

2] )

= σ2(H − Eθ0

[ΔH ( 1−ΔH/X2

τ )] )

.

Note that at the time τ − 1, the value of the sum X21 + · · · + X2

τ−1 doesnot exceed H, which yields the inequality ΔH ≥ 0. In addition, by thedefinition of τ ,

∑τi=1X

2i−1 +X2

τ > H, which implies that

ΔH = H −τ∑

i=1

X2i−1 < X2

τ .

Hence, ΔH/X2τ < 1. Thus, ΔH and ( 1 − ΔH/X2

τ ) are positive randomvariables with probability 1, and therefore,

Eθ0

[ ( τ∑

i=1

Xi−1 εi + η ετ+1

)2 ]

�(6.13) = σ2(H − Eθ0

[ΔH ( 1−ΔH/X2

τ )] )

≤ σ2H.

The statement of Lemma 6.15 is true for any continuous distribution ofthe noise εi, if it has the zero mean and variance σ2. The continuity of thenoise guarantees that the distribution of Xi is also continuous, and thereforeη = ΔH/Xτ is properly defined. If we assume additionally that the noisehas a bounded distribution, that is, | εi | ≤ C0 for some positive constantC0, then for any i the random variables |Xi |’s turn out to be bounded aswell. Under this additional assumption, we can get a lower bound on thevariance of θτ .

Page 93: Mathematical Statistics - 213.230.96.51:8090

82 6. Sequential Estimators

Theorem 6.16. If | εi | ≤ C0, E[ εi ] = 0, and Var[ εi ] = σ2, then

Varθ0[θτ]≥ σ2

H− σ2C2

0

4H2 ( 1− | θ0 | )2.

Proof. From Lemma 6.14 (i), we find that

|Xi | ≤ | εi | + | θ | | εi−1 | + | θ2 | | εi−2 | + · · · + | θi−2 | | ε2 | + | θi−1 | | ε1 |

≤ C0

(1 + | θ | + | θ2 | + · · · + | θi−2 | + | θi−1 |

)≤ C0 / ( 1− | θ0 | ).

In the proof of Lemma 6.15 we have shown (see (6.9)-(6.13)) that

Varθ0[θτ]=

σ2

H2

(H − Eθ0

[ΔH(1−ΔH/X2

τ )])

where 0 ≤ ΔH ≤ X2τ . Now, the parabola ΔH(1 − ΔH/X2

τ ) is maximizedat ΔH = X2

τ /2, and therefore ΔH(1−ΔH/X2τ ) ≤ X2

τ /4. Finally, we havethat

Eθ0

[ΔH(1−ΔH/X2

τ )]≤ 1

4Eθ0

[X2

τ

]≤ C2

0

4(1− |θ0|)2.

The result of the theorem follows. �

Remark 6.17. Note that the bound for the variance of θτ in Theorem 6.16is pointwise, that is, the lower bound depends on θ0. To declare a uniformbound for all θ0 ∈ Θα = {θ : | θ0 | ≤ 1− α }, we take the minimum of bothsides:

infθ0∈Θα

Varθ0[θτ]≥ σ2

H− σ2C2

0

4H2 α2.

Combining this result with the uniform upper bound in Lemma 6.15, we getthat as H → ∞,

infθ0∈Θα

Varθ0[θτ]=

σ2

H

(1 +O(H−1)

). �

Page 94: Mathematical Statistics - 213.230.96.51:8090

Exercises 83

Exercises

Exercise 6.36. Show that an empty set is F -measurable.

Exercise 6.37. Check that the random variables τ defined in Example 6.2are stopping times.

Exercise 6.38. Show that the variables τ specified in Example 6.3 arenon-stopping times.

Exercise 6.39. Let Xi’s be independent identically distributed randomvariables, and let τ be defined as the first time when the sum of squaredobservations hits a given positive level H,

τ = min{ i : X21 + · · ·+X2

i > H }.(i) Show that τ is a Markov stopping time.(ii) Suppose E[X2

1 ] = σ2. Prove that E[ τ ] > H/σ2. Hint: Use Wald’s firstidentity.

Exercise 6.40. Prove Wald’s second identity formulated as follows. Sup-pose X1, X2, . . . are independent identically distributed random variableswith finite mean and variance. Then

Var[X1 + · · ·+Xτ − E[X1 ] τ

]= Var[X1 ]E[ τ ].

Exercise 6.41. Suppose that Xi’s are independent random variables, Xi ∼N (θ, σ2). Let τ be a stopping time such that Eθ[ τ ] = h, where h is adeterministic constant.(i) Show that θτ = (X1 + · · · +Xτ )/h is an unbiased estimator of θ. Hint:Apply Wald’s first identity.(ii) Show that

Varθ[ θτ ] ≤2σ2

h+

2θ2Varθ[ τ ]

h2.

Hint: Apply Wald’s second identity.

Exercise 6.42. Prove Lemma 6.14.

Page 95: Mathematical Statistics - 213.230.96.51:8090
Page 96: Mathematical Statistics - 213.230.96.51:8090

Chapter 7

Linear ParametricRegression

7.1. Definitions and Notations

An important research area in many scientific fields is to find a functionalrelation between two variables, say X and Y , based on the experimentaldata. The variable Y is called a response variable (or, simply, response),while X is termed an explanatory variable or a predictor variable.

The relation between X and Y can be described by a regression equation

(7.1) Y = f(X) + ε

where f is a regression function, and ε is a N (0, σ2) random error indepen-dent of X. In this chapter we consider only parametric regression modelsfor which the algebraic form of the function f is assumed to be known.

Remark 7.1. In this book we study only simple regressions where there isonly one predictor X. �

Let f be a sum of known functions g0, . . . , gk with unknown regressioncoefficients θ0, . . . , θk,

(7.2) f = θ0 g0 + θ1 g1 + · · · + θk gk.

It is convenient to have a constant intercept θ0 in the model, thus, withoutloss of generality, we assume that g0 = 1. Note that the function f is linearin parameters θ0, . . . , θk.

85

Page 97: Mathematical Statistics - 213.230.96.51:8090

86 7. Linear Parametric Regression

Plugging (7.2) into the regression equation (7.1), we obtain a generalform of a linear parametric regression model

(7.3) Y = θ0 g0(X) + θ1 g1(X) + · · · + θk gk(X) + ε

where the random error ε has a N (0, σ2) distribution and is independent ofX.

Example 7.2. Consider a polynomial regression, for which g0(X) = 1,g1(X) = X, . . . , gk(X) = Xk. Here the response variable Y is a polynomialfunction of X corrupted by a random error ε ∼ N (0, σ2),

Y = θ0 + θ1X + θ2X2 + · · · + θk X

k + ε. �

Suppose the observed data consist of n pairs of observations (xi, yi), i =1, . . . , n. The collection of the observations of the explanatory variable X,denoted by X = {x1, . . . , xn}, is called a design. According to (7.1), thedata points satisfy the equations

(7.4) yi = f(xi) + εi, i = 1, . . . , n,

where the εi’s are independent N (0, σ2) random variables. In particular,the linear parametric regression model (7.3) for the observations takes theform

(7.5) yi = θ0 g0(xi) + θ1 g1(xi) + · · · + θk gk(xi) + εi, i = 1, . . . , n,

where the εi’s are independent N (0, σ2).

A scatter plot is the collection of data points with the coordinates (xi, yi ),for i = 1, . . . , n. A typical scatter plot for a polynomial regression is shownin Figure 3.

0 X

Y

f(X)(xi, yi)

•••

••

• •••

εi

••

•• •

••

Figure 3. A scatter plot with a fitted polynomial regression function.

Page 98: Mathematical Statistics - 213.230.96.51:8090

7.2. Least-Squares Estimator 87

It is convenient to write (7.5) using vectors. To this end, introducecolumn vectors

y =(y1, . . . , yn

)′, ε =

(ε1, . . . , εn

)′

and

gj =(gj(x1), . . . , gj(xn)

)′, j = 0, . . . k.

Here the prime indicates the operation of vector transposition. In thisnotation, the equations (7.5) turn into

(7.6) y = θ0 g0 + θ1 g1 + · · · + θk gk + ε

where ε ∼ Nn(0, σ2 In). That is, ε has an n-variate normal distribution

with mean 0 = (0, . . . , 0)′ and covariance matrix E[ε ε′

]= σ2 In, where In

is the n× n identity matrix.

Denote a linear span-space generated by the vectors g0, . . . , gk by

S = span{g0, . . . ,gk

}⊆ R

n.

The vectors g0, . . . ,gk are assumed to be linearly independent, so that thedimension of the span-space dim(S) is equal to k + 1. Obviously, it mayhappen only if n ≥ k + 1. Typically, n is much larger than k.

Example 7.3. For the polynomial regression, the span-space S is generatedby the vectors g0 = (1, . . . , 1)′, g1 = (x1, . . . , xn)

′, . . . ,gk = (xk1, . . . , xkn)

′.For distinct values x1, . . . , xn, n ≥ k+1, these vectors are linearly indepen-dent, and the assumption dim(S) = k+1 is fulfilled (see Exercise 11.79). �

Define an n× (k+ 1) matrix G =[g0, . . . ,gk

], called a design matrix,

and let θ =(θ0, . . . , θk

)′denote the vector of the regression coefficients.

The linear regression (7.6) can be written in the matrix form

(7.7) y = Gθ + ε, ε ∼ Nn(0, σ2 In).

7.2. Least-Squares Estimator

In the system of equations (7.5) (or, equivalently, in its vector form (7.6)),the parameters θ0, . . . , θk have unknown values, which should be estimatedfrom the observations (xi, yi), i = 1, . . . , n.

Let y =(y1, . . . , yn)

′ denote the orthogonal projection of y on the span-space S (see Figure 4). This vector is called a fitted (or predicted) responsevector. As any vector in S, this projection is a linear combination of vectorsg0, g1, . . . , gk, that is, there exist some constants θ0, θ1, . . . , θk such that

(7.8) y = θ0 g0 + θ1 g1 + · · · + θk gk.

These coefficients θ0, θ1, . . . , θk may serve as estimates of the unknown pa-rameters θ0, θ1, . . . , θk. Indeed, in the absence of the random error in (7.6),

Page 99: Mathematical Statistics - 213.230.96.51:8090

88 7. Linear Parametric Regression

that is, when ε = 0, we have y = y which implies that θj = θj for allj = 0, . . . , k.

S0

�y

�Gθ

�y

r

ε

Figure 4. Geometric interpretation of the linear parametric regression.

The problem of finding the estimators θ0, θ1, . . . , θk can be looked at asthe minimization problem

(7.9)∥∥y − y

∥∥2 =∥∥y − ( θ0 g0 + · · · + θk gk )

∥∥2 → minθ0,..., θk

.

Here ‖ · ‖ denotes the Euclidean norm of a vector in Rn,

‖y − y ‖2 = (y1 − y1)2 + · · · + (yn − yn)

2.

The estimation procedure consists of finding the minimum of the sum ofsquares of the coordinates, thus, the estimators θ0, . . . , θk are referred to asthe least-squares estimators.

The easiest way to solve the minimization problem is through the geo-metric interpretation of linear regression. In fact, by the definition of aprojection, the vector y− y is orthogonal to every vector in the span-spaceS. In particular, its dot product with any basis vector in S must be equalto zero,

(7.10)(y − y, gj

)= 0, j = 0, . . . , k.

Substituting y in (7.10) by its expression from (7.8), we arrive at the system

of k + 1 linear equations with respect to the estimators θ0, . . . , θk,(y, gj

)− θ0

(g0, gj

)− · · · − θk

(gk, gj

)= 0, j = 0, . . . , k.

These equations can be rewritten in a standard form, which are known as asystem of normal equations,

(7.11) θ0(g0, gj

)+ · · · + θk

(gk, gj

)=(y, gj

), j = 0, . . . , k.

Let θ =(θ0, . . . , θk

)′be the vector of estimated regression coefficients.

Then we can write equations (7.11) in the matrix form

(7.12)(G′G

)θ = G′y.

Page 100: Mathematical Statistics - 213.230.96.51:8090

7.3. Properties of the Least-Squares Estimator 89

By our assumption, the (k+1)× (k+1) matrix G′G, has a full rank k+1,and therefore is invertible. Thus, the least-squares estimator of θ is theunique solution of the normal equations (7.12),

(7.13) θ =(G′G

)−1G′y.

Remark 7.4. Three Euclidean spaces are involved in the linear regression.The primary space is the (X,Y )-plane where observed as well as fitted valuesmay be depicted. Another is the space of observations Rn that includes thelinear subspace S. And the third space is the space R

k+1 that contains thevector of the regression coefficients θ as well as its least-squares estimator θ.Though the latter two spaces play an auxiliary role in practical regressionanalysis, they are important from the mathematical point of view. �

7.3. Properties of the Least-Squares Estimator

Consider the least-squares estimator θ =(θ0, . . . , θk

)′of the vector of the

true regression coefficients θ =(θ0, . . . , θk

)′computed by formula (7.13).

In this section, we study the properties of this estimator.

Recall that we denoted by X = (x1, . . . , xn) the design in the regres-sion model. The explanatory variable X may be assumed deterministic, orrandom with a certain distribution. In what follows, we use the notationEθ[ · | X ] and Varθ[ · | X ] for the conditional expectation and variance withrespect to the distribution of the random error ε, given the design X . Aver-aging over both distributions, ε’s and X ’s, will be designated by Eθ[ · ]. Forthe deterministic designs, we use the notation Eθ[ · | X ] only if we want toemphasize the dependence on the design X .

Theorem 7.5. For a fixed design X , the least-squares estimator θ has a(k+1)-variate normal distribution with mean θ (is unbiased) and covariance

matrix Eθ

[(θ − θ)(θ − θ)′ | X

]= σ2

(G′G

)−1.

Proof. According to the matrix form of the linear regression (7.7), theconditional mean of y, given the design X , is Eθ

[y | X

]= Gθ, and the

conditional covariance matrix of y is equal to

[ (y −Gθ

) (y −Gθ

)′ | X]= Eθ

[ε ε′ | X

]= σ2 In.

Thus, the conditional mean of θ, given the design X , is calculated as

[θ | X

]= Eθ

[ (G′G

)−1G′y | X

]

=(G′G

)−1G′

[y | X

]=(G′G

)−1G′ Gθ = θ.

To find an expression for the conditional covariance matrix of θ, notice first

that θ − θ =(G′G

)−1G′(y −Gθ). Thus,

[(θ − θ)(θ − θ)′ | X

]

Page 101: Mathematical Statistics - 213.230.96.51:8090

90 7. Linear Parametric Regression

= Eθ

[ ( (G′G

)−1G′(y −Gθ)

)( (G′G

)−1G′(y −Gθ)

)′| X]

=(G′G

)−1G′Eθ

[(y −Gθ)(y −Gθ)′ | X

]G(G′G

)−1

=(G′G

)−1G′ σ2 In G

(G′G

)−1= σ2

(G′G

)−1. �

To ease the presentation, we study the regression on the interval [0, 1],that is, we assume that the regression function f(x) and all the componentsin the linear regression model, g0(x), . . . , gk(x), are defined for x ∈ [0, 1].The design points xi, i = 1, . . . , n, also belong to this interval.

Define the least-squares estimator of the regression function f(x) in (7.2),at any point x ∈ [0, 1], by

(7.14) fn(x) = θ0 g0(x) + · · · + θk gk(x).

Here the subscript n indicates that the estimation is based on n pairs ofobservations (xi, yi), i = 1, . . . , n.

A legitimate question is how close fn(x) is to f(x)? We try to answerthis question using two different loss functions. The first one is the quadraticloss function computed at a fixed point x ∈ [0, 1],

(7.15) w(fn − f

)=(fn(x)− f(x)

)2.

The risk with respect to this loss is called the mean squared risk at a pointor mean squared error (MSE).

The second loss function that we consider is the mean squared differenceover the design points

(7.16) w(fn − f

)=

1

n

n∑

i=1

(fn(xi) − f(xi)

)2.

Note that this loss function is a discrete version of the integral L2-norm,

‖ fn − f ‖22 =

∫ 1

0

(fn(x) − f(x)

)2dx.

The respective risk is a discrete version of the mean integrated squared error(MISE).

In this section, we study the conditional risk Eθ

[w(fn − f) | X

], given

the design X . The next two lemmas provide computational formulas for theMSE and discrete MISE, respectively.

Introduce the matrix D = σ2 (G′G)−1 called the covariance matrix.Note that D depends on the design X , and this dependence can be sophisti-cated. In particular, that if the design X is random, this matrix is randomas well.

Page 102: Mathematical Statistics - 213.230.96.51:8090

7.3. Properties of the Least-Squares Estimator 91

Lemma 7.6. For a fixed design X , the estimator fn(x) is an unbiased es-timator of f(x) at any x ∈ [0, 1], so that its MSE equals the variance of

fn(x),

Varθ[fn(x) | X

]= Eθ

[ (fn(x)− f(x)

)2 | X]=

k∑

l,m=0

Dl,m gl(x) gm(x),

where Dl,m denotes the (l,m)-th entry of the covariance matrix D.

Proof. By Theorem 7.5, the least-squares estimator θ is unbiased. Thisimplies the unbiasedness of the estimator fn(x). To see that, write

[fn(x) | X

]= Eθ

[θ0 | X

]g0(x) + · · · + Eθ

[θk | X

]gk(x)

= θ0 g0(x) + · · · + θk gk(x) = f(x).

Also, the covariance matrix of θ is D, and therefore the variance of fn(x)can be written as

[ (fn(x)−f(x)

)2 | X]= Eθ

[ ((θ0−θ0) g0(x)+ · · ·+(θk−θk) gk(x)

)2 | X]

=k∑

l,m=0

[(θl − θl)(θm − θm)|X

]gl(x)gm(x) =

k∑

l,m=0

Dl,mgl(x)gm(x). �

Lemma 7.7. For a fixed design X , the mean squared difference

1

n

n∑

i=1

(fn(xi) − f(xi)

)2= (σ2/n)χ2

k+1

where χ2k+1 denotes a chi-squared random variable with k + 1 degrees of

freedom. In particular, the MISE equals to

[ 1n

n∑

i=1

(fn(xi) − f(xi)

)2 | X]=

σ2(k + 1)

n.

Proof. Applying the facts that σ2G′G = D−1, and that the matrix D issymmetric and positive definite (therefore, D1/2 exists), we have the equa-tions

1

n

n∑

i=1

(fn(xi) − f(xi)

)2=

1

n‖G ( θ − θ ) ‖2

=1

n

(G ( θ − θ )

)′ (G ( θ − θ )

)=

1

n( θ − θ )′G′G ( θ − θ )

= σ2 1

n( θ − θ )′D−1( θ − θ ) =

σ2

n‖D−1/2 ( θ − θ ) ‖2,

where by ‖ · ‖ we mean the Euclidean norm in the Rn space of observations.

Page 103: Mathematical Statistics - 213.230.96.51:8090

92 7. Linear Parametric Regression

By Theorem 7.5, the (k+1)-dimensional vector D−1/2 (θ− θ) has inde-pendent standard normal coordinates. The result of the proposition followsfrom the definition of the chi-squared distribution. �

Note that the vector with the components fn(xi) coincides with y, theprojection of y on the span-space S, that is,

yi = fn(xi), i = 1, . . . , n.

Introduce the vector r = y − y. The coordinates of this vector, calledresiduals, are the differences

ri = yi − yi = yi − fn(xi), i = 1, . . . , n.

In other words, residuals are deviations of the observed responses from thepredicted ones evaluated at the design points.

Graphically, residuals can be visualized in the data space Rn. The vectorof residuals r, plotted in Figure 4, is orthogonal to the span-space S. Also,the residuals ri’s can be depicted on a scatter plot (see Figure 5).

0 X

Y

(xi, yi)◦

(xi, yi)

ri

• fn(X)

Figure 5. Residuals shown on a schematic scatter plot.

In the next lemma, we obtain the distribution of the squared norm ofthe residual vector r for a fixed design X .

Lemma 7.8. For a given design X , the sum of squares of the residuals

r21 + · · · + r2n = ‖ r‖2 = ‖y − y ‖2 = σ2 χ2n−k−1,

where χ2n−k−1 denotes a chi-squared random variable with n− k − 1 degrees

of freedom.

Page 104: Mathematical Statistics - 213.230.96.51:8090

7.4. Asymptotic Analysis of the Least-Squares Estimator 93

Proof. The squared Euclidean norm of the vector of random errors admitsthe partition

‖ ε ‖2 = ‖y −Gθ ‖2 = ‖y − y + y − Gθ ‖2

= ‖y − y ‖2 + ‖ y −Gθ ‖2 = ‖ r ‖2 + ‖ y −Gθ ‖2.Here the cross term is zero, because it is a dot product of the residual vectorr and the vector y−Gθ that lies in the span-space S. Moreover, these twovectors are independent (see Exercise 7.46).

The random vector ε hasNn(0, σ2 In) distribution, implying that ‖ ε ‖2 =

σ2 χ2n , where χ2

n denotes a chi-squared random variable with n degrees offreedom. Also, by Lemma 7.7,

‖ y −Gθ ‖2 =n∑

i=1

(fn(xi) − f(xi)

)2= σ2 χ2

k+1

where χ2k+1 has a chi-squared distribution with k + 1 degrees of freedom.

Taking into account that vectors r and y −Gθ are independent, it canbe shown (see Exercise 7.47) that ‖ r ‖2 has a chi-squared distribution withn− (k + 1) degrees of freedom. �

7.4. Asymptotic Analysis of the Least-Squares Estimator

In this section we focus on describing asymptotic behavior of the least-squares estimator θ as the sample size n goes to infinity. This task is com-plicated by the fact that θ depends on the design X = {x1, . . . , xn}. Thus,we can expect the existence of a limiting distribution only if the design isgoverned by some regularity conditions.

7.4.1. Regular Deterministic Design. Take a continuous strictly posi-tive probability density p(x), 0 ≤ x ≤ 1, and consider the cumulative dis-tribution function FX(x) =

∫ x0 p(t) dt. Define a sequence of regular deter-

ministic designs Xn = {xn,1, . . . , xn,n } where xi,n is the (i/n)-th quantileof this distribution,

(7.17) FX(xn,i) =i

n, i = 1, . . . , n.

Equivalently, the xn,i’s satisfy the recursive equations

(7.18)

∫ xn,i

xn,i−1

p(x) dx =1

n, i = 1, . . . , n, xn,0 = 0.

It is important to emphasize that the distances between consecutive pointsin a regular design have magnitude O(1/n) as n → ∞. Typical irregulardesigns that are avoided in asymptotic analysis have data points that are

Page 105: Mathematical Statistics - 213.230.96.51:8090

94 7. Linear Parametric Regression

too close to each other (concentrated around one point, or even coincide),or have big gaps between each other, or both.

For simplicity we suppress the dependence on n of the regular designpoints, that is, we write xi instead of xn,i, i = 1, . . . , n.

Example 7.9. The data points that are spread equidistantly on the unitinterval, xi = i/n, i = 1, . . . , n, constitute a regular design, called uniformdesign, since these points are (i/n)-th quantiles of the standard uniformdistribution. �

It can be shown (see Exercise 7.48) that in the case of a regular designcorresponding to a probability density p(x), for any continuous functiong(x), the Riemann sum converges to the integral

(7.19)1

n

n∑

i=1

g(xi) →∫ 1

0g(x) p(x) dx as n → ∞.

If the functions g0, g1, . . . , gk in the linear regression model (7.5) arecontinuous, and the design points are regular, then the convergence in (7.19)implies the existence of the entrywise limits of the matrix (1/n)D−1 asn → ∞, that is, for any l and m such that 0 ≤ l ≤ m ≤ k,

limn→∞

1

nD−1

l,m = limn→∞

σ2

n(G′G )l,m

= limn→∞

σ2

n

(gl(x1) gm(x1) + · · · + gl(xn) gm(xn)

)

(7.20) = σ2

∫ 1

0gl(x) gm(x) p(x) dx.

Denote by D−1∞ the matrix with the elements σ2

∫ 10 gl(x) gk(x) p(x) dx.

Assume that this matrix is positive definite. Then its inverse D∞, called thelimiting covariance matrix, exists, and the convergence takes place nD →D∞.

Example 7.10. Consider a polynomial regression model with the uniformdesign on [0, 1], that is, the regular design with the constant probabilitydensity p(x) = 1, 0 ≤ x ≤ 1. The matrix D−1

∞ has the entries

(7.21) σ2

∫ 1

0xl xm dx =

σ2

1 + l +m, 0 ≤ l, m ≤ k.

This is a positive definite matrix, and hence the limiting covariance matrixD∞ is well defined (see Exercise 7.49). �

We are ready to summarize our findings in the following theorem.

Page 106: Mathematical Statistics - 213.230.96.51:8090

7.4. Asymptotic Analysis of the Least-Squares Estimator 95

Theorem 7.11. If X is a regular deterministic design, and D∞ exists, then√n(θ − θ

)→ Nk+1(0, D∞ ) as n → ∞.

Next we study the limiting behavior of the least-squares estimator fndefined by (7.14). The lemma below shows that in the mean squared sense,

fn converges pointwise to the true regression function f at the rate O(1/√n)

as n → ∞. The proof of this lemma is assigned as an exercise (see Exercise7.50).

Lemma 7.12. Suppose X is a regular deterministic design such that D∞exists. Then at any fixed point x ∈ [0, 1], the estimator fn of the regressionfunction f is unbiased and its normalized quadratic risk satisfies the limitingequation

limn→∞

[ (√n ( fn(x)− f(x) )

)2 ]=

k∑

l,m=0

(D∞)l,m gl(x) gm(x),

where (D∞)l,m are the elements of the limiting covariance matrix D∞.

7.4.2. Regular Random Design. We call a random design regular, ifits points are independent with a common continuous and strictly positiveprobability density function p(x), x ∈ [0, 1].

Suppose the functions g0, . . . , gk are continuous on [0, 1]. By the Law ofLarge Numbers, for any element of the matrix D−1 = σ2G′G, we have thatwith probability 1 (with respect to the distribution of the random design),

limn→∞

σ2

n(G′G )l,m = lim

n→∞σ2

n

(gl(x1) gm(x1) + · · · + gl(xn) gm(xn)

)

(7.22) = σ2

∫ 1

0gl(x) gm(x) p(x) dx.

Again, as in the case of a regular deterministic design, we assume that the

matrix D−1∞ with the elements σ2

∫ 10 gl(x) gm(x) p(x) dx is positive definite,

so that its inverse matrix D∞ exists.

The essential difference between the random and deterministic designs isthat even in the case of a regular random design, for any given n, the matrixG′G can be degenerate with a positive probability (see Exercise 7.51). If

it happens, then for the sake of definiteness, we put θ = 0. Fortunately, ifthe functions g0, . . . , gk are continuous in [0, 1], then the probability of this“non-existence” is exponentially small in n as n → ∞. For the proofs of thefollowing lemma and theorem refer to Exercises 7.52 and 7.53.

Page 107: Mathematical Statistics - 213.230.96.51:8090

96 7. Linear Parametric Regression

Lemma 7.13. Assume that |g0|, . . . , |gk| ≤ C0, and that X = {x1, . . . , xn}is a regular random design. Then for any n, for however small δ > 0, andfor all l and m such that 0 ≤ l,m ≤ k, the following inequality holds:

P( ∣∣ 1

n

n∑

i=1

gl(xi)gm(xi)−∫ 1

0gl(x)gm(x) dx

∣∣ > δ)≤ 2 exp

{− δ2 n

2C40

}.

Assume that for a regular random design X , the estimator θ is properlydefined with probability 1. Then, as the next theorem shows, the distribu-tion of the normalized estimator

√n(θ − θ) is asymptotically normal.

Theorem 7.14. If X is a regular random design and D∞ exists, then asn → ∞,

√n(θ − θ

)converges in distribution to a Nk+1(0, D∞) random

variable.

Remark 7.15. An important conclusion is that the parametric least-squaresestimator fn is unbiased, and its typical rate of convergence under variousnorms and under regular designs is equal to O(1/

√n) as n → ∞. �

Exercises

Exercise 7.43. Consider the observations (xi , yi) in a simple linear regres-sion model,

yi = θ0 + θ1 xi + εi, i = 1, . . . , n,

where the εi’s are independent N (0 , σ2) random variables. Write down thesystem of normal equations (7.11) and solve it explicitly.

Exercise 7.44. Show that in a simple linear regression model (see Exercise

7.43), the minimum of the variance Varθ[fn(x) | X

]in Lemma 7.6 is attained

at x = x =∑n

i=1 xi/n.

Exercise 7.45. (i) Prove that in a simple linear regression model (seeExercise 7.43), the sum of residuals is equal to zero, that is,

∑ni=1 ri =∑n

i=1 (yi − yi) = 0.(ii) Consider a simple linear regression through the origin,

yi = θ1 xi + εi, i = 1, . . . , n

where the εi’s are independent N (0, σ2) random variables. Show by givingan example that the sum of residuals is not necessarily equal to zero.

Exercise 7.46. Show that (i) the vector of residuals r has a multivariatenormal distribution with mean zero and covariance matrix σ2 (In − H),where H = G(G′G)−1 G′ is called the hat matrix because of the identity

Page 108: Mathematical Statistics - 213.230.96.51:8090

Exercises 97

y = Hy.(ii) Argue that the vectors r and y −Gθ are independent.

Exercise 7.47. Let Z = X + Y where X and Y are independent. SupposeZ and X have chi-squared distributions with n and m degrees of freedom,respectively, where m < n. Prove that Y also has a chi-squared distributionwith n−m degrees of freedom.

Exercise 7.48. Show the convergence of the Riemann sum in (7.19).

Exercise 7.49. Show that the matrix with the elements given by (7.21) isinvertible.

Exercise 7.50. Prove Lemma 7.12.

Exercise 7.51. Let k = 1, and let g0 = 1; g1(x) = x if 0 ≤ x ≤ 1/2, andg1(x) = 1/2 if 1/2 < x ≤ 1. Assume that X is the uniform random designgoverned by the density p(x) = 1. Show that the system of normal equationsdoes not have a unique solution with probability 1/2n.

Exercise 7.52. Prove Lemma 7.13.

Exercise 7.53. Prove Theorem 7.14.

Exercise 7.54. For the regression function f = θ0 g0 + · · · + θk gk, showthat the conditional expectation of the squared L2-norm of the differencefn − f , given the design X , admits the upper bound

[‖ fn − f ‖22 | X

]≤ tr(D) ‖g ‖22

where the trace tr(D) = Eθ

[ ∑ki=0 (θi−θi)

2 | X]is the sum of the diagonal

elements of the covariance matrix D, and

‖g ‖22 =k∑

i=0

‖ gi ‖22 =k∑

i=0

∫ 1

0

(gi(x)

)2dx

is the squared L2-norm of the vector g = (g0, . . . , gk)′.

Page 109: Mathematical Statistics - 213.230.96.51:8090
Page 110: Mathematical Statistics - 213.230.96.51:8090

Part 2

NonparametricRegression

Page 111: Mathematical Statistics - 213.230.96.51:8090
Page 112: Mathematical Statistics - 213.230.96.51:8090

Chapter 8

Estimation inNonparametricRegression

8.1. Setup and Notations

In a nonparametric regression model the response variable Y and the ex-planatory variable X are related by the same regression equation (7.1) as ina parametric regression model,

(8.1) Y = f(X) + ε

with the random error ε ∼ N (0, σ2). However, unlike that in the parametricregression model, here the algebraic form of the regression function f isassumed unknown and must be evaluated from the data. The goal of thenonparametric regression analysis is to estimate the function f as a curve,rather than to estimate parameters of a guessed function.

A set of n pairs of observations (x1, y1), . . . , (xn, yn) satisfy the relation

(8.2) yi = f(xi) + εi, i = 1, . . . , n,

where the εi’s are independent N (0, σ2) random errors. For simplicity weassume that the design X = {x1, . . . , xn} is concentrated on [0, 1].

In nonparametric regression analysis, some assumptions are made a pri-ori on the smoothness of the regression function f . Let β ≥ 1 be an integer.We assume that f belongs to a Holder class of functions of smoothness β,denoted by Θ(β, L, L1). That is, we assume that (i) its derivative f (β−1) of

101

Page 113: Mathematical Statistics - 213.230.96.51:8090

102 8. Estimation in Nonparametric Regression

order β − 1 satisfies the Lipschitz condition with a given constant L,

| f (β−1)(x2) − f (β−1)(x1) | ≤ L |x2 − x1 |, x1, x2 ∈ [0, 1],

and (ii) there exists a constant L1 > 0 such that

max0≤x≤ 1

| f(x) | ≤ L1.

Example 8.1. If β = 1, the class Θ(1, L, L1) is a set of bounded Lipschitzfunctions. Recall that a Lipschitz function f satisfies the inequality

| f(x2) − f(x1) | ≤ L |x2 − x1 |

where L is a constant independent of x1 and x2. �

Sometimes we write Θ(β), suppressing the constants L and L1 in thenotation of the Holder class Θ(β, L, L1).

Denote by fn the nonparametric estimator of the regression function f .Since f is a function of x ∈ [0, 1], so should be the estimator. The latter,however, also depends on the data points. This dependence is frequentlyomitted in the notation,

fn(x) = fn(x ; (x1, y1), . . . , (xn, yn)

), 0 ≤ x ≤ 1.

To measure how close fn is to f , we consider the same loss functions asin Chapter 7, the quadratic loss function computed at a fixed point x ∈ [0, 1]specified in (7.15), and the mean squared difference over the design pointsgiven by (7.16). In addition, to illustrate particular effects in nonparametricestimation, we use the sup-norm loss function

w(fn − f) = ‖ fn − f ‖∞ = sup0≤ x≤ 1

∣∣ fn(x) − f(x)∣∣.

Note that in the nonparametric case, the loss functions are, in fact,functionals since they depend of f . For simplicity, we will continue callingthem functions. We denote the risk function by

Rn(f, fn) = Ef

[w(fn − f)

]

where the subscript f in the expectation refers to a fixed regression functionf . If the design X is random, we use the conditional expectation Ef [ · | X ]to emphasize averaging over the distribution of the random error ε.

When working with the difference fn − f , it is technically more conve-nient to consider separately the bias bn(x) = Ef

[fn(x)

]− f(x), and the

stochastic part ξn(x) = fn(x) − Ef

[fn(x)

]. Then the MSE or discrete

MISE is split into a sum (see Exercise 8.55),

(8.3) Rn(fn, f) = Ef

[w(fn − f)

]= Ef

[w(ξn)

]+ w(bn).

Page 114: Mathematical Statistics - 213.230.96.51:8090

8.2. Asymptotically Minimax Rate of Convergence. Definition 103

For the sup-norm loss function, the triangle inequality applies

Rn(fn, f) = Ef

[‖fn − f‖∞

]≤ Ef

[‖ξn‖∞

]+ ‖bn‖∞.

To deal with random designs, we consider the conditional bias and sto-chastic part of an estimator fn, given the design X ,

bn(x,X ) = Ef

[fn(x) | X ] − f(x)

and

ξn(x,X ) = fn(x) − Ef

[fn(x) | X

].

8.2. Asymptotically Minimax Rate of Convergence.Definition

We want to estimate the regression function in the most efficient way. Asa criterion of optimality we choose the asymptotically minimax rate of con-vergence of the estimator.

Consider a deterministic sequence of positive numbers ψn → 0 as n → ∞. Introduce a maximum normalized risk of an estimator fn with respect to a loss function w by

(8.4) rn(fn, w, ψn) = sup_{f ∈ Θ(β)} Ef [ w( (fn − f)/ψn ) ].

A sequence of positive numbers ψn is called an asymptotically minimax rate of convergence if there exist two positive constants r_* and r^* such that for any estimator fn, the maximum normalized risk rn(fn, w, ψn) is bounded from above and below,

(8.5) r_* ≤ lim inf_{n→∞} rn(fn, w, ψn) ≤ lim sup_{n→∞} rn(fn, w, ψn) ≤ r^*.

This very formal definition has a transparent interpretation. It implies that for any estimator fn and for all n large enough, the maximum of the risk is bounded from below,

(8.6) sup_{f ∈ Θ(β)} Ef [ w( (fn − f)/ψn ) ] ≥ r_* − ε,

where ε is an arbitrarily small positive number. On the other hand, there exists an estimator f*_n, called the asymptotically minimax estimator, the maximum risk of which is bounded from above,

(8.7) sup_{f ∈ Θ(β)} Ef [ w( (f*_n − f)/ψn ) ] ≤ r^* + ε.

Note that f*_n is not a single estimator, but rather a sequence of estimators defined for all sufficiently large n.


It is worth mentioning that the asymptotically minimax rate of convergence ψn is not uniquely defined but admits any multiplier that is bounded and separated away from zero. As we have shown in Chapter 7, a typical rate of convergence in a parametric regression model is O(1/√n). In nonparametric regression, on the other hand, the rates depend on the particular loss function and on the smoothness parameter β of the Hölder class of regression functions. We study these rates in the next chapters.

8.3. Linear Estimator

8.3.1. Definition. An estimator fn is called a linear estimator of f if, for any x ∈ [0, 1], there exist weights υ_{n,i}(x), which may also depend on the design points, υ_{n,i}(x) = υ_{n,i}(x, X), i = 1, . . . , n, such that

(8.8) fn(x) = Σ_{i=1}^n υ_{n,i}(x) yi.

Note that the linear estimator fn is a linear function of the response values y1, . . . , yn. The weight υ_{n,i}(x) determines the influence of the observation yi on the estimator fn(x) at the point x.

An advantage of the linear estimator (8.8) is that for a given design X, the conditional bias and variance are easily computable (see Exercise 8.56),

(8.9) bn(x, X) = Σ_{i=1}^n υ_{n,i}(x) f(xi) − f(x)

and

(8.10) Ef [ ξn²(x, X) | X ] = σ² Σ_{i=1}^n υ_{n,i}²(x).

These formulas are useful when either the design X is deterministic or integration over the distribution of a random design is not too difficult. Since the weights υ_{n,i}(x) may depend on the design points in a very intricate way, in general, averaging over the distribution of x1, . . . , xn is a complicated task.

The linear estimator (8.8) is not guaranteed to be unbiased. Even in the simplest case of a constant regression function f(x) = θ0, the linear estimator is unbiased if and only if the weights sum up to one,

bn(x, X) = Σ_{i=1}^n υ_{n,i}(x) θ0 − θ0 = θ0 ( Σ_{i=1}^n υ_{n,i}(x) − 1 ) = 0.


For a linear regression function f(x) = θ0 + θ1 x, the linear estimator is unbiased if and only if the following identity holds:

( Σ_{i=1}^n υ_{n,i}(x) − 1 ) θ0 + ( Σ_{i=1}^n υ_{n,i}(x) xi − x ) θ1 = 0,

which, under the condition that Σ_{i=1}^n υ_{n,i}(x) = 1, is tantamount to the identity

Σ_{i=1}^n υ_{n,i}(x) xi = x,   uniformly in x ∈ [0, 1].

If for any x ∈ [0, 1] the linear estimator (8.8) depends on all the design points x1, . . . , xn, it is called a global linear estimator of the regression function. We study global estimators later in this book.

An estimator (8.8) is called a local linear estimator of the regression function if the weights υ_{n,i}(x) differ from zero only for those i's for which the design points xi belong to a small neighborhood of x, that is, | xi − x | ≤ hn, where hn is called a bandwidth. We always assume that

(8.11) hn > 0,   hn → 0,   and   n hn → ∞  as n → ∞.

In what follows we consider only designs in which, for any x ∈ [0, 1], the number of the design points in the hn-neighborhood of x has the magnitude O(n hn) as n → ∞.

8.3.2. The Nadaraya-Watson Kernel Estimator. Consider a smooth or piecewise smooth function K = K(u), u ∈ R. Assume that the support of K is the interval [−1, 1], that is, K(u) = 0 if |u| > 1. The function K is called a kernel function or, simply, a kernel.

Example 8.2. Some classical kernel functions frequently used in practice are:

(i) uniform, K(u) = (1/2) I( |u| ≤ 1 ),
(ii) triangular, K(u) = ( 1 − |u| ) I( |u| ≤ 1 ),
(iii) bi-square, K(u) = (15/16) ( 1 − u² )² I( |u| ≤ 1 ),
(iv) the Epanechnikov kernel, K(u) = (3/4) ( 1 − u² ) I( |u| ≤ 1 ). □

Remark 8.3. Typically, kernels are normalized in such a way that they integrate to one. It can be shown (see Exercise 8.57) that all the kernels introduced above are normalized in this way. □
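For readers who like to check such facts numerically, the short sketch below (Python with NumPy assumed; it is only an illustration, not part of the development) evaluates the four kernels of Example 8.2 on a fine grid and approximates their integrals over [−1, 1], each of which should be close to one.

```python
import numpy as np

# The classical kernels of Example 8.2, each supported on [-1, 1].
kernels = {
    "uniform":      lambda u: 0.5 * (np.abs(u) <= 1),
    "triangular":   lambda u: (1 - np.abs(u)) * (np.abs(u) <= 1),
    "bi-square":    lambda u: (15 / 16) * (1 - u**2) ** 2 * (np.abs(u) <= 1),
    "epanechnikov": lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1),
}

u = np.linspace(-1, 1, 100001)
for name, K in kernels.items():
    # Each numerical integral should be approximately 1, as Remark 8.3 states.
    print(name, np.trapz(K(u), u))
```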

For a chosen kernel and a bandwidth, define the weights υ_{n,i}(x) by

(8.12) υ_{n,i}(x) = K( (xi − x)/hn ) / Σ_{j=1}^n K( (xj − x)/hn ).


The Nadaraya-Watson kernel estimator fn of the regression function f at a given point x ∈ [0, 1] is the linear estimator with the weights defined by (8.12),

(8.13) fn(x) = Σ_{i=1}^n yi K( (xi − x)/hn ) / Σ_{j=1}^n K( (xj − x)/hn ).

Note that the Nadaraya-Watson estimator is an example of a local linear estimator, since outside of the interval [x − hn, x + hn] the weights are equal to zero.

Example 8.4. Consider the uniform kernel defined in Example 8.2 (i). Let N(x, hn) denote the number of the design points in the hn-neighborhood of x. Then the weights in (8.12) have the form

υ_{n,i}(x) = ( 1/N(x, hn) ) I( x − hn < xi < x + hn ).

Thus, in this case, the Nadaraya-Watson estimator is the average of the observed responses that correspond to the design points in the hn-neighborhood of x,

fn(x) = ( 1/N(x, hn) ) Σ_{i=1}^n yi I( x − hn < xi < x + hn ). □
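A minimal sketch of formula (8.13) in Python (NumPy assumed): the Epanechnikov kernel, the bandwidth value, and the toy regression function below are illustrative choices, not prescribed by the text.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def nadaraya_watson(x, x_design, y, h, K=epanechnikov):
    """Nadaraya-Watson estimate fn(x) from (8.13) at a single point x."""
    w = K((x_design - x) / h)      # unnormalized weights K((xi - x)/hn)
    denom = w.sum()
    if denom == 0:                 # no design points fall in the window
        return np.nan
    return np.dot(w, y) / denom    # sum of yi K(.) divided by sum of K(.)

# toy data: f(x) = sin(2*pi*x) observed with N(0, 0.1^2) errors
rng = np.random.default_rng(0)
n = 200
x_design = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x_design) + rng.normal(0, 0.1, n)
print(nadaraya_watson(0.5, x_design, y, h=0.1))   # close to sin(pi) = 0
```

With the uniform kernel in place of the Epanechnikov kernel, this reduces to the local averaging of Example 8.4.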

8.4. Smoothing Kernel Estimator

In Section 8.3, we explained the challenge of controlling the conditional bias of a linear estimator even in the case of a linear regression function. The linear regression function is important as the first step because, as the following lemma shows, any regression function from a Hölder class is essentially a polynomial. The proof of this auxiliary lemma is postponed until the end of this section.

Lemma 8.5. For any function f ∈ Θ(β, L, L1), the following Taylor expansion holds:

(8.14) f(xi) = Σ_{m=0}^{β−1} ( f^(m)(x)/m! ) (xi − x)^m + ρ(xi, x),   0 ≤ x, xi ≤ 1,

where f^(m) denotes the m-th derivative of f. Also, for any xi and x such that | xi − x | ≤ hn, the remainder term ρ(xi, x) satisfies the inequality

(8.15) | ρ(xi, x) | ≤ L hn^β / (β − 1)!.


It turns out that for linear estimators, regular random designs have an advantage over deterministic ones. As we demonstrate in this section, when computing the risk, averaging over the distribution of a random design helps to eliminate a significant portion of the bias.

Next we introduce a linear estimator that guarantees zero bias for any polynomial regression function up to degree β − 1 (see Exercise 8.59). To ease the presentation, we assume that a regular random design is uniform with the probability density p(x) = 1, x ∈ [0, 1]. The extension to a more general case is given in Remark 8.6.

A smoothing kernel estimator fn(x) of degree β − 1 is given by the formula

(8.16) fn(x) = ( 1/(n hn) ) Σ_{i=1}^n yi K( (xi − x)/hn ),   0 < x < 1,

where the smoothing kernel K = K(u), |u| ≤ 1, is bounded, piecewise continuous, and satisfies the normalization and orthogonality conditions

(8.17) ∫_{−1}^{1} K(u) du = 1   and   ∫_{−1}^{1} u^m K(u) du = 0  for m = 1, . . . , β − 1.

Note that the smoothing kernel is orthogonal to all monomials up to degree β − 1.

Remark 8.6. For a general density p(x) of the design points, the smoothing kernel estimator is defined as

(8.18) fn(x) = ( 1/(n hn) ) Σ_{i=1}^n ( yi / p(xi) ) K( (xi − x)/hn ),

where the kernel K(u) satisfies the same conditions as in (8.17). □
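A minimal sketch of the estimator (8.16) for the uniform design (Python, NumPy assumed; illustrative only). For β = 2, any symmetric kernel such as the Epanechnikov kernel already satisfies both conditions in (8.17), since ∫ K = 1 and ∫ u K = 0 by symmetry; the code also verifies this numerically.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def smoothing_kernel_estimate(x, x_design, y, h, K=epanechnikov):
    """Smoothing kernel estimator (8.16) for the uniform design on [0, 1]."""
    return np.sum(y * K((x_design - x) / h)) / (len(x_design) * h)

# numerical check of (8.17) for beta = 2
u = np.linspace(-1, 1, 100001)
print(np.trapz(epanechnikov(u), u))        # ~ 1  (normalization)
print(np.trapz(u * epanechnikov(u), u))    # ~ 0  (orthogonality, m = 1)

rng = np.random.default_rng(1)
n = 500
x_design = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x_design) + rng.normal(0, 0.1, n)
print(smoothing_kernel_estimate(0.5, x_design, y, h=0.1))   # near sin(pi) = 0
```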

Remark 8.7. A smoothing kernel estimator (8.16) requires that x lies strictly inside the unit interval. In fact, the definition of fn(x) is valid for any x such that hn ≤ x ≤ 1 − hn. On the other hand, a linear estimator (8.8) is defined for any x ∈ [0, 1], including the endpoints. Why does the smoothing kernel estimator fail if x coincides with either of the endpoints? If, for instance, x = 0, then for any symmetric kernel K(u), the expected value

( 1/hn ) Ef [ K( xi/hn ) ] = ( 1/hn ) ∫_{0}^{hn} K( xi/hn ) dxi = ∫_{0}^{1} K(u) du = 1/2.

For example, in the situation when the regression function is identically equal to 1, the responses are yi = 1 + εi, where the εi are N(0, σ²) random variables independent of the xi's for all i = 1, . . . , n. The average value of the smoothing kernel estimator at zero is

Ef [ fn(0) ] = Ef [ ( 1/(n hn) ) Σ_{i=1}^n (1 + εi) K( xi/hn ) ] = 1/2,

which is certainly not satisfactory.

A remedy for the endpoints is to define a one-sided kernel that preserves the normalization and orthogonality conditions (8.17). In Exercises 8.61 and 8.62 we formulate some examples related to this topic. □

The next lemma gives upper bounds for the bias and variance of the smoothing kernel estimator (8.16). The proof of the lemma can be found at the end of this section.

Lemma 8.8. For any regression function f ∈ Θ(β, L, L1), at any point x ∈ (0, 1), the bias and variance of the smoothing kernel estimator (8.16) admit the upper bounds, for all large enough n,

| bn(x) | ≤ Ab hn^β   and   Varf [ fn(x) ] ≤ Av/(n hn),

with the constants

Ab = L ‖K‖1 / (β − 1)!   and   Av = ( L1² + σ² ) ‖K‖2²,

where ‖K‖1 = ∫_{−1}^{1} |K(u)| du and ‖K‖2² = ∫_{−1}^{1} K²(u) du.

Remark 8.9. The above lemma clearly indicates that as hn increases, the upper bound for the bias increases, while that for the variance decreases. □

Applying this lemma, we can bound the mean squared risk of fn(x) at a point x ∈ (0, 1) by

(8.19) Ef [ ( fn(x) − f(x) )² ] = bn²(x) + Varf [ fn(x) ] ≤ Ab² hn^{2β} + Av/(n hn).

It is easily seen that the value of hn that minimizes the right-hand side of (8.19) satisfies the equation

(8.20) hn^{2β} = A/(n hn)

with a constant factor A independent of n. This equation is called the balance equation since it reflects the idea of balancing the squared bias and variance terms.

Next, we neglect the constant in the balance equation (8.20), and label the respective optimal bandwidth by a superscript (*). It is a solution of the equation

hn^{2β} = 1/(n hn),

and is equal to

h*_n = n^{−1/(2β+1)}.

Denote by f*_n(x) the smoothing kernel estimator (8.16) corresponding to the optimal bandwidth h*_n,

(8.21) f*_n(x) = ( 1/(n h*_n) ) Σ_{i=1}^n yi K( (xi − x)/h*_n ).

We call this estimator the optimal smoothing kernel estimator.
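Note that the balance equation hn^{2β} = 1/(n hn) is equivalent to hn^{2β+1} = 1/n, which gives the formula h*_n = n^{−1/(2β+1)} above. The tiny sketch below (Python; the sample sizes and smoothness values are only illustrative) tabulates the optimal bandwidth and the corresponding rate n^{−β/(2β+1)} appearing in Proposition 8.10.

```python
# optimal bandwidth and rate implied by the balance equation (8.20)
def optimal_bandwidth(n, beta):
    return n ** (-1.0 / (2 * beta + 1))

def pointwise_rate(n, beta):
    return n ** (-beta / (2 * beta + 1))

for beta in (1, 2, 4):
    for n in (10**2, 10**4, 10**6):
        print(beta, n, optimal_bandwidth(n, beta), pointwise_rate(n, beta))
```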

For the convenience of reference, we formulate the proposition below. Its proof follows directly from the expression (8.19) and the definition of the estimator f*_n(x).

Proposition 8.10. For all large enough n and any f ∈ Θ(β), the quadratic risk of the optimal smoothing kernel estimator (8.21) at a given point x, 0 < x < 1, is bounded from above by

Ef [ ( f*_n(x) − f(x) )² ] ≤ ( Ab² + Av ) n^{−2β/(2β+1)}.

Remark 8.11. Suppose the loss function is the absolute difference at a given point x ∈ (0, 1). Then the supremum over f ∈ Θ(β) of the risk of the estimator f*_n(x) is bounded from above by

sup_{f ∈ Θ(β)} Ef [ | f*_n(x) − f(x) | ] ≤ ( Ab² + Av )^{1/2} n^{−β/(2β+1)}.

This follows immediately from Proposition 8.10 and the Cauchy-Schwarz inequality. □

Finally, we give the proofs of two technical lemmas stated in this section.

Proof of Lemma 8.5. We need to prove that the bound (8.15) for the remainder term is valid. For β = 1, the bound follows from the definition of the Lipschitz class of functions Θ(1, L, L1),

| ρ(xi, x) | = | f(xi) − f(x) | ≤ L | xi − x | ≤ L hn.

If β ≥ 2, then the Taylor expansion with the Lagrange remainder term has the form

(8.22) f(xi) = Σ_{m=0}^{β−2} ( f^(m)(x)/m! ) (xi − x)^m + ( f^(β−1)(x*)/(β − 1)! ) (xi − x)^{β−1},

where x* is an intermediate point between x and xi, so that | x* − x | ≤ hn. This remainder can be transformed into

( f^(β−1)(x*)/(β − 1)! ) (xi − x)^{β−1} = ( f^(β−1)(x)/(β − 1)! ) (xi − x)^{β−1} + ρ(xi, x),

where the new remainder term ρ(xi, x) satisfies the inequality, for any xi and x such that | xi − x | ≤ hn,

| ρ(xi, x) | = ( | f^(β−1)(x*) − f^(β−1)(x) | / (β − 1)! ) | xi − x |^{β−1}
≤ ( L | x* − x | / (β − 1)! ) | xi − x |^{β−1} ≤ ( L hn / (β − 1)! ) hn^{β−1} = L hn^β / (β − 1)!.

In the above, the definition of the Hölder class Θ(β, L, L1) has been applied. □

Proof of Lemma 8.8. Using the definition of the bias and the regression equation yi = f(xi) + εi, we write

bn(x) = ( 1/(n hn) ) Ef [ Σ_{i=1}^n yi K( (xi − x)/hn ) ] − f(x)

(8.23)   = ( 1/(n hn) ) Ef [ Σ_{i=1}^n ( f(xi) + εi ) K( (xi − x)/hn ) ] − f(x).

Now since εi has mean zero and is independent of xi,

Ef [ Σ_{i=1}^n εi K( (xi − x)/hn ) ] = 0.

Also, by the normalization condition,

( 1/hn ) Ef [ K( (xi − x)/hn ) ] = ( 1/hn ) ∫_{x−hn}^{x+hn} K( (xi − x)/hn ) dxi = ∫_{−1}^{1} K(u) du = 1.

Consequently, continuing from (8.23), we can write

(8.24) bn(x) = ( 1/(n hn) ) Ef [ Σ_{i=1}^n ( f(xi) − f(x) ) K( (xi − x)/hn ) ].

Substituting Taylor's expansion (8.14) of the function f(xi) into (8.24), we get that for any β > 1,

| bn(x) | = ( 1/(n hn) ) | Ef [ Σ_{i=1}^n ( Σ_{m=1}^{β−1} f^(m)(x)(xi − x)^m/m! + ρ(xi, x) ) K( (xi − x)/hn ) ] |

≤ ( 1/hn ) | Σ_{m=1}^{β−1} ∫_{x−hn}^{x+hn} ( f^(m)(x)(x1 − x)^m/m! ) K( (x1 − x)/hn ) dx1 |

  + ( 1/hn ) max_{z : |z−x| ≤ hn} | ρ(z, x) | ∫_{x−hn}^{x+hn} | K( (x1 − x)/hn ) | dx1.

In the above, we replaced xi by x1 due to the independence of the design points. If β = 1, we agree to define the sum over m as zero. For any β > 1, this sum equals zero as well, which can be seen from the orthogonality conditions. For m = 1, . . . , β − 1,

∫_{x−hn}^{x+hn} (x1 − x)^m K( (x1 − x)/hn ) dx1 = hn^{m+1} ∫_{−1}^{1} u^m K(u) du = 0.

Thus, using the inequality (8.15) for the remainder term ρ(xi, x), we obtain that for any β ≥ 1, the absolute value of the bias is bounded by

| bn(x) | ≤ ( 1/hn ) max_{z : |z−x| ≤ hn} | ρ(z, x) | ∫_{x−hn}^{x+hn} | K( (x1 − x)/hn ) | dx1

≤ ( L hn^β/(β − 1)! ) ∫_{−1}^{1} | K(u) | du = L ‖K‖1 hn^β/(β − 1)! = Ab hn^β.

Further, to find a bound for the variance of fn(x), we use the independence of the data points to write

Varf [ fn(x) ] = Varf [ ( 1/(n hn) ) Σ_{i=1}^n yi K( (xi − x)/hn ) ]
= ( 1/(n hn)² ) Σ_{i=1}^n Varf [ yi K( (xi − x)/hn ) ].

Now we bound the variance by the second moment, and plug in the regression equation yi = f(xi) + εi,

≤ ( 1/(n hn)² ) Σ_{i=1}^n Ef [ yi² K²( (xi − x)/hn ) ]
= ( 1/(n hn)² ) Σ_{i=1}^n Ef [ ( f(xi) + εi )² K²( (xi − x)/hn ) ]
= ( 1/(n hn)² ) Σ_{i=1}^n Ef [ ( f²(xi) + εi² ) K²( (xi − x)/hn ) ].

Here the cross term disappears because of the independence of εi and xi, and the fact that the expected value of εi is zero. Finally, using the facts that | f(xi) | ≤ L1 and Ef [ εi² ] = σ², we find

≤ ( 1/(n hn)² ) n ( L1² + σ² ) ∫_{x−hn}^{x+hn} K²( (x1 − x)/hn ) dx1
= ( 1/(n hn) ) ( L1² + σ² ) ∫_{−1}^{1} K²(u) du = ( 1/(n hn) ) ( L1² + σ² ) ‖K‖2² = Av/(n hn). □


Exercises

Exercise 8.55. Prove (8.3) for: (i) the quadratic loss at a point,

w( fn − f ) = ( fn(x) − f(x) )²,

and (ii) the mean squared difference,

w( fn − f ) = ( 1/n ) Σ_{i=1}^n ( fn(xi) − f(xi) )².

Exercise 8.56. Prove (8.9) and (8.10).

Exercise 8.57. Show that the kernels introduced in Example 8.2 integrate to one.

Exercise 8.58. Consider the Nadaraya-Watson estimator defined by (8.13). Show that, conditional on the design X, its bias
(i) is equal to zero for any constant regression function f(x) = θ0,
(ii) does not exceed L hn in absolute value for any regression function f ∈ Θ(1, L, L1).

Exercise 8.59. Prove that the smoothing kernel estimator (8.16) is unbiased if the regression function f is a polynomial of order up to β − 1.

Exercise 8.60. Find the normalizing constant C such that the tri-cube kernel function

K(u) = C ( 1 − |u|³ )³ I( |u| ≤ 1 )

integrates to one. What is its degree? Hint: Use (8.17).

Exercise 8.61. To define a smoothing kernel estimator at either endpoint of the unit interval, we can use formula (8.16) with K(u) being a one-sided kernel function (see Remark 8.7).
(i) Show that to estimate the regression function at x = 0, the kernel

K(u) = 4 − 6u,   0 ≤ u ≤ 1,

may be applied, which satisfies the normalization and orthogonality conditions

∫_{0}^{1} K(u) du = 1   and   ∫_{0}^{1} u K(u) du = 0.

(ii) Show that at x = 1, the kernel

K(u) = 4 + 6u,   −1 ≤ u ≤ 0,

may be used, which satisfies the normalization and orthogonality conditions

∫_{−1}^{0} K(u) du = 1   and   ∫_{−1}^{0} u K(u) du = 0.

Exercise 8.62. Refer to Exercise 8.61. We can apply a one-sided smoothing kernel to estimate the regression function f at x where 0 ≤ x ≤ hn. For example, we can take K(u) = 4 − 6u, 0 ≤ u ≤ 1. However, this kernel function does not use the observations located between 0 and x.

To deal with this drawback, we can introduce a family of smoothing kernels Kθ(u) that utilize all the observations to estimate the regression function for any x such that 0 ≤ x ≤ hn.
(i) Let x = xθ = θ hn, 0 ≤ θ ≤ 1. Find a family of smoothing kernels Kθ(u) with the support [−θ, 1], satisfying the normalization and orthogonality conditions

∫_{−θ}^{1} Kθ(u) du = 1   and   ∫_{−θ}^{1} u Kθ(u) du = 0.

Hint: Search for Kθ(u) in the class of linear functions.
(ii) Let x = xθ = 1 − θ hn, 0 ≤ θ ≤ 1. Show that the family of smoothing kernels Kθ(−u), −1 ≤ u ≤ θ, can be applied to estimate f(x) for any x such that 1 − hn ≤ x ≤ 1.


Chapter 9

Local Polynomial Approximation of the Regression Function

9.1. Preliminary Results and Definition

In a small neighborhood of a fixed point x ∈ [0, 1], an unknown nonparametric regression function f(x) can be approximated by a polynomial. This method, called the local polynomial approximation, is introduced in this section. Below we treat the case of the point x lying strictly inside the unit interval, 0 < x < 1. The case of x being one of the endpoints is left as an exercise (see Exercise 9.64).

Choose a bandwidth hn that satisfies the standard conditions (8.11),

hn > 0,   hn → 0,   and   n hn → ∞  as n → ∞.

Let n be so large that the interval [x − hn, x + hn] ⊆ [0, 1]. Denote by N the number of observations in the interval [x − hn, x + hn],

N = #{ i : xi ∈ [x − hn, x + hn] }.

Without loss of generality, we can assume that the observations (xi, yi) are distinct and numbered so that the first N design points belong to this interval,

x − hn ≤ x1 < · · · < xN ≤ x + hn.

Consider the restriction of the original nonparametric Hölder regression function f ∈ Θ(β) = Θ(β, L, L1) to the interval [x − hn, x + hn]. That is, consider f = f(t) where x − hn ≤ t ≤ x + hn. Recall that every function f in Θ(β) is essentially a polynomial of degree β − 1 with a small remainder term described in Lemma 8.5. Let us forget for a moment about the remainder term, and let us try to approximate the nonparametric regression function by a parametric polynomial regression of degree β − 1. The least-squares estimator in the parametric polynomial regression is defined via the solution of the minimization problem with respect to the estimates of the regression coefficients θ0, . . . , θ_{β−1},

(9.1) Σ_{i=1}^N ( yi − [ θ0 + θ1 ( (xi − x)/hn ) + · · · + θ_{β−1} ( (xi − x)/hn )^{β−1} ] )² → min_{θ0, . . . , θ_{β−1}}.

In each monomial, it is convenient to subtract x, the midpoint of the interval [x − hn, x + hn], and to scale by hn so that the monomials do not vanish as hn shrinks.

Recall from Chapter 7 that solving the minimization problem (9.1) is equivalent to solving the system of normal equations

(9.2) ( G′G ) θ = G′y,

where θ = ( θ0, . . . , θ_{β−1} )′ and G = ( g0, . . . , g_{β−1} ) is the design matrix. Its m-th column has the form

gm = ( ( (x1 − x)/hn )^m, . . . , ( (xN − x)/hn )^m )′,   m = 0, . . . , β − 1.

The system of normal equations (9.2) has a unique solution if the matrix G′G is invertible. We always make this assumption. It suffices to require that the design points are distinct and that N ≥ β.

Applying Lemma 8.5, we can present each observation yi as the sum of three components: a polynomial of degree β − 1, a remainder term, and a random error,

(9.3) yi = Σ_{m=0}^{β−1} ( f^(m)(x)/m! ) (xi − x)^m + ρ(xi, x) + εi,

where

| ρ(xi, x) | ≤ L hn^β / (β − 1)! = O(hn^β),   i = 1, . . . , N.

The system of normal equations (9.2) is linear in y; hence each component of yi in (9.3) can be treated separately. The next lemma provides the information about the first, polynomial component.


Lemma 9.1. If each entry of y = (y1, . . . , yN)′ has only the polynomial component, that is,

yi = Σ_{m=0}^{β−1} ( f^(m)(x)/m! ) (xi − x)^m = Σ_{m=0}^{β−1} ( f^(m)(x)/m! ) hn^m ( (xi − x)/hn )^m,   i = 1, . . . , N,

then the least-squares estimates in (9.1) are equal to

θm = ( f^(m)(x)/m! ) hn^m,   m = 0, . . . , β − 1.

Proof. The proof follows immediately if we apply the results of Section 7.1. Indeed, the vector y belongs to the span-space S, so it stays unchanged after projecting on this space. □

To establish results concerning the remainder ρ(xi, x) and the random error term εi in (9.3), some technical preliminaries are needed. In view of the fact that | (xi − x)/hn | ≤ 1, all elements of the matrix G have magnitude O(1) as n increases. That is why, generally speaking, the elements of the matrix G′G have magnitude O(N), assuming that the number of points N may grow with n. These considerations shed light on the following assumption, which plays an essential role in this chapter.

Assumption 9.2. For a given design X, the absolute values of the elements of the covariance matrix ( G′G )^{−1} are bounded from above by γ0 N^{−1} with a constant γ0 independent of n. □

The next lemma presents the results on the remainder and stochastic terms in (9.3).

Lemma 9.3. Suppose Assumption 9.2 holds. Then the following is valid.
(i) If yi = ρ(xi, x), then the solution θ of the system of normal equations (9.2) has the elements θm, m = 0, . . . , β − 1, bounded by

| θm | ≤ Cb hn^β   where   Cb = γ0 β L / (β − 1)!.

(ii) If yi = εi, then the solution θ of the system of normal equations (9.2) has zero-mean normal elements θm, m = 0, . . . , β − 1, the variances of which are bounded by

Varf [ θm | X ] ≤ Cv/N   where   Cv = ( σ γ0 β )².

Proof. (i) As the solution of the normal equations (9.2), θ = ( G′G )^{−1} G′y. All the elements of the matrix G′ are of the form ( (xi − x)/hn )^m, and thus are bounded by one. Therefore, using Assumption 9.2, we conclude that the entries of the β × N matrix ( G′G )^{−1} G′ are bounded by γ0 β/N. Also, from (8.15), the absolute values of the entries of the vector y are bounded by L hn^β/(β − 1)! since they are the remainder terms. After we compute the dot product, N cancels, and we obtain the answer.

(ii) The element θm is the dot product of the m-th row of the matrix ( G′G )^{−1} G′ and the random vector (ε1, . . . , εN)′. Therefore, θm is a sum of independent N(0, σ²) random variables with weights that do not exceed γ0 β/N. This sum has mean zero and variance bounded by N σ² ( γ0 β/N )² = ( σ γ0 β )²/N. □

Combining the results of Lemmas 8.5, 9.1, and 9.3, we arrive at the following conclusion.

Proposition 9.4. Suppose Assumption 9.2 holds. Then the estimate θm, which is the m-th element of the solution of the system of normal equations (9.2), admits the expansion

θm = ( f^(m)(x)/m! ) hn^m + bm + Nm,   m = 0, . . . , β − 1,

where the deterministic term bm is the conditional bias satisfying

| bm | ≤ Cb hn^β,

and the stochastic term Nm has a normal distribution with mean zero and variance bounded by

Varf [ Nm | X ] ≤ Cv/N.

Finally, we are ready to introduce the local polynomial estimator fn(t), which is defined for all t such that x − hn ≤ t ≤ x + hn by

(9.4) fn(t) = θ0 + θ1 ( (t − x)/hn ) + · · · + θ_{β−1} ( (t − x)/hn )^{β−1},

where the least-squares estimators θ0, . . . , θ_{β−1} are as described in Proposition 9.4.

The local polynomial estimator (9.4) corresponding to the bandwidth h*_n = n^{−1/(2β+1)} will be denoted by f*_n(t). Recall from Section 8.4 that h*_n is called the optimal bandwidth, and it solves the equation (h*_n)^{2β} = (n h*_n)^{−1}.

The formula (9.4) is significantly simplified if t = x. In this case the local polynomial estimator is just the estimate of the intercept, fn(x) = θ0.
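A minimal sketch of the estimator (9.4) evaluated at t = x (Python, NumPy assumed; not the authors' code): it builds the design matrix G with columns ((xi − x)/hn)^m for the points in the window, solves the least-squares problem (9.1), and returns the intercept θ0.

```python
import numpy as np

def local_polynomial_estimate(x, x_design, y, h, beta):
    """Local polynomial estimator (9.4) at t = x, i.e. the intercept theta_0."""
    inside = np.abs(x_design - x) <= h          # points in [x - hn, x + hn]
    u = (x_design[inside] - x) / h
    if u.size < beta:                           # need N >= beta for a unique solution
        return np.nan
    G = np.vander(u, N=beta, increasing=True)   # columns u^0, u^1, ..., u^(beta-1)
    theta, *_ = np.linalg.lstsq(G, y[inside], rcond=None)  # solves (9.2)
    return theta[0]                             # fn(x) = theta_0

rng = np.random.default_rng(2)
n, beta = 500, 2
x_design = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x_design) + rng.normal(0, 0.1, n)
h_star = n ** (-1 / (2 * beta + 1))             # optimal bandwidth from Section 8.4
print(local_polynomial_estimate(0.3, x_design, y, h_star, beta))  # near sin(0.6*pi)
```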

Up to this point there was no connection between the number N of design points in the hn-neighborhood of x and the bandwidth hn. Such a connection is necessary if we want to balance the bias and variance terms in Proposition 9.4.

Assumption 9.5. There exists a positive constant γ1, independent of n, such that for all large enough n the inequality N ≥ γ1 n hn holds. □


Now we will prove the result on the conditional quadratic risk at a point of the local polynomial estimator.

Theorem 9.6. Suppose Assumptions 9.2 and 9.5 hold with hn = h*_n = n^{−1/(2β+1)}. Consider the local polynomial estimator f*_n(x) corresponding to h*_n. Then for a given design X, the conditional quadratic risk of f*_n(x) at the point x ∈ (0, 1) admits the upper bound

sup_{f ∈ Θ(β)} Ef [ ( f*_n(x) − f(x) )² | X ] ≤ r^* n^{−2β/(2β+1)},

where the positive constant r^* is independent of n.

Proof. By Proposition 9.4, for any f ∈ Θ(β), the conditional quadratic risk of the local polynomial estimator f*_n is equal to

Ef [ ( f*_n(x) − f(x) )² | X ] = Ef [ ( θ0 − f(x) )² | X ]
= Ef [ ( f(x) + b0 + N0 − f(x) )² | X ] = b0² + Ef [ N0² | X ]
= b0² + Varf [ N0 | X ] ≤ Cb² (h*_n)^{2β} + Cv/N.

Applying Assumption 9.5 and the fact that h*_n satisfies the identity (h*_n)^{2β} = (n h*_n)^{−1} = n^{−2β/(2β+1)}, we obtain that

Ef [ ( f*_n(x) − f(x) )² | X ] ≤ Cb² (h*_n)^{2β} + Cv/( γ1 n h*_n ) = r^* n^{−2β/(2β+1)}

with r^* = Cb² + Cv/γ1. □

Remark 9.7. Proposition 9.4 also opens a way to estimate the derivatives f^(m)(t) of the regression function f. The estimator is especially elegant if t = x,

(9.5) fn^(m)(x) = m! θm / hn^m,   m = 1, . . . , β − 1.

The rate of convergence becomes slower as m increases. In Exercise 9.65, an analogue of Theorem 9.6 is stated with the rate n^{−(β−m)/(2β+1)}. □
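Formula (9.5) is a one-line extension of the sketch given after (9.4): after the local least-squares fit, the m-th coefficient is rescaled by m!/hn^m. The helper below (Python, NumPy assumed; illustrative only, and it requires β > m) reuses the same normal equations.

```python
import numpy as np
from math import factorial

def local_polynomial_derivative(x, x_design, y, h, beta, m):
    """Estimate f^(m)(x) via (9.5): m! * theta_m / h^m from the local fit."""
    inside = np.abs(x_design - x) <= h
    u = (x_design[inside] - x) / h
    if u.size < beta:
        return np.nan
    G = np.vander(u, N=beta, increasing=True)
    theta, *_ = np.linalg.lstsq(G, y[inside], rcond=None)
    return factorial(m) * theta[m] / h**m

rng = np.random.default_rng(5)
n = 2000
x_design = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x_design) + rng.normal(0, 0.1, n)
# f'(0.3) = 2*pi*cos(0.6*pi) for the toy function; note the slower rate for m = 1
print(local_polynomial_derivative(0.3, x_design, y, h=0.15, beta=3, m=1))
```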

9.2. Polynomial Approximation and Regularity of Design

In a further study of the local polynomial approximation, we introduce some regularity rules for a design that guarantee Assumptions 9.2 and 9.5. The lemmas that we state in this section will be proved in Section 9.4.


9.2.1. Regular Deterministic Design. Recall that according to (7.18), the design points are defined on the interval [0, 1] as the quantiles of a distribution with a continuous, strictly positive probability density p(x).

Lemma 9.8. Let the regular deterministic design be defined by (7.18), and suppose the bandwidth hn satisfies the conditions hn → 0 and n hn → ∞ as n → ∞. Let N denote the number of the design points in the interval [x − hn, x + hn]. Then:

(i) xi+1 − xi = ( 1 + α_{i,n} )/( n p(x) ) where max_{1 ≤ i ≤ N} | α_{i,n} | → 0 as n → ∞.
(ii) lim_{n→∞} N/(n hn) = 2 p(x).
(iii) For any continuous function ϕ0(u), u ∈ [−1, 1],

lim_{n→∞} ( 1/(n hn) ) Σ_{i=1}^N ϕ0( (xi − x)/hn ) = p(x) ∫_{−1}^{1} ϕ0(u) du.

Define a matrix D∞^{−1} with the (l, m)-th element given by

(9.6) ( D∞^{−1} )_{l,m} = ( 1/2 ) ∫_{−1}^{1} u^{l+m} du = 1/( 2(l + m + 1) )  if l + m is even,  and  0  if l + m is odd.

The matrix D∞^{−1} has the inverse D∞ (for a proof see Exercise 9.66). The matrix D∞ is a limiting covariance matrix introduced in Chapter 7.
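A quick numerical illustration of (9.6) (Python, NumPy assumed; not from the book): build D∞^{−1} for a given β and invert it, which also gives an empirical check of the invertibility claimed in Exercise 9.66.

```python
import numpy as np

def d_inf_inverse(beta):
    """The beta-by-beta matrix D_infinity^{-1} defined in (9.6)."""
    D_inv = np.zeros((beta, beta))
    for l in range(beta):
        for m in range(beta):
            if (l + m) % 2 == 0:
                D_inv[l, m] = 1.0 / (2 * (l + m + 1))
    return D_inv

for beta in (1, 2, 3, 4):
    D_inv = d_inf_inverse(beta)
    D = np.linalg.inv(D_inv)            # D_infinity
    print(beta, np.linalg.det(D_inv))   # nonzero determinant => invertible
```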

Lemma 9.9. Suppose the assumptions of Lemma 9.8 hold. Then the following limit exists:

lim_{n→∞} N^{−1} ( G′G ) = D∞^{−1},

and the limiting matrix is invertible.

Corollary 9.10. Under the conditions of Lemma 9.8, Assumption 9.2 is fulfilled for all sufficiently large n, and Assumption 9.5 holds with any constant γ1 < 2 p(x).

Corollary 9.11. For the regular deterministic design, the local polynomial estimator f*_n(x) with the bandwidth h*_n = n^{−1/(2β+1)} has the quadratic risk at x ∈ (0, 1) bounded by r^* n^{−2β/(2β+1)}, where the positive constant r^* is independent of n.

9.2.2. Random Uniform Design. To understand the key difficulties with a random design, it suffices to look at the case of design points xi uniformly distributed on the interval [0, 1]. For this design, regularity in the deterministic sense does not hold. That is, it cannot be guaranteed with probability 1 that the distances between two consecutive points are O(1/n) as n → ∞. With positive probability there may be no design points in the interval [x − hn, x + hn], or the interval may contain some points but the system of normal equations (9.2) may be singular (see Exercise 7.51).

In what follows, we concentrate on the case of the optimal bandwidth h*_n = n^{−1/(2β+1)}. Take a small fixed positive number α < 1, and introduce the random event

A = { | N/(n h*_n) − 2 | ≤ α }.

As in the case of the deterministic design, introduce the same matrix D∞^{−1} = lim_{n→∞} N^{−1} ( G′G ) and its inverse D∞. Denote by C* a constant that exceeds the absolute values of all elements of D∞. Define another random event

B = { | ( G′G )^{−1}_{l,m} | ≤ 2 C*/( n h*_n )  for all l, m = 0, . . . , β − 1 }.

Note that these random events depend on n, but this fact is suppressed in the notation.

Recall that the local polynomial estimator (9.4) at t = x is the intercept θ0. In the case of the random uniform design, we redefine the local polynomial estimator as

(9.7) f*_n(x) = θ0  if A ∩ B occurs,  and  f*_n(x) = 0  otherwise.

If the random event A occurs, then Assumption 9.5 holds with γ1 = 2 − α. If also the event B occurs, then Assumption 9.2 holds with γ0 = 2(2 + α) C*. Thus, if both events take place, we can anticipate an upper bound for the quadratic risk similar to the one in Theorem 9.6. If f*_n(x) = 0, this estimator does not estimate the regression function at all. Fortunately, as follows from the two lemmas below (see Remark 9.14), the probability that either A or B fails is negligible as n → ∞. Proofs of these lemmas can be found in the last section.

Lemma 9.12. Let Ā be the complement of the event A. Then

Pf ( Ā ) ≤ 2 α^{−2} n^{−2β/(2β+1)}.

Lemma 9.13. Let B̄ denote the complement of the event B. Then there exists a positive number C, independent of n, such that

Pf ( B̄ ) ≤ C n^{−2β/(2β+1)}.

Remark 9.14. Applying Lemmas 9.12 and 9.13, we see that

Pf ( f*_n(x) = 0 ) = Pf ( Ā ∪ B̄ ) ≤ Pf ( Ā ) + Pf ( B̄ )
≤ 2 α^{−2} n^{−2β/(2β+1)} + C n^{−2β/(2β+1)} → 0  as n → ∞. □


Now we are in a position to prove the main result for the quadratic risk under the random uniform design.

Theorem 9.15. Take the optimal bandwidth h*_n = n^{−1/(2β+1)}. Let the design X be random and uniform on [0, 1]. Then the quadratic risk of the local polynomial estimator f*_n(x) at x defined by (9.7) satisfies the upper bound

sup_{f ∈ Θ(β)} Ef [ ( f*_n(x) − f(x) )² ] ≤ r^{**} n^{−2β/(2β+1)},

where the positive constant r^{**} is independent of n.

Proof. Note that in the statement of Theorem 9.6, the constant r^* depends on the design X only through the constants γ0 and γ1 that appear in Assumptions 9.2 and 9.5. Thus, if the assumptions hold, then r^* is non-random, and averaging over the distribution of the design points does not affect the upper bound. Hence,

Ef [ ( f*_n(x) − f(x) )² I( A ∩ B ) ] ≤ r^* n^{−2β/(2β+1)}.

Applying this inequality and Lemmas 9.12 and 9.13, we have that for all sufficiently large n and any f ∈ Θ(β, L, L1),

Ef [ ( f*_n(x) − f(x) )² ] ≤ Ef [ ( f*_n(x) − f(x) )² I( A ∩ B ) ]
+ Ef [ ( f*_n(x) − f(x) )² I( Ā ) ] + Ef [ ( f*_n(x) − f(x) )² I( B̄ ) ]
≤ r^* n^{−2β/(2β+1)} + L1² [ Pf ( Ā ) + Pf ( B̄ ) ]
≤ [ r^* + 2 L1² α^{−2} + C L1² ] n^{−2β/(2β+1)}.

Finally, we choose r^{**} = r^* + 2 L1² α^{−2} + C L1², and the result follows. □

9.3. Asymptotically Minimax Lower Bound

For the quadratic risk at a point, the results of the previous sections confirm the existence of estimators with the asymptotic rate of convergence ψn = n^{−β/(2β+1)} in the sense of the definition (8.5). This rate is uniform over the Hölder class of regression functions Θ(β). To make sure that we do not miss a better estimator with a faster rate of convergence, we have to prove the lower bound for the minimax risk. In this section, we show that for all large n and for any estimator fn of the regression function f, the inequality

(9.8) sup_{f ∈ Θ(β)} Ef [ ( fn(x) − f(x) )² ] ≥ r_* n^{−2β/(2β+1)}

holds with a positive constant r_* independent of n.


Clearly, the inequality (9.8) does not hold for any design X. For example, if all the design points are concentrated at one point x1 = · · · = xn = x, then our observations (xi, yi) are actually observations in the parametric model

yi = f(x) + εi,   i = 1, . . . , n,

with a real-valued parameter θ = f(x). This parameter can be estimated 1/√n-consistently by simple averaging of the response values yi. On the other hand, if the design points x1, . . . , xn are regular, then the lower bound (9.8) turns out to be true.

9.3.1. Regular Deterministic Design. We start with the case of a deterministic regular design, and prove the following theorem.

Theorem 9.16. Let the deterministic design points be defined by (7.18) with a continuous and strictly positive density p(x), x ∈ [0, 1]. Then for any fixed x, the inequality (9.8) holds.

Proof. To prove the lower bound in (9.8), we use the same trick as in the parametric case (refer to the proof of Lemma 3.4). We replace the supremum over Θ(β) by the Bayes prior distribution concentrated at two points. This time, however, the two points are represented by two regression functions, called the test functions,

f0 = f0(x) ≡ 0   and   f1 = f1(x) ≢ 0,   f1 ∈ Θ(β),   x ∈ [0, 1].

Note that for any estimator fn = fn(x), the supremum exceeds the mean value,

sup_{f ∈ Θ(β)} Ef [ ( fn(x) − f(x) )² ]

(9.9) ≥ ( 1/2 ) Ef0 [ fn²(x) ] + ( 1/2 ) Ef1 [ ( fn(x) − f1(x) )² ].

The expected values Ef0 and Ef1 denote the integration with respect to the distribution of the yi, given the corresponding regression function. Under the hypothesis f = f0 = 0, the response yi = εi ∼ N(0, σ²), while under the alternative f = f1, yi ∼ N( f1(xi), σ² ). Changing the probability measure of integration, we can write the expectation Ef1 in terms of Ef0,

Ef1 [ ( fn(x) − f1(x) )² ]
= Ef0 [ ( fn(x) − f1(x) )² Π_{i=1}^n exp{ −(yi − f1(xi))²/(2σ²) } / exp{ −yi²/(2σ²) } ]

(9.10) = Ef0 [ ( fn(x) − f1(x) )² exp{ Σ_{i=1}^n yi f1(xi)/σ² − Σ_{i=1}^n f1²(xi)/(2σ²) } ].


Now, for a given Hölder class Θ(β, L, L1), we will explicitly introduce a function f1 that belongs to this class. Take a continuous function ϕ(u), u ∈ R. We assume that it is supported on the interval [−1, 1], is positive at the origin, and its β-th derivative is bounded by L. That is, we assume that ϕ(u) = 0 if |u| > 1, ϕ(0) > 0, and | ϕ^(β)(u) | ≤ L. Choose the bandwidth h*_n = n^{−1/(2β+1)}, and put

f1(t) = (h*_n)^β ϕ( (t − x)/h*_n ),   t ∈ [0, 1].

Schematic graphs of the functions ϕ and f1 are given in Figures 6 and 7. These graphs reflect a natural choice of the function ϕ as a "bump". Notice that f1 is a rescaling of ϕ. Indeed, since the support of ϕ is [−1, 1], the function f1 is non-zero only for t such that | (t − x)/h*_n | ≤ 1 or, equivalently, for t ∈ [x − h*_n, x + h*_n], and the value of f1 at x is small, f1(x) = (h*_n)^β ϕ(0).

[Figure 6. A graph of a "bump" function ϕ, supported on [−1, 1] with peak ϕ(0) at u = 0.]

[Figure 7. A graph of the function f1, supported on [x − h*_n, x + h*_n] with peak (h*_n)^β ϕ(0) at t = x.]
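For concreteness, a standard choice of such a "bump" is the infinitely differentiable function c·exp( −1/(1 − u²) ) on (−1, 1), extended by zero outside (a common textbook example, not prescribed here); the constant c would have to be taken small enough that | ϕ^(β)(u) | ≤ L, which we do not verify in this sketch. The code below (Python, NumPy assumed) builds ϕ and the rescaled test function f1.

```python
import numpy as np

def bump(u, c=1.0):
    """A smooth 'bump' on [-1, 1]: c * exp(-1/(1 - u^2)) inside, 0 outside (expects arrays)."""
    out = np.zeros_like(u, dtype=float)
    inside = np.abs(u) < 1
    out[inside] = c * np.exp(-1.0 / (1.0 - u[inside] ** 2))
    return out

def f1(t, x, n, beta, c=1.0):
    """The test function f1(t) = (h*_n)^beta * phi((t - x)/h*_n)."""
    h_star = n ** (-1.0 / (2 * beta + 1))
    return h_star ** beta * bump((t - x) / h_star, c)

t = np.linspace(0, 1, 1001)
print(f1(t, x=0.5, n=10**4, beta=2).max())   # of order (h*_n)^beta * phi(0), hence small
```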

For any n sufficiently large, the function f1 belongs to Θ(β, L, L1). Indeed, since the function ϕ is bounded and h*_n is small, | f1 | is bounded by L1. Also, for any t1, t2 ∈ [0, 1],

| f1^(β−1)(t1) − f1^(β−1)(t2) | = | h*_n ϕ^(β−1)( (t1 − x)/h*_n ) − h*_n ϕ^(β−1)( (t2 − x)/h*_n ) |
≤ h*_n max_{−1 ≤ u ≤ 1} | ϕ^(β)(u) | | (t1 − x)/h*_n − (t2 − x)/h*_n | ≤ L | t1 − t2 |.


Introduce a random event

E = { Σ_{i=1}^n yi f1(xi)/σ² − Σ_{i=1}^n f1²(xi)/(2σ²) ≥ 0 }.

From (9.9) and (9.10), we find that

sup_{f ∈ Θ(β)} Ef [ ( fn(x) − f(x) )² ]
≥ ( 1/2 ) Ef0 [ fn²(x) + ( fn(x) − f1(x) )² exp{ Σ_{i=1}^n yi f1(xi)/σ² − Σ_{i=1}^n f1²(xi)/(2σ²) } ]
≥ ( 1/2 ) Ef0 [ ( fn²(x) + ( fn(x) − f1(x) )² ) I( E ) ].

Here we bound the exponent from below by one, which is valid under the event E. Next, by the elementary inequality a² + (a − b)² ≥ b²/2 with a = fn(x) and b = f1(x) = (h*_n)^β ϕ(0), we get the following bound:

(9.11) sup_{f ∈ Θ(β)} Ef [ ( fn(x) − f(x) )² ] ≥ ( 1/4 ) (h*_n)^{2β} ϕ²(0) Pf0( E ).

What is left to show is that the probability Pf0( E ) is separated away from zero,

(9.12) Pf0( E ) ≥ p0,

where p0 is a positive constant independent of n. In this case, (9.8) holds,

(9.13) sup_{f ∈ Θ(β)} Ef [ ( fn(x) − f(x) )² ] ≥ ( 1/4 ) (h*_n)^{2β} ϕ²(0) p0 = r_* n^{−2β/(2β+1)},

with r_* = (1/4) ϕ²(0) p0. To verify (9.12), note that under the hypothesis f = f0 = 0, the random variable

Z = [ σ² Σ_{i=1}^n f1²(xi) ]^{−1/2} Σ_{i=1}^n yi f1(xi)

has the standard normal distribution. Thus,

lim_{n→∞} Pf0( E ) = lim_{n→∞} Pf0 ( Σ_{i=1}^n yi f1(xi) ≥ ( 1/2 ) Σ_{i=1}^n f1²(xi) )

= lim_{n→∞} Pf0 ( Z ≥ ( 1/(2σ) ) [ Σ_{i=1}^n f1²(xi) ]^{1/2} ) = 1 − lim_{n→∞} Φ( ( 1/(2σ) ) [ Σ_{i=1}^n f1²(xi) ]^{1/2} ).

Finally, we will show that

(9.14) lim_{n→∞} Σ_{i=1}^n f1²(xi) = p(x) ‖ϕ‖2² = p(x) ∫_{−1}^{1} ϕ²(u) du > 0.


Indeed, recall that the optimal bandwidth h*_n satisfies the identity (h*_n)^{2β} = 1/(n h*_n). Using this fact and the assertion of part (iii) of Lemma 9.8, we have that

Σ_{i=1}^n f1²(xi) = Σ_{i=1}^n (h*_n)^{2β} ϕ²( (xi − x)/h*_n )

(9.15) = ( 1/(n h*_n) ) Σ_{i=1}^n ϕ²( (xi − x)/h*_n ) → p(x) ∫_{−1}^{1} ϕ²(u) du  as n → ∞.

Hence (9.14) is true, and the probability Pf0( E ) has a strictly positive limit,

lim_{n→∞} Pf0( E ) = 1 − Φ( ( 1/(2σ) ) [ p(x) ‖ϕ‖2² ]^{1/2} ) > 0.

This completes the proof of the theorem. □

9.3.2. Regular Random Design. Do random designs represent all the points of the interval [0, 1] "fairly" enough to ensure the lower bound (9.8)? It seems plausible, provided the probability density of the design is strictly positive. The following theorem supports this view.

Theorem 9.17. Let the design points x1, . . . , xn be independent identically distributed random variables with a common probability density p(x) which is continuous and strictly positive on [0, 1]. Then at any fixed x ∈ (0, 1), the inequality (9.8) holds.

Proof. See Exercise 9.68. □

9.4. Proofs of Auxiliary Results

Proof of Lemma 9.8. (i) Consider the design points in the hn-neighborhood of x. By the definition (7.17) of the regular deterministic design points, we have

1/n = (i + 1)/n − i/n = F_X(xi+1) − F_X(xi) = p(x*_i)(xi+1 − xi),

where x*_i ∈ (xi, xi+1). Hence,

xi+1 − xi = 1/( n p(x*_i) ).

From the continuity of the density p(x), we have that p(x*_i)(1 + α_{i,n}) = p(x), where α_{i,n} = o(1) → 0 as n → ∞. Therefore,

xi+1 − xi = ( 1 + α_{i,n} )/( n p(x) ).

The quantity | α_{i,n} | can be bounded by a small constant uniformly over i = 1, . . . , n, so that αn = max_{1 ≤ i ≤ N} | α_{i,n} | → 0 as n → ∞.


(ii) Note that, by definition, the number N of observations in the interval [x − hn, x + hn] can be bounded by

2hn / max_{1 ≤ i ≤ N}(xi+1 − xi) − 1 ≤ N ≤ 2hn / min_{1 ≤ i ≤ N}(xi+1 − xi) + 1.

From part (i),

( 1 − αn )/( n p(x) ) ≤ xi+1 − xi ≤ ( 1 + αn )/( n p(x) ),

and, therefore, N is bounded by

(9.16) 2 hn n p(x)/( 1 + αn ) − 1 ≤ N ≤ 2 hn n p(x)/( 1 − αn ) + 1.

Hence, lim_{n→∞} N/(n hn) = 2 p(x).

(iii) Put ui = (xi+1 − x)/hn. From part (i), we have that

(9.17) Δui = ui+1 − ui = (xi+1 − xi)/hn = ( 1 + α_{i,n} )/( n hn p(x) ),

or, equivalently,

1/(n hn) = p(x) Δui/( 1 + α_{i,n} ).

Hence, the bounds take place:

p(x) Δui/( 1 + αn ) ≤ 1/(n hn) ≤ p(x) Δui/( 1 − αn ).

Consequently,

( p(x)/(1 + αn) ) Σ_{i=1}^N ϕ0( (xi − x)/hn ) Δui ≤ ( 1/(n hn) ) Σ_{i=1}^N ϕ0( (xi − x)/hn )
≤ ( p(x)/(1 − αn) ) Σ_{i=1}^N ϕ0( (xi − x)/hn ) Δui,

and the desired convergence follows,

( 1/(n hn) ) Σ_{i=1}^N ϕ0( (xi − x)/hn ) → p(x) ∫_{−1}^{1} ϕ0(u) du. □

Proof of Lemma 9.9. By the definition of the matrix G, we can write

(9.18) ( 1/N ) ( G′G )_{l,m} = ( 1/N ) Σ_{i=1}^N ( (xi − x)/hn )^l ( (xi − x)/hn )^m = ( 1/N ) Σ_{i=1}^N ui^{l+m}.

Next, we want to find bounds for 1/N. From (9.17), we have

( 1 − αn )/( n hn p(x) ) ≤ Δui ≤ ( 1 + αn )/( n hn p(x) ).


Combining this result with (9.16), we obtain

( 2 hn n p(x)/(1 + αn) − 1 )( (1 − αn)/( n hn p(x) ) ) ≤ N Δui ≤ ( 2 hn n p(x)/(1 − αn) + 1 )( (1 + αn)/( n hn p(x) ) ).

Put

β_n = (1 − αn)/(1 + αn) − (1 − αn)/( n hn p(x) ) − 1   and   β̄_n = (1 + αn)/(1 − αn) + (1 − αn)/( n hn p(x) ) − 1.

Thus, we have shown that 2 + β_n ≤ N Δui ≤ 2 + β̄_n, or, equivalently,

Δui/( 2 + β̄_n ) ≤ 1/N ≤ Δui/( 2 + β_n ),

where β_n and β̄_n vanish as n goes to infinity. Therefore, using the expression (9.18), we can bound ( 1/N )( G′G )_{l,m} by

( 1/(2 + β̄_n) ) Σ_{i=1}^N ui^{l+m} Δui ≤ ( 1/N )( G′G )_{l,m} ≤ ( 1/(2 + β_n) ) Σ_{i=1}^N ui^{l+m} Δui.

Both bounds converge as n → ∞ to the integral in the definition (9.6) of D∞^{−1}. The proof that this matrix is invertible is left as an exercise (see Exercise 9.66). □

Before we turn to Lemmas 9.12 and 9.13, we prove the following result. Let g(u) be a continuous function such that | g(u) | ≤ 1 for all u ∈ [−1, 1]. Let x1, . . . , xn be independent random variables with a common uniform distribution on [0, 1]. Introduce independent random variables ηi, i = 1, . . . , n, by

(9.19) ηi = g( (xi − x)/h*_n )  if xi ∈ [x − h*_n, x + h*_n],  and  ηi = 0  otherwise.

Denote by μn the expected value of ηi,

μn = E[ ηi ] = ∫_{x−h*_n}^{x+h*_n} g( (t − x)/h*_n ) dt = h*_n ∫_{−1}^{1} g(u) du.

Result. For any positive number α,

(9.20) P( | ( 1/(n h*_n) )(η1 + · · · + ηn) − μn/h*_n | > α ) ≤ 2/( α² n h*_n ).

Proof. Note that

Var[ ηi ] ≤ E[ ηi² ] = h*_n ∫_{−1}^{1} g²(u) du ≤ 2 h*_n.

Thus, the Chebyshev inequality yields

P( | ( 1/(n h*_n) )(η1 + · · · + ηn) − μn/h*_n | > α ) ≤ n Var[ ηi ]/( α n h*_n )² ≤ 2/( α² n h*_n ). □


Proof of Lemma 9.12. Apply the definition (9.19) of ηi with g = 1. In this case, N = η1 + · · · + ηn and μn/h*_n = 2. Thus, from (9.20) we obtain

Pf ( Ā ) = Pf ( | N/(n h*_n) − 2 | > α ) = Pf ( | ( 1/(n h*_n) )(η1 + · · · + ηn) − 2 | > α ) ≤ 2/( α² n h*_n ).

Finally, note that n h*_n = n^{2β/(2β+1)}. □

Proof of Lemma 9.13. For an arbitrarily small δ > 0, define a random event

C = ∩_{l,m=0}^{β−1} { | ( 1/(2 n h*_n) )( G′G )_{l,m} − ( D∞^{−1} )_{l,m} | ≤ δ }.

First, we want to show that the probability of the complement event

C̄ = ∪_{l,m=0}^{β−1} { | ( 1/(2 n h*_n) )( G′G )_{l,m} − ( D∞^{−1} )_{l,m} | > δ }

is bounded from above,

(9.21) Pf ( C̄ ) ≤ 2 β² δ^{−2} n^{−2β/(2β+1)}.

To see this, put g(u) = ( 1/2 ) u^{l+m} in (9.19). Then

( 1/(2 n h*_n) )( G′G )_{l,m} = ( η1 + · · · + ηn )/( n h*_n )

and

μn/h*_n = ( 1/h*_n ) ∫_{x−h*_n}^{x+h*_n} g( (t − x)/h*_n ) dt = ( 1/2 ) ∫_{−1}^{1} u^{l+m} du = ( D∞^{−1} )_{l,m}.

The inequality (9.20) provides the upper bound 2 δ^{−2} n^{−2β/(2β+1)} for the probability of each event in the union C̄. This proves (9.21).

Next, recall that we denoted by C* a constant that exceeds the absolute values of all elements of the matrix D∞. Due to the continuity of matrix inversion, for any ε ≤ C*, there exists a number δ = δ(ε) such that

C = ∩_{l,m=0}^{β−1} { | ( 1/(2 n h*_n) )( G′G )_{l,m} − ( D∞^{−1} )_{l,m} | ≤ δ(ε) }
⊆ ∩_{l,m=0}^{β−1} { | ( 2 n h*_n )( G′G )^{−1}_{l,m} − ( D∞ )_{l,m} | ≤ ε }
⊆ { | ( G′G )^{−1}_{l,m} | ≤ 2 C*/( n h*_n )  for all l, m = 0, . . . , β − 1 } = B.

The latter inclusion follows from the fact that if ( G′G )^{−1}_{l,m} ≤ ( C* + ε )/( 2 n h*_n ) and ε ≤ C*, it implies that ( G′G )^{−1}_{l,m} ≤ C*/( n h*_n ) ≤ 2 C*/( n h*_n ). Thus, from (9.21), we obtain Pf ( B̄ ) ≤ Pf ( C̄ ) ≤ C n^{−2β/(2β+1)} with C = 2 β² δ^{−2}. □

Exercises

Exercise 9.63. Explain what happens to the local polynomial estimator (9.4) if one of the conditions hn → 0 or n hn → ∞ is violated.

Exercise 9.64. Take x = hn, and consider the local polynomial approximation in the interval [0, 2hn]. Let the estimates of the regression coefficients be defined as the solution of the respective minimization problem (9.1). Define the estimator of the regression function at the origin by

fn(0) = Σ_{m=0}^{β−1} (−1)^m θm.

Find upper bounds for the conditional bias and variance of fn(0) for a fixed design X.

Exercise 9.65. Prove an analogue of Theorem 9.6 for the derivative estimator (9.5),

sup_{f ∈ Θ(β)} Ef [ ( m! θm/(h*_n)^m − f^(m)(x) )² | X ] ≤ r^* n^{−2(β−m)/(2β+1)},

where h*_n = n^{−1/(2β+1)}.

Exercise 9.66. Show that the matrix D∞^{−1} in Lemma 9.9 is invertible.

Exercise 9.67. Let f1 be as defined in the proof of Theorem 9.16, and let x1, . . . , xn be a random design with the probability density p(x) on the interval [0, 1].
(i) Show that the random variable

Σ_{i=1}^n f1²(xi) = ( 1/(n h*_n) ) Σ_{i=1}^n ϕ²( (xi − x)/h*_n )

has an expected value that converges to p(x) ‖ϕ‖2² as n → ∞.
(ii) Prove that the variance of this random variable is O(1/(n hn)) as n → ∞.
(iii) Derive from parts (i) and (ii) that for all sufficiently large n,

Pf0 ( Σ_{i=1}^n f1²(xi) ≤ 2 p(x) ‖ϕ‖2² ) ≥ 1/2.

Exercise 9.68. Apply Exercise 9.67 to prove Theorem 9.17.


Chapter 10

Estimation of Regression in Global Norms

10.1. Regressogram

In Chapters 8 and 9, we gave a detailed analysis of the kernel and local polynomial estimators at a fixed point x inside the interval (0, 1). The asymptotic minimax rate of convergence was found to be ψn = n^{−β/(2β+1)}, which strongly depends on the smoothness parameter β of the regression function.

What if our objective is different? What if we want to estimate the regression function f(x) as a curve in the interval [0, 1]? The global norms serve this purpose. In this chapter, we discuss the regression estimation problems with regard to the continuous and discrete L2-norms, and the sup-norm.

In the current section, we introduce an estimator fn, called a regressogram. A formal definition will be given at the end of the section.

When it comes to regression estimation in the interval [0, 1], we can extend a smoothing kernel estimator (8.16) to be defined in the entire unit interval. However, the estimation at the endpoints x = 0 and x = 1 would cause difficulties. It is more convenient to introduce an estimator defined everywhere in [0, 1] based on the local polynomial estimator (9.4).

Consider a partition of the interval [0, 1] into small subintervals of equal length 2hn. To ease the presentation, assume that Q = 1/(2hn) is an integer. The number Q represents the total number of intervals in the partition. Each small interval

Bq = [ 2(q − 1)hn, 2q hn ),   q = 1, . . . , Q,

is called a bin. It is convenient to introduce notation for the midpoint of the bin Bq. We denote it by cq = (2q − 1)hn, q = 1, . . . , Q.

The local polynomial estimator (9.4) is defined separately for each bin. If we want to estimate the regression function at every point x ∈ [0, 1], we must consider a collection of local polynomial estimators. Introduce Q minimization problems, one for each bin,

(10.1) Σ_{i=1}^n ( yi − [ θ_{0,q} + θ_{1,q} ( (xi − cq)/hn ) + · · · + θ_{β−1,q} ( (xi − cq)/hn )^{β−1} ] )² I( xi ∈ Bq ) → min_{θ_{0,q}, . . . , θ_{β−1,q}},   for x ∈ Bq,  q = 1, . . . , Q.

Note that these minimization problems are totally disconnected. Each of them involves only the observations whose design points belong to the respective bin Bq. The estimates of the regression coefficients are marked by a double subscript, representing the coefficient number and the bin number. There should also be a subscript "n", which we omit to avoid too cumbersome a notation.

As in Section 9.1, it is easier to interpret the minimization problems (10.1) if they are written in vector notation. Denote by N1, . . . , NQ the number of the design points in each bin, N1 + · · · + NQ = n. For a fixed q = 1, . . . , Q, let

x_{1,q} < · · · < x_{Nq,q}

be the design points in the bin Bq, and let the corresponding response values have matching indices y_{1,q}, . . . , y_{Nq,q}. Denote by

θq = ( θ_{0,q}, . . . , θ_{β−1,q} )′

the vector of the estimates of the regression coefficients in the q-th bin. The vectors θq satisfy the systems of normal equations

(10.2) ( G′_q Gq ) θq = G′_q yq,   q = 1, . . . , Q,

where yq = ( y_{1,q}, . . . , y_{Nq,q} )′, and the matrix Gq = [ g_{0,q}, . . . , g_{β−1,q} ] has the columns

g_{m,q} = ( ( (x_{1,q} − cq)/hn )^m, . . . , ( (x_{Nq,q} − cq)/hn )^m )′,   m = 0, . . . , β − 1.

The results of Section 9.1 were based on Assumptions 9.2 and 9.5. In this section, we combine their analogues into one assumption. Provided this assumption holds, the systems of normal equations (10.2) have unique solutions for all q = 1, . . . , Q.

Assumption 10.1. There exist positive constants γ0 and γ1, independent of n and q, such that for all q = 1, . . . , Q,
(i) the absolute values of the elements of the matrix ( G′_q Gq )^{−1} are bounded from above by γ0/Nq,
(ii) the number of observations Nq in the q-th bin is bounded from below, Nq ≥ γ1 n hn. □

Now we are ready to define the piecewise polynomial estimator fn(x) in the entire interval [0, 1]. This estimator is called a regressogram, and is computed according to the formula

(10.3) fn(x) = θ_{0,q} + θ_{1,q} ( (x − cq)/hn ) + · · · + θ_{β−1,q} ( (x − cq)/hn )^{β−1}  if x ∈ Bq,

where the estimates θ_{0,q}, . . . , θ_{β−1,q} satisfy the normal equations (10.2), q = 1, . . . , Q.
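A minimal sketch of the regressogram (10.3) in Python (NumPy assumed; not the authors' code): split [0, 1] into Q = 1/(2hn) bins, solve the least-squares problem (10.1) separately in each bin, and evaluate the fitted local polynomial of the bin containing the query point. The bandwidth and test function are illustrative.

```python
import numpy as np

def regressogram_fit(x_design, y, h, beta):
    """Solve (10.1) in each bin; return bin centers and coefficient vectors."""
    Q = int(round(1.0 / (2 * h)))              # number of bins (assumed integer)
    centers, thetas = [], []
    for q in range(Q):
        left, right = 2 * q * h, 2 * (q + 1) * h
        c_q = (2 * q + 1) * h                  # bin midpoint c_q
        inside = (x_design >= left) & (x_design < right)
        u = (x_design[inside] - c_q) / h
        G = np.vander(u, N=beta, increasing=True)
        theta, *_ = np.linalg.lstsq(G, y[inside], rcond=None)  # normal equations (10.2)
        centers.append(c_q)
        thetas.append(theta)
    return np.array(centers), np.array(thetas)

def regressogram_predict(x, centers, thetas, h):
    q = min(int(x // (2 * h)), len(centers) - 1)   # index of the bin containing x
    u = (x - centers[q]) / h
    return np.polyval(thetas[q][::-1], u)          # theta_0 + theta_1 u + ...

rng = np.random.default_rng(3)
n, beta, h = 1000, 2, 0.05
x_design = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x_design) + rng.normal(0, 0.1, n)
centers, thetas = regressogram_fit(x_design, y, h, beta)
print(regressogram_predict(0.25, centers, thetas, h))   # near sin(pi/2) = 1
```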

10.2. Integral L2-Norm Risk for the Regressogram

Consider the regressogram fn(x), x ∈ [0, 1], defined by (10.3). The following statement is an adaptation of Proposition 9.4 for the components of θq. We omit its proof.

Proposition 10.2. Suppose that for a given design X, Assumption 10.1 holds. Assume that the regression function f belongs to a Hölder class Θ(β, L, L1). Then the m-th element θ_{m,q} of the vector θq, which satisfies the system of normal equations (10.2), admits the expansion

θ_{m,q} = ( f^(m)(cq)/m! ) hn^m + b_{m,q} + N_{m,q},   m = 0, . . . , β − 1,  q = 1, . . . , Q,

where the conditional bias b_{m,q} is bounded from above, | b_{m,q} | ≤ Cb hn^β, and the stochastic term N_{m,q} has a normal distribution with mean zero. Its variance is bounded from above, Varf [ N_{m,q} | X ] ≤ Cv/Nq. Here the constants Cb and Cv are independent of n. Conditionally, given the design X, the random variables N_{m,q} are independent for different values of q.

The next theorem answers the question about the integral L2-norm risk of the regressogram.

Theorem 10.3. Let the design X be such that Assumption 10.1 holds with the bandwidth h*_n = n^{−1/(2β+1)}. Then the mean integrated quadratic risk of the regressogram fn(x) admits the upper bound

sup_{f ∈ Θ(β)} Ef [ ∫_0^1 ( fn(x) − f(x) )² dx | X ] ≤ r^* n^{−2β/(2β+1)}

for some positive constant r^* independent of n.

Proof. From Lemma 8.5, for any f ∈ Θ(β, L, L1) and for any bin Bq centered at cq, the Taylor expansion is valid:

f(x) = f(cq) + f^(1)(cq)(x − cq) + · · · + ( f^(β−1)(cq)/(β − 1)! )(x − cq)^{β−1} + ρ(x, cq)
= Σ_{m=0}^{β−1} ( f^(m)(cq)/m! ) (h*_n)^m ( (x − cq)/h*_n )^m + ρ(x, cq),

where the remainder term ρ(x, cq) satisfies the inequality | ρ(x, cq) | ≤ Cρ (h*_n)^β with Cρ = L/(β − 1)!.

Applying Proposition 10.2, the Taylor expansion of f, and the definition of the regressogram (10.3), we get the expression for the quadratic risk

Ef [ ∫_0^1 ( fn(x) − f(x) )² dx | X ] = Σ_{q=1}^Q Ef [ ∫_{Bq} ( fn(x) − f(x) )² dx | X ]
= Σ_{q=1}^Q Ef [ ∫_{Bq} [ Σ_{m=0}^{β−1} ( b_{m,q} + N_{m,q} ) ( (x − cq)/h*_n )^m − ρ(x, cq) ]² dx | X ].

Using the fact that the random variables N_{m,q} have mean zero, we can write the latter expectation of the integral over Bq as the sum of a deterministic and a stochastic term,

∫_{Bq} ( Σ_{m=0}^{β−1} b_{m,q} ( (x − cq)/h*_n )^m − ρ(x, cq) )² dx
+ ∫_{Bq} Ef [ ( Σ_{m=0}^{β−1} N_{m,q} ( (x − cq)/h*_n )^m )² | X ] dx.

From the bounds for the bias and the remainder term, the first integrand can be estimated from above by a constant,

( Σ_{m=0}^{β−1} b_{m,q} ( (x − cq)/h*_n )^m − ρ(x, cq) )² ≤ ( Σ_{m=0}^{β−1} | b_{m,q} | + | ρ(x, cq) | )²
≤ ( β Cb (h*_n)^β + Cρ (h*_n)^β )² = C_D (h*_n)^{2β} = C_D n^{−2β/(2β+1)},

where C_D = ( β Cb + Cρ )².


Note that the random variables N_{m,q} may be correlated for a fixed q and different m's. Using a special case of the Cauchy-Schwarz inequality, ( a0 + · · · + a_{β−1} )² ≤ β ( a0² + · · · + a_{β−1}² ), Proposition 10.2, and Assumption 10.1, we bound the second integrand from above by

Ef [ ( Σ_{m=0}^{β−1} N_{m,q} ( (x − cq)/h*_n )^m )² | X ] ≤ β Σ_{m=0}^{β−1} Varf [ N_{m,q} | X ]
≤ β² Cv/Nq ≤ β² Cv/( γ1 n h*_n ) = C_S/( n h*_n ) = C_S n^{−2β/(2β+1)},   where C_S = β² Cv/γ1.

Thus, combining the deterministic and stochastic terms, we arrive at the upper bound

Ef [ ∫_0^1 ( fn(x) − f(x) )² dx | X ] ≤ Σ_{q=1}^Q ∫_{Bq} ( C_D + C_S ) n^{−2β/(2β+1)} dx
= r^* n^{−2β/(2β+1)}  with r^* = C_D + C_S. □

Remark 10.4. Under Assumption 10.1, the results of Lemmas 9.8 and 9.9 stay valid uniformly over the bins Bq, q = 1, . . . , Q. Therefore, we can extend the statement of Corollary 9.11 to the integral L2-norm. For the regular deterministic design, the unconditional quadratic risk in the integral L2-norm of the regressogram fn with the bandwidth h*_n = n^{−1/(2β+1)} admits the upper bound

sup_{f ∈ Θ(β)} Ef [ ∫_0^1 ( fn(x) − f(x) )² dx ] ≤ r^* n^{−2β/(2β+1)},

where the positive constant r^* is independent of n. A similar result is also true for the regular random design (cf. Theorem 9.15). Unfortunately, it is too technical, and we skip its proof. □

Remark 10.5. The m-th derivative of the regressogram (10.3) has the form

d^m fn(x)/dx^m = Σ_{i=m}^{β−1} θ_{i,q} ( i!/(i − m)! ) ( 1/hn^m ) ( (x − cq)/hn )^{i−m}  if x ∈ Bq,  0 ≤ m ≤ β − 1.

Under the same choice of the bandwidth, h*_n = n^{−1/(2β+1)}, this estimator admits an upper bound similar to the one in Theorem 10.3 with the rate n^{−(β−m)/(2β+1)}, that is,

(10.4) sup_{f ∈ Θ(β)} Ef [ ∫_0^1 ( d^m fn(x)/dx^m − d^m f(x)/dx^m )² dx | X ] ≤ r^* n^{−2(β−m)/(2β+1)}.

For the proof see Exercise 10.69. □


10.3. Estimation in the Sup-Norm

In this section we study the asymptotic performance of the sup-norm risk of the regressogram fn(x) defined in (10.3). The sup-norm risk for a fixed design X is given by

(10.5) Ef [ ‖ fn − f ‖∞ | X ] = Ef [ sup_{0 ≤ x ≤ 1} | fn(x) − f(x) | | X ].

Our starting point is Proposition 10.2. It is a very powerful result that allows us to control the risk under any loss function. We use this proposition to prove the following theorem.

Theorem 10.6. Let the design X be such that Assumption 10.1 holds with the bandwidth hn = ( (ln n)/n )^{1/(2β+1)}. Let fn be the regressogram that corresponds to this bandwidth. Then the conditional sup-norm risk (10.5) admits the upper bound

(10.6) sup_{f ∈ Θ(β)} Ef [ ‖ fn − f ‖∞ | X ] ≤ r^* ( (ln n)/n )^{β/(2β+1)},

where r^* is a positive constant independent of n and f.

Proof. Applying Lemma 8.5, we can write the sup-norm of the difference fn − f as

(10.7) ‖ fn − f ‖∞ = max_{1 ≤ q ≤ Q} sup_{x ∈ Bq} | Σ_{m=0}^{β−1} θ_{m,q} ( (x − cq)/hn )^m − Σ_{m=0}^{β−1} ( f^(m)(cq) hn^m/m! ) ( (x − cq)/hn )^m − ρ(x, cq) |,

where Q = 1/(2hn) is the number of bins, and the q-th bin is the interval Bq = [ cq − hn, cq + hn ) centered at x = cq, q = 1, . . . , Q. The remainder term ρ(x, cq) satisfies the inequality | ρ(x, cq) | ≤ Cρ hn^β with the constant Cρ = L/(β − 1)!. Applying the formula for θ_{m,q} from Proposition 10.2 and the fact that | x − cq |/hn ≤ 1, we obtain that

‖ fn − f ‖∞ ≤ max_{1 ≤ q ≤ Q} sup_{x ∈ Bq} | Σ_{m=0}^{β−1} ( b_{m,q} + N_{m,q} ) ( (x − cq)/hn )^m | + Cρ hn^β

(10.8)  ≤ [ β Cb hn^β + Cρ hn^β ] + max_{1 ≤ q ≤ Q} Σ_{m=0}^{β−1} | N_{m,q} |.

Introduce the standard normal random variables

Z_{m,q} = ( Varf [ N_{m,q} | X ] )^{−1/2} N_{m,q}.

From the upper bound on the variance in Proposition 10.2, we find that

(10.9) max_{1 ≤ q ≤ Q} Σ_{m=0}^{β−1} | N_{m,q} | ≤ max_{1 ≤ q ≤ Q} √( Cv/Nq ) Σ_{m=0}^{β−1} | Z_{m,q} | ≤ √( Cv/( γ1 n hn ) ) Z*,

where

Z* = max_{1 ≤ q ≤ Q} Σ_{m=0}^{β−1} | Z_{m,q} |.

Note that the random variables $Z_{m,q}$ are independent for different bins, but may be correlated for different values of $m$ within the same bin.

Putting together (10.8) and (10.9), we get the upper bound for the sup-norm loss,
$$
(10.10)\qquad \|\hat{f}_n - f\|_\infty \le \big[\beta C_b + C_\rho\big]h_n^\beta + \sqrt{\frac{C_v}{\gamma_1 n h_n}}\ Z^*.
$$

To continue, we need the following technical result, which we ask to be proved in Exercise 10.70.

Result. There exists a constant $C_z > 0$ such that
$$
(10.11)\qquad \mathbb{E}\big[Z^* \,\big|\, \mathcal{X}\big] \le C_z\sqrt{\ln n}.
$$

Finally, under our choice of $h_n$, it is easily seen that
$$
n h_n = n^{2\beta/(2\beta+1)}(\ln n)^{1/(2\beta+1)} = (\ln n)\Big(\frac{n}{\ln n}\Big)^{2\beta/(2\beta+1)} = (\ln n)\,h_n^{-2\beta}.
$$
These results along with (10.10) yield
$$
\mathbb{E}_f\big[\,\|\hat{f}_n - f\|_\infty \,\big|\, \mathcal{X}\big]
\le \big[\beta C_b + C_\rho\big]h_n^\beta + C_z\sqrt{\frac{C_v\ln n}{\gamma_1 n h_n}}
\le \big[\beta C_b + C_\rho + C_z\sqrt{C_v/\gamma_1}\big]h_n^\beta = r^*\,h_n^\beta. \qquad\Box
$$

Remark 10.7. As Theorem 10.6 shows, the upper bound of the risk in the sup-norm contains an extra log-factor as compared to the case of the $L_2$-norm. The source of this additional factor becomes clear from (10.11). Indeed, the maximum of the random noise has the magnitude $O(\sqrt{\ln n})$ as $n\to\infty$. That is why the optimal choice of the bandwidth comes from the balance equation $h_n^\beta = \sqrt{(nh_n)^{-1}\ln n}$. $\Box$
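For completeness, here is the short calculation (not spelled out in the text) showing how the balance equation of Remark 10.7 determines the bandwidth used in Theorem 10.6:
$$
h_n^\beta = \sqrt{\frac{\ln n}{n h_n}}
\;\Longleftrightarrow\; h_n^{2\beta} = \frac{\ln n}{n h_n}
\;\Longleftrightarrow\; h_n^{2\beta+1} = \frac{\ln n}{n}
\;\Longrightarrow\; h_n = \Big(\frac{\ln n}{n}\Big)^{1/(2\beta+1)},\qquad
h_n^\beta = \Big(\frac{\ln n}{n}\Big)^{\beta/(2\beta+1)},
$$
which is exactly the rate appearing on the right-hand side of (10.6).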


10.4. Projection on Span-Space and Discrete MISE

The objective of this section is to study the discrete mean integrated squared error (MISE) of the regressogram (10.3). The regressogram is a piecewise polynomial estimator that can be written in the form
$$
\hat{f}_n(x) = \sum_{q=1}^{Q}\mathbb{I}(x\in B_q)\sum_{m=0}^{\beta-1}\hat{\theta}_{m,q}\Big(\frac{x-c_q}{h_n}\Big)^m
$$
where the $q$-th bin $B_q = [\,2(q-1)h_n,\ 2qh_n)$, and $c_q = (2q-1)h_n$ is its center. Here $Q = 1/(2h_n)$ is an integer that represents the number of bins. The rates of convergence in the $L_2$-norm and sup-norm found in the previous sections were partially based on the fact that the bias of the regressogram has the magnitude $O(h_n^\beta)$ uniformly in $f\in\Theta(\beta)$ at any point $x\in[0,1]$. Indeed, from Proposition 10.2, we get
$$
\sup_{f\in\Theta(\beta)}\ \sup_{0\le x\le 1}\Big|\mathbb{E}_f\big[\hat{f}_n(x)-f(x)\big]\Big|
\le \sup_{1\le q\le Q}\ \sup_{x\in B_q}\ \sum_{m=0}^{\beta-1}\big|b_{m,q}\big|\,\Big|\frac{x-c_q}{h_n}\Big|^m
\le C_b\,\beta\,h_n^\beta.
$$
In turn, this upper bound for the bias is an immediate consequence of the Taylor approximation in Lemma 8.5.

In this section, we take a different approach. Before we proceed, we need to introduce some notation. Define a set of $\beta Q$ piecewise monomial functions,
$$
(10.12)\qquad \gamma_{m,q}(x) = \mathbb{I}(x\in B_q)\Big(\frac{x-c_q}{h_n}\Big)^m,
\quad q = 1,\dots,Q,\ \ m = 0,\dots,\beta-1.
$$
The regressogram $\hat{f}_n(x)$ is a linear combination of these monomials,
$$
(10.13)\qquad \hat{f}_n(x) = \sum_{q=1}^{Q}\sum_{m=0}^{\beta-1}\hat{\theta}_{m,q}\,\gamma_{m,q}(x), \quad 0\le x\le 1,
$$
where the $\hat{\theta}_{m,q}$ are the estimates of the regression coefficients in the bins.

In the argument above, it was important that any function $f(x)\in\Theta(\beta)$ admits an approximation by a linear combination of $\{\gamma_{m,q}(x)\}$ with an error not exceeding $O(h_n^\beta)$. This property does not exclusively belong to the set of piecewise monomials. We will prove results in a generalized setting for which the regressogram is a special case.

What changes should be made if instead of the piecewise monomials (10.12) we use some other functions? In place of the indices $m$ and $q$ in the monomials (10.12) we will use a single index $k$ for a set of functions $\gamma_k(x)$, $k = 1,\dots,K$. The number of functions $K = K(n)\to\infty$ as $n\to\infty$. In the case of the monomials (10.12), the number $K = \beta Q$.

Consider the regression observations $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$, with a Hölder regression function $f(x)\in\Theta(\beta)$. As before, we assume that the design points are distinct and ordered in the interval $[0,1]$,
$$
0\le x_1 < x_2 < \cdots < x_n \le 1.
$$
We want to estimate the regression function $f(x)$ by a linear combination
$$
(10.14)\qquad \hat{f}_n(x) = \hat{\theta}_1\gamma_1(x) + \cdots + \hat{\theta}_K\gamma_K(x), \quad x\in[0,1],
$$
in the least-squares sense over the design points. To do this, we have to solve the following minimization problem:
$$
(10.15)\qquad \sum_{i=1}^{n}\Big(y_i - \big[\theta_1\gamma_1(x_i) + \cdots + \theta_K\gamma_K(x_i)\big]\Big)^2 \to \min_{\theta_1,\dots,\theta_K}.
$$

Define the design matrix $\Gamma$ with columns
$$
(10.16)\qquad \boldsymbol{\gamma}_k = \big(\gamma_k(x_1),\dots,\gamma_k(x_n)\big)', \quad k = 1,\dots,K.
$$
From this definition, the matrix $\Gamma$ has the dimensions $n\times K$. The vector $\hat{\boldsymbol{\vartheta}} = (\hat{\theta}_1,\dots,\hat{\theta}_K)'$ of estimates in (10.15) satisfies the system of normal equations
$$
(10.17)\qquad \Gamma'\Gamma\,\hat{\boldsymbol{\vartheta}} = \Gamma'\mathbf{y}
$$
where $\mathbf{y} = (y_1,\dots,y_n)'$.

Depending on the design $\mathcal{X}$, the normal equations (10.17) may have a unique solution or multiple solutions. If this system has a unique solution, then the estimate $\hat{f}_n(x)$ can be restored at any point $x$ by (10.14). But even when (10.17) does not have a unique solution, we can still approximate the regression function $f(x)$ at the design points, relying on the geometry of the problem.

In the $n$-dimensional space of observations $\mathbb{R}^n$, define the linear span-space $S$ generated by the columns $\boldsymbol{\gamma}_k$ of the matrix $\Gamma$. With a minor abuse of notation, we also denote by $S$ the operator in $\mathbb{R}^n$ of the orthogonal projection onto the span-space $S$. Introduce a vector consisting of the values of the regression function at the design points,
$$
\mathbf{f} = \big(f(x_1),\dots,f(x_n)\big)', \quad \mathbf{f}\in\mathbb{R}^n,
$$
and a vector of estimates at these points,
$$
\hat{\mathbf{f}}_n = S\mathbf{y} = \big(\hat{f}_n(x_1),\dots,\hat{f}_n(x_n)\big)'.
$$
Note that this projection is correctly defined regardless of whether (10.17) has a unique solution or not.
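As an illustration of (10.14)-(10.17), the following minimal Python sketch (ours, not from the book) builds the design matrix $\Gamma$ for the piecewise monomial basis (10.12) and computes the projection $S\mathbf{y}$ by least squares; `numpy.linalg.lstsq` returns a minimum-norm solution, so the fitted values at the design points are well defined even when the normal equations have multiple solutions. All variable names are ours.

```python
import numpy as np

def monomial_design_matrix(x, beta, Q):
    """Design matrix Gamma for the piecewise monomials (10.12)."""
    h = 1.0 / (2 * Q)                              # bins of length 2h
    cols = []
    for q in range(1, Q + 1):
        c_q = (2 * q - 1) * h                      # bin center
        in_bin = (x >= 2 * (q - 1) * h) & (x < 2 * q * h)
        for m in range(beta):
            cols.append(in_bin * ((x - c_q) / h) ** m)
    return np.column_stack(cols)                   # shape (n, beta * Q)

rng = np.random.default_rng(0)
n, beta, Q, sigma = 500, 2, 10, 0.3
x = np.sort(rng.uniform(0, 1, n))
f = np.sin(2 * np.pi * x)                          # toy regression function
y = f + sigma * rng.normal(size=n)

Gamma = monomial_design_matrix(x, beta, Q)
theta_hat, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
f_hat = Gamma @ theta_hat                          # the projection S y at the design points
print("discrete MISE:", np.mean((f_hat - f) ** 2))
```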


Denote by $\boldsymbol{\varepsilon} = (\varepsilon_1,\dots,\varepsilon_n)'$ the vector of independent $\mathcal{N}(0,\sigma^2)$-random errors. In this notation, we can interpret $\hat{\mathbf{f}}_n$ as a vector sum of two projections,
$$
(10.18)\qquad \hat{\mathbf{f}}_n = S\mathbf{y} = S\mathbf{f} + S\boldsymbol{\varepsilon}.
$$
Our goal is to find an upper bound on the discrete MISE. Conditionally on the design $\mathcal{X}$, the discrete MISE has the form
$$
\mathbb{E}_f\Big[\frac1n\sum_{i=1}^{n}\big(\hat{f}_n(x_i)-f(x_i)\big)^2 \,\Big|\, \mathcal{X}\Big]
= \frac1n\,\mathbb{E}_f\big[\,\|S\mathbf{f}+S\boldsymbol{\varepsilon}-\mathbf{f}\|^2 \,\big|\, \mathcal{X}\big]
$$
$$
(10.19)\qquad \le \frac2n\,\|S\mathbf{f}-\mathbf{f}\|^2 + \frac2n\,\mathbb{E}_f\big[\,\|S\boldsymbol{\varepsilon}\|^2 \,\big|\, \mathcal{X}\big]
$$
where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^n$. Here we used the inequality $(a+b)^2\le 2(a^2+b^2)$.

Denote by $\dim(S)$ the dimension of the span-space $S$. Note that necessarily $\dim(S)\le K$. In many special cases, this inequality turns into equality, $\dim(S) = K$. For example, it is true for the regressogram under Assumption 10.1 (see Exercise 10.72).

Assumption 10.8. There exists $\delta_n$, $\delta_n\to 0$ as $n\to\infty$, such that for any $f\in\Theta(\beta)$, the following inequality is fulfilled:
$$
\frac1n\,\|S\mathbf{f}-\mathbf{f}\|^2 \le \delta_n^2. \qquad\Box
$$

Proposition 10.9. Let Assumption 10.8 hold. Then the following upper bound on the discrete MISE holds:
$$
(10.20)\qquad \mathbb{E}_f\Big[\frac1n\sum_{i=1}^{n}\big(\hat{f}_n(x_i)-f(x_i)\big)^2 \,\Big|\, \mathcal{X}\Big]
\le 2\delta_n^2 + \frac{2\sigma^2\dim(S)}{n}.
$$

Proof. The normal random errors $\varepsilon_i$, $i = 1,\dots,n$, are conditionally independent, given the design points. Therefore, the square of the Euclidean norm $\|S\boldsymbol{\varepsilon}\|^2$ has a $\sigma^2\chi^2$-distribution with $\dim(S)$ degrees of freedom. Thus,
$$
\mathbb{E}_f\big[\,\|S\boldsymbol{\varepsilon}\|^2 \,\big|\, \mathcal{X}\big] = \sigma^2\dim(S).
$$
From Assumption 10.8, we find that the right-hand side of (10.19) is bounded from above by $2\delta_n^2 + 2\sigma^2\dim(S)/n$. $\Box$

Proposition 10.10. Assume that for any regression function $f\in\Theta(\beta)$, there exists a linear combination $a_1\gamma_1(x) + \cdots + a_K\gamma_K(x)$ such that at any design point $x_i$, the following inequality holds:
$$
(10.21)\qquad \big|\,[a_1\gamma_1(x_i) + \cdots + a_K\gamma_K(x_i)] - f(x_i)\,\big| \le \delta_n, \quad i = 1,\dots,n.
$$
Then the upper bound (10.20) is valid.


Proof. Recall that $S$ is an operator of orthogonal projection, and, therefore, $S\mathbf{f}$ is the vector in $S$ closest to $\mathbf{f}$. Applying (10.21), we see that
$$
\frac1n\,\|S\mathbf{f}-\mathbf{f}\|^2 \le \frac1n\sum_{i=1}^{n}\big|\,[a_1\gamma_1(x_i) + \cdots + a_K\gamma_K(x_i)] - f(x_i)\,\big|^2 \le \delta_n^2,
$$
and Assumption 10.8 holds. $\Box$

In the next theorem, we describe the asymptotic performance of the discrete MISE of the regressogram.

Theorem 10.11. For any fixed design $\mathcal{X}$, the discrete MISE of the regressogram $\hat{f}_n(x)$ given by (10.3) satisfies the inequality
$$
(10.22)\qquad \mathbb{E}_f\Big[\frac1n\sum_{i=1}^{n}\big(\hat{f}_n(x_i)-f(x_i)\big)^2 \,\Big|\, \mathcal{X}\Big]
\le 2C_\rho^2 h_n^{2\beta} + \frac{2\sigma^2\beta Q}{n}
$$
where $C_\rho = L/(\beta-1)!$. Moreover, under the optimal choice of the bandwidth $h_n^* = n^{-1/(2\beta+1)}$, there exists a positive constant $r^*$, independent of $n$ and $f\in\Theta(\beta)$, such that the following upper bound holds:
$$
\mathbb{E}_f\Big[\frac1n\sum_{i=1}^{n}\big(\hat{f}_n(x_i)-f(x_i)\big)^2 \,\Big|\, \mathcal{X}\Big]
\le r^*\,n^{-2\beta/(2\beta+1)}.
$$

Proof. In the case of the regressogram, $\dim(S)\le K = \beta Q$. The Taylor approximation of $f(x)$ in Lemma 8.5 within each bin guarantees the inequality (10.21) with $\delta_n = C_\rho h_n^\beta$. Hence Proposition 10.10 yields the upper bound (10.22). Now recall that $Q/n = 1/(2nh_n)$. Under the optimal choice of the bandwidth, both terms on the right-hand side of (10.22) have the same magnitude $O\big(n^{-2\beta/(2\beta+1)}\big)$. $\Box$

10.5. Orthogonal Series Regression Estimator

In this section we take a different approach to estimation of the regression function $f$. We will be concerned with estimation of its Fourier coefficients. The functional class to which $f$ belongs will differ from the Hölder class.

10.5.1. Preliminaries. A set of functions $\mathcal{B}$ is called an orthonormal basis in $L_2[0,1]$ if: (1) the $L_2$-norm in the interval $[0,1]$ of any function in this set is equal to one, that is, $\int_0^1 g^2(x)\,dx = 1$ for any $g\in\mathcal{B}$, and (2) the dot product of any two distinct functions in $\mathcal{B}$ is zero, that is, $\int_0^1 g_1(x)g_2(x)\,dx = 0$ for any $g_1,g_2\in\mathcal{B}$, $g_1\ne g_2$.

Consider the following set of functions defined for all $x$ in $[0,1]$:
$$
(10.23)\qquad \big\{\,1,\ \sqrt{2}\sin(2\pi x),\ \sqrt{2}\cos(2\pi x),\ \dots,\ \sqrt{2}\sin(2\pi kx),\ \sqrt{2}\cos(2\pi kx),\ \dots\big\}.
$$


This set is referred to as a trigonometric basis. The next proposition is a standard result from analysis. We omit its proof.

Proposition 10.12. The trigonometric basis in (10.23) is an orthonormal basis.

Choose the trigonometric basis as a working basis in the space $L_2[0,1]$. For any function $f(x)$, $0\le x\le 1$, introduce its Fourier coefficients by
$$
a_0 = \int_0^1 f(x)\,dx, \qquad a_k = \int_0^1 f(x)\sqrt{2}\cos(2\pi kx)\,dx,
$$
and
$$
b_k = \int_0^1 f(x)\sqrt{2}\sin(2\pi kx)\,dx, \quad k = 1,2,\dots.
$$
The trigonometric basis is complete in the sense that if $\|f\|_2 < \infty$, and
$$
f_m(x) = a_0 + \sum_{k=1}^{m} a_k\sqrt{2}\cos(2\pi kx) + \sum_{k=1}^{m} b_k\sqrt{2}\sin(2\pi kx), \quad 0\le x\le 1,
$$
then
$$
\lim_{m\to\infty}\|f_m(\cdot)-f(\cdot)\|_2 = 0.
$$
Thus, a function $f$ with a finite $L_2$-norm is equivalent to its Fourier series
$$
f(x) = a_0 + \sum_{k=1}^{\infty} a_k\sqrt{2}\cos(2\pi kx) + \sum_{k=1}^{\infty} b_k\sqrt{2}\sin(2\pi kx),
$$
though they may differ at the points of discontinuity.

The next lemma links the decrease rate of the Fourier coefficients with $\beta$, $\beta\ge 1$, the smoothness parameter of $f$.

Lemma 10.13. If $\sum_{k=1}^{\infty}(a_k^2+b_k^2)\,k^{2\beta}\le L$ for some constant $L$, then
$$
\|f^{(\beta)}\|_2^2 \le (2\pi)^{2\beta}L.
$$

Proof. We restrict the calculations to the case $\beta = 1$. See Exercise 10.73 for the proof of the general case. For $\beta = 1$, we have
$$
f'(x) = \sum_{k=1}^{\infty}\big[a_k(-2\pi k)\sqrt{2}\sin(2\pi kx) + b_k(2\pi k)\sqrt{2}\cos(2\pi kx)\big].
$$
Thus,
$$
\|f'\|_2^2 = (2\pi)^2\sum_{k=1}^{\infty}k^2\big(a_k^2+b_k^2\big) \le (2\pi)^2 L. \qquad\Box
$$


10.5.2. Discrete Fourier Series and Regression. Consider the observations
$$
(10.24)\qquad y_i = f(i/n) + \varepsilon_i, \quad \varepsilon_i\sim\mathcal{N}(0,\sigma^2), \quad i = 1,\dots,n.
$$
To ease the presentation, it is convenient to assume that $n = 2n_0+1$ is an odd number, $n_0\ge 1$.

Our goal is to estimate the regression function $f$ in the integral sense in $L_2[0,1]$. Since our observations are discrete and available only at the equidistant design points $x_i = i/n$, we want to restore the regression function exclusively at these points.

Consider the values of the trigonometric basis functions in (10.23) at the design points $x_i = i/n$, $i = 1,\dots,n$,
$$
(10.25)\qquad \Big\{\,1,\ \sqrt{2}\sin\Big(\frac{2\pi i}{n}\Big),\ \sqrt{2}\cos\Big(\frac{2\pi i}{n}\Big),\ \dots,\ \sqrt{2}\sin\Big(\frac{2\pi ki}{n}\Big),\ \sqrt{2}\cos\Big(\frac{2\pi ki}{n}\Big),\ \dots\Big\}.
$$
For any functions $g$, $g_1$ and $g_2\in L_2[0,1]$, define the discrete dot product and the respective squared $L_2$-norm by the Riemann sums
$$
\big(g_1(\cdot),g_2(\cdot)\big)_{2,n} = \frac1n\sum_{i=1}^{n}g_1(i/n)\,g_2(i/n)
\qquad\text{and}\qquad
\|g(\cdot)\|_{2,n}^2 = \frac1n\sum_{i=1}^{n}g^2(i/n).
$$
Clearly, the values at the design points of each function in (10.25) represent a vector in $\mathbb{R}^n$. Therefore, there cannot be more than $n$ orthonormal functions with respect to the discrete dot product. As shown in the lemma below, the functions in (10.25) corresponding to $k = 1,\dots,n_0$ form an orthonormal basis with respect to this dot product.
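The discrete orthonormality claimed in Lemma 10.14 below is easy to verify numerically; the following short sketch (ours, not from the book) builds the $n$ basis vectors for an odd $n = 2n_0+1$ and checks that their Gram matrix under the discrete dot product is the identity.

```python
import numpy as np

n0 = 10
n = 2 * n0 + 1                       # n must be odd: n = 2*n0 + 1
i = np.arange(1, n + 1)              # design points x_i = i/n

# columns: 1, sqrt(2) sin(2 pi k i / n), sqrt(2) cos(2 pi k i / n), k = 1..n0
columns = [np.ones(n)]
for k in range(1, n0 + 1):
    columns.append(np.sqrt(2) * np.sin(2 * np.pi * k * i / n))
    columns.append(np.sqrt(2) * np.cos(2 * np.pi * k * i / n))
Phi = np.column_stack(columns)       # shape (n, n)

gram = Phi.T @ Phi / n               # discrete dot products (g_k, g_l)_{2,n}
print(np.max(np.abs(gram - np.eye(n))))   # ~1e-15: orthonormal up to rounding
```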

Lemma 10.14. Fix $n = 2n_0+1$ for some $n_0\ge 1$. For $i = 1,\dots,n$, the system of functions
$$
\Big\{\,1,\ \sqrt{2}\sin\Big(\frac{2\pi i}{n}\Big),\ \sqrt{2}\cos\Big(\frac{2\pi i}{n}\Big),\ \dots,\ \sqrt{2}\sin\Big(\frac{2\pi n_0 i}{n}\Big),\ \sqrt{2}\cos\Big(\frac{2\pi n_0 i}{n}\Big)\Big\}
$$
is orthonormal with respect to the discrete dot product.

Proof. For any $k$ and $l$, the elementary trigonometric identities hold:
$$
(10.26)\qquad \sin\Big(\frac{2\pi ki}{n}\Big)\sin\Big(\frac{2\pi li}{n}\Big)
= \frac12\Big[\cos\Big(\frac{2\pi(k-l)i}{n}\Big) - \cos\Big(\frac{2\pi(k+l)i}{n}\Big)\Big]
$$
and
$$
(10.27)\qquad \cos\Big(\frac{2\pi ki}{n}\Big)\cos\Big(\frac{2\pi li}{n}\Big)
= \frac12\Big[\cos\Big(\frac{2\pi(k-l)i}{n}\Big) + \cos\Big(\frac{2\pi(k+l)i}{n}\Big)\Big].
$$
Also, as shown in Exercise 10.74, for any integer $m\not\equiv 0\ (\mathrm{mod}\ n)$,
$$
(10.28)\qquad \sum_{i=1}^{n}\cos\Big(\frac{2\pi mi}{n}\Big) = \sum_{i=1}^{n}\sin\Big(\frac{2\pi mi}{n}\Big) = 0.
$$


Now fix $k\ne l$ such that $k,l\le n_0$. Note that then $k\pm l\not\equiv 0\ (\mathrm{mod}\ n)$. Letting $m = k\pm l$ in (10.28), and applying (10.26)-(10.28), we obtain that
$$
\sum_{i=1}^{n}\sin\Big(\frac{2\pi ki}{n}\Big)\sin\Big(\frac{2\pi li}{n}\Big)
= \frac12\sum_{i=1}^{n}\cos\Big(\frac{2\pi(k-l)i}{n}\Big) - \frac12\sum_{i=1}^{n}\cos\Big(\frac{2\pi(k+l)i}{n}\Big) = 0
$$
and
$$
\sum_{i=1}^{n}\cos\Big(\frac{2\pi ki}{n}\Big)\cos\Big(\frac{2\pi li}{n}\Big)
= \frac12\sum_{i=1}^{n}\cos\Big(\frac{2\pi(k-l)i}{n}\Big) + \frac12\sum_{i=1}^{n}\cos\Big(\frac{2\pi(k+l)i}{n}\Big) = 0,
$$
which yields that the respective dot products are zeros.

If $k = l\le n_0$, then from (10.26)-(10.28), we have
$$
\sum_{i=1}^{n}\sin^2\Big(\frac{2\pi ki}{n}\Big) = \frac12\sum_{i=1}^{n}\cos(0) = \frac n2
\qquad\text{and}\qquad
\sum_{i=1}^{n}\cos^2\Big(\frac{2\pi ki}{n}\Big) = \frac12\sum_{i=1}^{n}\cos(0) = \frac n2.
$$
These imply the normalization condition,
$$
\Big\|\sqrt{2}\sin\Big(\frac{2\pi ki}{n}\Big)\Big\|_{2,n}^2 = \Big\|\sqrt{2}\cos\Big(\frac{2\pi ki}{n}\Big)\Big\|_{2,n}^2 = 1.
$$
Finally, for any $k,l\le n_0$, from the identity
$$
\sin\Big(\frac{2\pi ki}{n}\Big)\cos\Big(\frac{2\pi li}{n}\Big)
= \frac12\Big[\sin\Big(\frac{2\pi(k+l)i}{n}\Big) + \sin\Big(\frac{2\pi(k-l)i}{n}\Big)\Big]
$$
and (10.28), we have
$$
\sum_{i=1}^{n}\sin\Big(\frac{2\pi ki}{n}\Big)\cos\Big(\frac{2\pi li}{n}\Big)
= \frac12\sum_{i=1}^{n}\sin\Big(\frac{2\pi(k+l)i}{n}\Big) + \frac12\sum_{i=1}^{n}\sin\Big(\frac{2\pi(k-l)i}{n}\Big) = 0. \qquad\Box
$$


For a given integer $\beta$ and a positive constant $L$, introduce a class of functions $\Theta_{2,n} = \Theta_{2,n}(\beta,L)$ defined at the design points $x_i = i/n$, $i = 1,\dots,n$. We say that $f\in\Theta_{2,n}(\beta,L)$ if
$$
f(i/n) = a_0 + \sum_{k=1}^{n_0}\Big[a_k\sqrt{2}\cos\Big(\frac{2\pi ki}{n}\Big) + b_k\sqrt{2}\sin\Big(\frac{2\pi ki}{n}\Big)\Big], \quad n = 2n_0+1,
$$
where the Fourier coefficients $a_k$ and $b_k$ satisfy the condition
$$
\sum_{k=1}^{n_0}\big[a_k^2+b_k^2\big]k^{2\beta} \le L.
$$
Note that there are a total of $n$ Fourier coefficients, $a_0,a_1,\dots,a_{n_0},b_1,\dots,b_{n_0}$, that define any function $f$ in the class $\Theta_{2,n}(\beta,L)$. It should be expected, because any such function is equivalent to a vector in $\mathbb{R}^n$.

The class $\Theta_{2,n}(\beta,L)$ replaces the Hölder class $\Theta(\beta,L,L_1)$ in our earlier studies. However, in view of Lemma 10.13, the parameter $\beta$ still represents the smoothness of the regression functions.

We want to estimate the regression function $f$ in the discrete MISE. The quadratic risk, for which we preserve the notation $R_n(\hat{f}_n,f)$, has the form
$$
R_n(\hat{f}_n,f) = \mathbb{E}_f\big[\,\|\hat{f}_n(\cdot)-f(\cdot)\|_{2,n}^2\big]
= \mathbb{E}_f\Big[\frac1n\sum_{i=1}^{n}\big(\hat{f}_n(i/n)-f(i/n)\big)^2\Big].
$$

Thus far, we have worked with the sine and cosine functions separately. It is convenient to combine them in a single notation. Put $\varphi_0(i/n) = 1$ and $c_0 = a_0$. For $m = 1,\dots,n_0$, take
$$
\varphi_{2m}(i/n) = \sqrt{2}\cos\Big(\frac{2\pi mi}{n}\Big), \quad c_{2m} = a_m,
$$
and
$$
\varphi_{2m-1}(i/n) = \sqrt{2}\sin\Big(\frac{2\pi mi}{n}\Big), \quad c_{2m-1} = b_m.
$$

Note that altogether we have $n$ basis functions $\varphi_k$, $k = 0,\dots,n-1$. They satisfy the orthonormality conditions
$$
\big(\varphi_k(\cdot),\varphi_l(\cdot)\big)_{2,n} = \frac1n\sum_{i=1}^{n}\varphi_k(i/n)\,\varphi_l(i/n) = 0 \quad\text{for } k\ne l,
$$
$$
(10.29)\qquad\text{and}\qquad \|\varphi_k(\cdot)\|_{2,n}^2 = \frac1n\sum_{i=1}^{n}\varphi_k^2(i/n) = 1.
$$
The regression function $f$ at the design points $x_i = i/n$ can be written as
$$
f(i/n) = \sum_{k=0}^{n-1}c_k\,\varphi_k(i/n).
$$


It is easier to study the estimation problem with respect to the discrete $L_2$-norm in the space of the Fourier coefficients $c_k$, called the sequence space. Assume that these Fourier coefficients are estimated by $\hat{c}_k$, $k = 0,\dots,n-1$. Then the estimator of the regression function in the original space can be expressed by the sum
$$
(10.30)\qquad \hat{f}_n(i/n) = \sum_{k=0}^{n-1}\hat{c}_k\,\varphi_k(i/n).
$$

Lemma 10.15. The discrete MISE of the estimator $\hat{f}_n$ in (10.30) can be presented as
$$
R_n(\hat{f}_n,f) = \mathbb{E}_f\big[\,\|\hat{f}_n(\cdot)-f(\cdot)\|_{2,n}^2\big] = \mathbb{E}_f\Big[\sum_{k=0}^{n-1}\big(\hat{c}_k-c_k\big)^2\Big].
$$

Proof. By the definition of the risk function $R_n(\hat{f}_n,f)$, we have
$$
R_n(\hat{f}_n,f) = \mathbb{E}_f\Big[\Big\|\sum_{k=0}^{n-1}\big(\hat{c}_k-c_k\big)\varphi_k(\cdot)\Big\|_{2,n}^2\Big]
= \mathbb{E}_f\Big[\sum_{k,l=0}^{n-1}\big(\hat{c}_k-c_k\big)\big(\hat{c}_l-c_l\big)\big(\varphi_k(\cdot),\varphi_l(\cdot)\big)_{2,n}\Big]
= \mathbb{E}_f\Big[\sum_{k=0}^{n-1}\big(\hat{c}_k-c_k\big)^2\Big]
$$
where we used the fact that the basis functions are orthonormal. $\Box$

To switch from the original observations $y_i$ to the corresponding observations in the sequence space, consider the following transformation:
$$
(10.31)\qquad z_k = \big(y,\varphi_k(\cdot)\big)_{2,n} = \frac1n\sum_{i=1}^{n}y_i\,\varphi_k(i/n), \quad k = 0,\dots,n-1.
$$

Lemma 10.16. The random variables $z_k$, $k = 0,\dots,n-1$, defined by (10.31) satisfy the equations
$$
z_k = c_k + \sigma\xi_k/\sqrt{n}, \quad k = 0,\dots,n-1,
$$
for some independent standard normal random variables $\xi_k$.

Proof. First, observe that
$$
y_i = c_0\varphi_0(i/n) + \cdots + c_{n-1}\varphi_{n-1}(i/n) + \varepsilon_i
$$
where the error terms $\varepsilon_i$ are independent $\mathcal{N}(0,\sigma^2)$-random variables, $i = 1,\dots,n$. Thus, for any $k = 0,\dots,n-1$, we can write
$$
z_k = \frac1n\sum_{i=1}^{n}\big[c_0\varphi_0(i/n) + \cdots + c_{n-1}\varphi_{n-1}(i/n)\big]\varphi_k(i/n)
+ \frac1n\sum_{i=1}^{n}\varepsilon_i\,\varphi_k(i/n).
$$
By the orthonormality conditions (10.29), the first sum is equal to $c_k$, and the second one can be written as $\sigma\xi_k/\sqrt{n}$ where
$$
\xi_k = \frac{\sqrt{n}}{\sigma}\Big(\frac1n\sum_{i=1}^{n}\varepsilon_i\,\varphi_k(i/n)\Big)
= \Big(\frac{\sigma^2}{n^2}\sum_{i=1}^{n}\varphi_k^2(i/n)\Big)^{-1/2}\frac1n\sum_{i=1}^{n}\varepsilon_i\,\varphi_k(i/n)
\sim \mathcal{N}(0,1).
$$
As a result,
$$
z_k = c_k + \sigma\xi_k/\sqrt{n}.
$$
It remains to show that the $\xi$'s are independent. Since they are normally distributed, it suffices to show that they are uncorrelated. This in turn follows from the independence of the $\varepsilon$'s and the orthogonality of the $\varphi$'s. Indeed, we have that for any $k\ne l$ such that $k,l = 0,\dots,n-1$,
$$
\mathrm{Cov}(\xi_k,\xi_l) = \frac{1}{\sigma^2 n}\,\mathbb{E}\Big[\Big(\sum_{i=1}^{n}\varepsilon_i\,\varphi_k(i/n)\Big)\Big(\sum_{i=1}^{n}\varepsilon_i\,\varphi_l(i/n)\Big)\Big]
= \frac{1}{\sigma^2 n}\,\mathbb{E}\Big[\sum_{i=1}^{n}\varepsilon_i^2\,\varphi_k(i/n)\,\varphi_l(i/n)\Big]
= \frac1n\sum_{i=1}^{n}\varphi_k(i/n)\,\varphi_l(i/n) = 0. \qquad\Box
$$

The orthogonal series (or projection) estimator of the regression function $f$ is defined by
$$
(10.32)\qquad \hat{f}_n(i/n) = \sum_{k=0}^{M} z_k\,\varphi_k(i/n), \quad i = 1,\dots,n,
$$
where $M = M(n)$ is an integer parameter of the estimation procedure. Note that $\hat{f}_n(i/n)$ is indeed an estimator, because it is computable from the original observed responses $y_1,\dots,y_n$. The parameter $M$ serves to balance the bias and variance errors of the estimation. The choice
$$
M = M(n) = \lfloor (h_n^*)^{-1}\rfloor = \lfloor n^{1/(2\beta+1)}\rfloor,
$$
where $\lfloor\cdot\rfloor$ denotes the integer part of a number, turns out to be optimal in the minimax sense. We will prove only the upper bound.
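The estimator (10.31)-(10.32) is straightforward to implement; the sketch below (our illustration, with an arbitrary choice of the true coefficients) computes the empirical coefficients $z_k$, keeps the first $M+1$ of them, and reconstructs the fit at the design points.

```python
import numpy as np

def trig_basis(n):
    """Columns phi_0, ..., phi_{n-1} evaluated at i/n, n odd."""
    n0 = (n - 1) // 2
    i = np.arange(1, n + 1)
    cols = [np.ones(n)]
    for m in range(1, n0 + 1):
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * m * i / n))  # phi_{2m-1}
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * m * i / n))  # phi_{2m}
    return np.column_stack(cols)

rng = np.random.default_rng(1)
beta, sigma, n0 = 2, 0.5, 100
n = 2 * n0 + 1
Phi = trig_basis(n)

c = 1.0 / (1.0 + np.arange(n)) ** (beta + 1)    # toy coefficients, fast decay
f = Phi @ c                                     # true values f(i/n)
y = f + sigma * rng.normal(size=n)              # observations (10.24)

z = Phi.T @ y / n                               # z_k of (10.31)
M = int(n ** (1.0 / (2 * beta + 1)))            # cutoff of the series
f_hat = Phi[:, : M + 1] @ z[: M + 1]            # estimator (10.32)
print("discrete MISE:", np.mean((f_hat - f) ** 2))
```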

Theorem 10.17. Assume that the regression function $f$ belongs to the class $\Theta_{2,n}(\beta,L)$. Then, uniformly over this class, the quadratic risk in the discrete MISE of the orthogonal series estimator $\hat{f}_n$ given by (10.32) with $M = \lfloor n^{1/(2\beta+1)}\rfloor$ is bounded from above,
$$
R_n(\hat{f}_n,f) = \mathbb{E}_f\big[\,\|\hat{f}_n-f\|_{2,n}^2\big] \le \big[\sigma^2 + 4^\beta L\big]\,n^{-2\beta/(2\beta+1)}.
$$


Proof. Consider the orthogonal series estimator $\hat{f}_n(i/n)$ specified by (10.32) with $M = \lfloor n^{1/(2\beta+1)}\rfloor$. Comparing this definition to the general form (10.30) of an estimator given by a Fourier series, we see that in this instance, the estimators of the Fourier coefficients $c_k$, $k = 0,\dots,n-1$, have the form
$$
\hat{c}_k = \begin{cases} z_k, & \text{if } k = 0,\dots,M,\\ 0, & \text{if } k = M+1,\dots,n-1.\end{cases}
$$
Now applying Lemmas 10.15 and 10.16, we get
$$
\mathbb{E}_f\big[\,\|\hat{f}_n-f\|_{2,n}^2\big] = \mathbb{E}_f\Big[\sum_{k=0}^{M}(z_k-c_k)^2\Big] + \sum_{k=M+1}^{n-1}c_k^2
$$
$$
(10.33)\qquad = \frac{\sigma^2}{n}\,\mathbb{E}_f\Big[\sum_{k=0}^{M}\xi_k^2\Big] + \sum_{k=M+1}^{n-1}c_k^2
= \frac{\sigma^2 M}{n} + \sum_{k=M+1}^{n-1}c_k^2.
$$
Next, let $M_0 = \lfloor (M+1)/2\rfloor$. By the definitions of the functional space $\Theta_{2,n}$ and the basis functions $\varphi_k$, the following inequalities hold:
$$
\sum_{k=M+1}^{n-1}c_k^2 \le \sum_{k=M_0}^{n_0}\big[a_k^2+b_k^2\big]
\le M_0^{-2\beta}\sum_{k=M_0}^{n_0}\big[a_k^2+b_k^2\big]k^{2\beta} \le L\,M_0^{-2\beta}.
$$
Substituting this estimate into (10.33), and noticing that $M_0\ge M/2$, we obtain that
$$
R_n(\hat{f}_n,f) \le \frac{\sigma^2 M}{n} + L\,M_0^{-2\beta}
\le \frac{\sigma^2 M}{n} + 2^{2\beta}L\,M^{-2\beta} \le \big[\sigma^2 + 4^\beta L\big]\,n^{-2\beta/(2\beta+1)}. \qquad\Box
$$

Exercises

Exercise 10.69. Verify the rate of convergence in (10.4).

Exercise 10.70. Prove the inequality (10.11).

Exercise 10.71. Prove a more accurate bound in (10.11),
$$
\mathbb{E}\big[Z^*\,\big|\,\mathcal{X}\big] \le C_z\sqrt{\ln Q} \quad\text{where } Q = 1/(2h_n).
$$
Also show that the respective balance equation for the sup-norm in Remark 10.7 is $h_n^\beta = \sqrt{(nh_n)^{-1}\ln h_n^{-1}}$. Solve this equation assuming that $n$ is large.


Exercise 10.72. Prove that for the regressogram under Assumption 10.1, $\dim(S) = K = \beta Q$.

Exercise 10.73. Prove Lemma 10.13 for any integer $\beta > 1$.

Exercise 10.74. Prove formula (10.28).


Chapter 11

Estimation by Splines

11.1. In Search of Smooth Approximation

The regressogram approach to estimation of a regression function $f$ has several advantages. First of all, it is a piecewise polynomial approximation that requires the computation and storage of only $\beta Q$ coefficients. Second, the computational process for these coefficients is divided into $Q$ sub-problems, each of dimension $\beta$, which does not increase with $n$. Third, along with the regression function $f$, its derivatives $f^{(m)}$ up to the order $\beta-1$ can be estimated (see Remark 10.5). The fourth advantage is that the regressogram works in the whole interval $[0,1]$, and the endpoints do not need special treatment.

A big disadvantage of the regressogram is that it suggests a discontinuous function as an approximation of a smooth regression function $f(x)$. An immediate idea is to smooth the regressogram, substituting it by a convolution with some kernel $K$,
$$
\text{smoother of } \hat{f}_n(x) = \int_{x-h_n}^{x+h_n}\frac{1}{h_n}\,K\Big(\frac{t-x}{h_n}\Big)\hat{f}_n(t)\,dt.
$$
Unfortunately, the convolution smoother has shortcomings as well. The endpoint effect persists, so at the endpoints the estimator has to be defined separately. Besides, unless the kernel itself is a piecewise polynomial function, the smoother is no longer a piecewise polynomial, with the ensuing computational difficulties.

A natural question arises: Is it possible to find a piecewise polynomial estimator $\hat{f}_n$ of the regression function $f$ in the interval $[0,1]$ that would be a smooth function up to a certain order? It turns out that the answer to this question is positive.

Suppose that we still have $Q$ bins, and the regression function $f$ is approximated in each bin by a polynomial. If the regression function belongs to the Hölder class $\Theta(\beta,L,L_1)$, then the justifiable order of each polynomial is $\beta-1$. Indeed, beyond this order, we do not have any control over the deterministic remainder term in Proposition 10.2. However, it is impossible to have a smooth polynomial estimator $\hat{f}_n$ with continuous derivatives up to the order $\beta-1$. It would impose $\beta$ constraints at each knot (breakpoint) between the bins, $2h_nq$, $q = 1,\dots,Q-1$. Thus, the polynomial coefficients in each next bin would be identical to those in the previous one. This makes $\hat{f}_n(x)$ a single polynomial of order $\beta-1$. Clearly, we cannot approximate a Hölder regression function by a single polynomial in the whole interval $[0,1]$.

Is it possible to define a piecewise polynomial $\hat{f}_n$ that has $\beta-2$ continuous derivatives in $[0,1]$? This question makes sense if $\beta\ge 2$ (for $\beta = 2$ it means that the function itself is continuous). The answer to this question is affirmative. In $Q$ bins we have $\beta Q$ polynomial coefficients. At the $Q-1$ inner knots between the bins we impose $(\beta-1)(Q-1) = \beta Q - (Q+\beta-1)$ gluing conditions to guarantee the continuity of $\hat{f}_n$ along with its derivatives up to the order $\beta-2$. Still $Q+\beta-1$ degrees of freedom are left, at least one per bin, which can be used to ensure some approximation quality of the estimator.

In the spirit of formula (10.14), we can try to define a smooth piecewise polynomial approximation by
$$
\hat{f}_n(x) = \hat{\theta}_1\gamma_1(x) + \cdots + \hat{\theta}_K\gamma_K(x), \quad x\in[0,1],
$$
where $\gamma_1(x),\dots,\gamma_K(x)$ are piecewise polynomials. We require these functions to be linearly independent in order to form a basis in $[0,1]$. We will show that there exists a theoretically and computationally convenient basis of piecewise polynomial functions called B-splines. To introduce the B-splines, we need some auxiliary results presented in the next section.

11.2. Standard B-splines

Consider linearly independent piecewise polynomial functions in $Q$ bins, each of order $\beta-1$ with $\beta-2$ continuous derivatives. Let us address the question: What is the maximum number of such functions? As we argued in the previous section, the answer is $Q+\beta-1$. We can rephrase our argument in the following way. In the first bin we can have $\beta$ linearly independent polynomials (for example, all the monomials of order $m = 0,\dots,\beta-1$). At each of the $Q-1$ inner knots, $\beta-1$ constraints are imposed on the continuity of the derivatives. This leaves just one degree of freedom for each of the other $Q-1$ bins. Thus, the number of piecewise polynomials in the basis equals $\beta + (Q-1)$.

First, we give the definition of a standard B-spline. Here "B" is short for "basis" spline. It is defined for infinitely many bins of unit length with integer knots. A standard B-spline serves as a building block for a basis of B-splines in the interval $[0,1]$.

A standard B-spline of order $m$, denoted by $S_m(u)$, $u\in\mathbb{R}$, is a function satisfying the recurrent convolution
$$
(11.1)\qquad S_m(u) = \int_{-\infty}^{\infty}S_{m-1}(z)\,\mathbb{I}_{[0,1)}(u-z)\,dz, \quad m = 2,3,\dots,
$$
with the initial standard spline $S_1(u) = \mathbb{I}_{[0,1)}(u)$. Note that, by the convolution formula, the function $S_m(u)$ is the probability density function of a sum of $m$ independent random variables uniformly distributed on $[0,1]$.
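As a numerical illustration (ours, not part of the text), the standard B-spline can be evaluated either by iterating the convolution (11.1) on a grid or directly through the known closed form for the density of a sum of $m$ independent Uniform$[0,1]$ variables; the sketch below uses the closed form (valid for $m\ge 2$) and also checks the partition of unity (11.4) at a few points.

```python
import numpy as np
from math import comb, factorial

def standard_bspline(u, m):
    """S_m(u), m >= 2: density of the sum of m iid Uniform[0,1] variables."""
    u = np.asarray(u, dtype=float)
    s = np.zeros_like(u)
    for j in range(m + 1):
        s += (-1) ** j * comb(m, j) * np.clip(u - j, 0.0, None) ** (m - 1)
    return s / factorial(m - 1)

m = 3
u = np.linspace(-1, m + 1, 9)
print(standard_bspline(u, m))                       # zero outside (0, m)

# partition of unity (11.4): the sum over integer shifts equals 1
for point in (0.3, 1.7, 2.5):
    total = sum(standard_bspline(point - j, m) for j in range(-m, m + 1))
    print(round(float(total), 12))                  # prints 1.0
```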

Since the splines are piecewise continuous functions, their higher derivatives can have discontinuities at the knots. We make an agreement to define the derivatives as right-continuous functions. This is the reason for using the semi-open interval in (11.1).

It is far from obvious that a standard B-spline meets all the requirements of a piecewise polynomial function of a certain degree of smoothness. Nevertheless, it turns out to be true. The lemmas below describe some analytical properties of standard B-splines.

Lemma 11.1. (i) For any $m\ge 2$,
$$
(11.2)\qquad S_m'(u) = S_{m-1}(u) - S_{m-1}(u-1), \quad u\in\mathbb{R}.
$$
(ii) For any $m\ge 2$, $S_m(u)$ is strictly positive in $(0,m)$ and is equal to zero outside of this interval.

(iii) $S_m(u)$ is symmetric with respect to the midpoint of the interval $(0,m)$, that is,
$$
(11.3)\qquad S_m(u) = S_m(m-u), \quad u\in\mathbb{R}.
$$
(iv) For any $m\ge 1$ and for any $u\in\mathbb{R}$, the following equation (called the partition of unity) holds:
$$
(11.4)\qquad \sum_{j=-\infty}^{\infty}S_m(u-j) = 1.
$$

Proof. (i) Differentiating (11.1) formally with respect to $u$, we obtain
$$
S_m'(u) = \int_{-\infty}^{\infty}S_{m-1}(z)\big[\delta_{\{0\}}(u-z) - \delta_{\{1\}}(u-z)\big]\,dz = S_{m-1}(u) - S_{m-1}(u-1)
$$
where $\delta_{\{a\}}$ is the Dirac delta-function concentrated at $a$.

(ii) This part follows immediately from the definition of the standard B-spline as a probability density.

(iii) If $U_j$ is a random variable uniformly distributed in $[0,1]$, then $1-U_j$ has the same distribution. Hence the probability density of $U_1+\cdots+U_m$ is the same as that of $m-(U_1+\cdots+U_m)$.

(iv) In view of part (ii), for a fixed $u\in\mathbb{R}$, the sum $\sum_{j=-\infty}^{\infty}S_m(u-j)$ has only a finite number of non-zero terms. Using this fact and (11.2), we have
$$
\Big[\sum_{j=-\infty}^{\infty}S_m(u-j)\Big]' = \sum_{j=-\infty}^{\infty}S_m'(u-j)
= \sum_{j=-\infty}^{\infty}\big[S_{m-1}(u-j) - S_{m-1}(u-j-1)\big]
= \sum_{j=-\infty}^{\infty}S_{m-1}(u-j) - \sum_{j=-\infty}^{\infty}S_{m-1}(u-j-1) = 0.
$$
Indeed, the last two sums are both finite, hence they have identical values. Consequently, the sum $\sum_{j=-\infty}^{\infty}S_m(u-j)$ is a constant, say $c$. Since the summation runs over all integers, this sum also equals $\sum_{j=-\infty}^{\infty}S_m(u+j)$, and we can write
$$
c = \int_0^1 c\,du = \int_0^1\Big[\sum_{j=-\infty}^{\infty}S_m(u+j)\Big]du = \sum_{j=0}^{m-1}\int_0^1 S_m(u+j)\,du.
$$
Here we used part (ii) once again, and the fact that the variable of integration $u$ belongs to the unit interval. Continuing, we obtain
$$
c = \sum_{j=0}^{m-1}\int_{j}^{j+1}S_m(v)\,dv = \int_0^m S_m(v)\,dv = 1,
$$
for $S_m(u)$ is a probability density supported on the interval $[0,m]$. $\Box$

Now we try to answer the question: How smooth is the standard B-spline $S_m(u)$? The answer can be found in the following lemma.

Lemma 11.2. For any $m\ge 2$, the standard B-spline $S_m(u)$, $u\in\mathbb{R}$, is a piecewise polynomial of order $m-1$. It has continuous derivatives up to the order $m-2$, and its derivative of order $m-1$ is a piecewise constant function given by the sum
$$
(11.5)\qquad S_m^{(m-1)}(u) = \sum_{j=0}^{m-1}(-1)^j\binom{m-1}{j}\,\mathbb{I}_{[j,j+1)}(u).
$$


Proof. We start by stating the following result. For any $m\ge 2$, the $k$-th derivative of $S_m(u)$ can be written in the form
$$
(11.6)\qquad S_m^{(k)}(u) = \sum_{j=0}^{k}(-1)^j\binom{k}{j}S_{m-k}(u-j), \quad k\le m-1.
$$
The shortest way to verify this identity is to use induction on $k$ starting with (11.2). We leave it as an exercise (see Exercise 11.76).

If $k\le m-2$, then the function $S_{m-k}(u-j)$ is continuous for any $j$. Indeed, all the functions $S_m(u)$, $m\ge 2$, are continuous as the convolutions in (11.1). Thus, by (11.6), as a linear combination of continuous functions, $S_m^{(k)}(u)$, $k\le m-2$, is continuous in $u\in\mathbb{R}$. Also, for $k = m-1$, formula (11.6) yields (11.5).

It remains to show that $S_m(u)$, $u\in\mathbb{R}$, is a piecewise polynomial of order $m-1$. From (11.2), we obtain
$$
(11.7)\qquad S_m(u) = \int_0^u\big[S_{m-1}(z) - S_{m-1}(z-1)\big]\,dz.
$$
Note that by definition, $S_1(u) = \mathbb{I}_{[0,1)}(u)$ is a piecewise polynomial of order zero. By induction, if $S_{m-1}(u)$ is a piecewise polynomial of order at most $m-2$, then so is the integrand in the above formula. Therefore, $S_m(u)$ is a piecewise polynomial of order not exceeding $m-1$. However, by (11.5), the $(m-1)$-st derivative of $S_m(u)$ is non-zero, which proves that $S_m(u)$ has order $m-1$. $\Box$

Remark 11.3. To restore a standard B-spline $S_m(u)$, it suffices to look at (11.5) as a differential equation
$$
(11.8)\qquad \frac{d^{m-1}S_m(u)}{du^{m-1}} = \sum_{j=0}^{m-1}\lambda_j\,\mathbb{I}_{[j,j+1)}(u)
$$
with the constants $\lambda_j$ defined by the right-hand side of (11.5), and to solve it with the zero initial conditions,
$$
S_m(0) = S_m'(0) = \cdots = S_m^{(m-2)}(0) = 0. \qquad\Box
$$

11.3. Shifted B-splines and Power Splines

The results of this section play a central role in our approach to spline approximation. They may look somewhat technical. From now on, we assume that the order $m$ of all the splines under consideration is given and fixed, $m\ge 2$. We start with the definition of the shifted B-splines. Consider the shifted B-splines in the interval $[0,m-1)$,
$$
S_m(u),\ S_m(u-1),\ \dots,\ S_m(u-(m-2)), \quad 0\le u< m-1.
$$

Page 167: Mathematical Statistics - 213.230.96.51:8090

156 11. Estimation by Splines

Let Ls be the linear space generated by the shifted B-splines,

Ls ={LS(u) : LS(u) = a0Sm(u)+a1Sm(u−1)+· · ·+am−2Sm(u−(m−2))

}

where a0, . . . , am−2 are real coefficients. Put a = (a0, . . . , am−2)′ for the

vector of these coefficients.

We need more definitions. Consider the piecewise polynomial functions, called the power splines,
$$
(11.9)\qquad P_k(u) = \frac{1}{(m-1)!}\,(u-k)^{m-1}\,\mathbb{I}(u\ge k), \quad k = 0,\dots,m-2.
$$
Note that we define the power splines on the whole real axis. In what follows, however, we restrict our attention to the interval $[0,m-1)$.

Similar to $\mathcal{L}_s$, introduce the linear space $\mathcal{L}_p$ of functions generated by the power splines,
$$
\mathcal{L}_p = \big\{LP(u):\ LP(u) = b_0P_0(u) + b_1P_1(u) + \cdots + b_{m-2}P_{m-2}(u)\big\}
$$
with the vector of coefficients $\mathbf{b} = (b_0,\dots,b_{m-2})'$.

Lemma 11.4. In the interval $[0,m-1)$, the linear spaces $\mathcal{L}_s$ and $\mathcal{L}_p$ are identical. Moreover, there exists a linear one-to-one correspondence between the coefficients $\mathbf{a}$ and $\mathbf{b}$ in the linear combinations of shifted B-splines and power splines.

Proof. The proof is postponed until Section 11.5. $\Box$

For any particular linear combination $LS\in\mathcal{L}_s$, consider its derivatives at the right-most point $u = m-1$,
$$
\nu_0 = LS^{(0)}(m-1),\ \nu_1 = LS^{(1)}(m-1),\ \dots,\ \nu_{m-2} = LS^{(m-2)}(m-1),
$$
and put $\boldsymbol{\nu} = (\nu_0,\dots,\nu_{m-2})'$. Is it possible to restore the function $LS(u) = a_0S_m(u) + a_1S_m(u-1) + \cdots + a_{m-2}S_m(u-(m-2))$ from these derivatives? In other words, is it possible to restore in a unique way the vector of coefficients $\mathbf{a}$ from $\boldsymbol{\nu}$? As the following lemma shows, the answer is affirmative.

Lemma 11.5. There exists a linear one-to-one correspondence between $\mathbf{a}$ and $\boldsymbol{\nu}$.

Proof. The proof can be found in the last section of the present chapter. $\Box$

Remark 11.6. Though our principal interest lies in the shifted B-splines, we had to involve the power splines for the following reason. The derivatives of the power splines at the right-most point provide the explicit formula (11.19), while for the shifted B-splines this relation is more complex. So, the power splines are just a technical tool for our considerations. $\Box$


Next, let us discuss the following problem. Consider the shifted B-splines from $\mathcal{L}_s$ together with one more spline, $S_m(u-(m-1))$. All these shifted B-splines have a non-trivial effect in the interval $[m-1,m)$. Assume that a polynomial $g(u)$ of degree $m-1$ or less is given in the interval $[m-1,m)$. Can we guarantee a representation of this polynomial in $[m-1,m)$ as a linear combination of the shifted standard B-splines? The answer to this question is revealed in the next two lemmas. The first one explains that all these splines, except the last one, can ensure the approximation of the derivatives of $g(u)$ at $u = m-1$. At this important step, we rely on Lemma 11.5. The last spline $S_m(u-(m-1))$ is used to fit the leading coefficient of $g(u)$. This is done in the second lemma. It is remarkable that the polynomial $g(u)$ not only coincides with the linear combination of B-splines in $[m-1,m)$, but it also controls the maximum of this linear combination for $u\in[0,m-1)$.

Lemma 11.7. Denote by $\nu_0 = g^{(0)}(m-1),\dots,\nu_{m-2} = g^{(m-2)}(m-1)$ the derivatives of the polynomial $g(u)$ at $u = m-1$. There exists a unique linear combination $LS(u) = a_0S_m(u) + a_1S_m(u-1) + \cdots + a_{m-2}S_m(u-(m-2))$ that solves the boundary value problem
$$
LS^{(0)}(m-1) = \nu_0,\ \dots,\ LS^{(m-2)}(m-1) = \nu_{m-2}.
$$
Moreover, there exists a constant $C(m)$ such that
$$
\max_{0\le u\le m-1}|LS(u)| \le C(m)\,\max\big[|\nu_0|,\dots,|\nu_{m-2}|\big].
$$

Proof. In accordance with Lemma 11.5, there exists a linear one-to-one correspondence between $\mathbf{a}$ and $\boldsymbol{\nu}$, which implies the inequality
$$
\max\big[|a_0|,\dots,|a_{m-2}|\big] \le C(m)\,\max\big[|\nu_0|,\dots,|\nu_{m-2}|\big]
$$
with a positive constant $C(m)$. Also, we can write
$$
|LS(u)| = \big|a_0S_m(u) + \cdots + a_{m-2}S_m(u-(m-2))\big|
\le \max\big[|a_0|,\dots,|a_{m-2}|\big]\sum_{j=0}^{m-2}S_m(u-j)
$$
$$
\le \max\big[|a_0|,\dots,|a_{m-2}|\big]\sum_{j=-\infty}^{\infty}S_m(u-j)
\le \max\big[|a_0|,\dots,|a_{m-2}|\big]
\le C(m)\,\max\big[|\nu_0|,\dots,|\nu_{m-2}|\big]
$$
where we applied the partition of unity (11.4). $\Box$

Lemma 11.8. For any polynomial $g(u)$ of order $m-1$, $m-1\le u< m$, there exists a unique linear combination of the shifted standard B-splines
$$
LS^*(u) = a_0S_m(u) + \cdots + a_{m-2}S_m(u-(m-2)) + a_{m-1}S_m(u-(m-1))
$$
such that $LS^*(u) = g(u)$ if $m-1\le u< m$.


Proof. Find $LS(u) = a_0S_m(u) + \cdots + a_{m-2}S_m(u-(m-2))$ such that all the derivatives up to the order $m-2$ of $g(u)$ and $LS(u)$ are identical at $u = m-1$. By Lemma 11.7 such a linear combination exists and is unique. Note that the $(m-1)$-st derivatives of $LS(u)$ and $g(u)$ are constants in $[m-1,m)$. If these constants are different, we can add another B-spline, $LS^*(u) = LS(u) + a_{m-1}S_m(u-(m-1))$. The newly added spline $S_m(u-(m-1))$ does not change $LS(u)$ in $[0,m-1)$. By choosing the coefficient $a_{m-1}$ properly, we can make the $(m-1)$-st derivatives of $LS^*(u)$ and $g(u)$ identical while all the derivatives of $LS(u)$ of smaller orders stay unchanged at $u = m-1$, because $S_m(u-(m-1))$ has all derivatives up to the order $m-2$ equal to zero at $u = m-1$. Figure 8 illustrates the statement of this lemma. $\Box$

Figure 8. The linear combination $LS^*(u)$ coincides with the polynomial $g(u)$ in $[m-1,m)$.

11.4. Estimation of Regression by Splines

For a chosen bandwidth $h_n$, consider an integer number $Q = 1/(2h_n)$ of bins
$$
B_q = \big[2(q-1)h_n,\ 2qh_n\big), \quad q = 1,\dots,Q.
$$
We are supposed to work with a regression function $f(x)$ that belongs to a fixed Hölder class of functions $\Theta(\beta,L,L_1)$. Let $S_\beta(u)$ be the standard B-spline of order $m = \beta$. This parameter $\beta$ will determine the order of all the splines that follow. Sometimes it will be suppressed in the notation.

The number of bins $Q$ increases as the bandwidth $h_n\to 0$, while the order $\beta$ stays constant as $n\to\infty$. That is why it is not restrictive to assume that $Q$ exceeds $\beta$. We make this assumption in this section. In the interval $[0,1]$, we define a set of functions
$$
(11.10)\qquad \gamma_k(x) = h_n^\beta\,S_\beta\Big(\frac{x-2h_nk}{2h_n}\Big)\,\mathbb{I}_{[0,1]}(x), \quad k = -\beta+1,\dots,Q-1.
$$
We call these functions scaled splines or, simply, splines of order $\beta$. To visualize the behavior of the splines, notice that, if we disregard the indicator function $\mathbb{I}_{[0,1]}(x)$, then the identity $\gamma_k(x) = \gamma_{k+1}(x+2h_n)$ holds. It means that as $k$ ranges from $-\beta+1$ to $Q-1$, the functions $\gamma_k(x)$ move from left to right, every time shifting by $2h_n$, the size of one bin. Now we have to restrict the picture to the unit interval, truncating the functions $\gamma_{-\beta+1},\dots,\gamma_{-1}$ below zero and $\gamma_{Q-\beta+1},\dots,\gamma_{Q-1}$ above one. An analogy can be drawn between the performance of $\gamma_k(x)$ as $k$ increases and taking periodic snapshots of a hunchback beast who gradually crawls into the picture from the left, crosses the space, and slowly disappears on the right, still dragging its tail away. Figure 9 contains an illustration of the case $\beta = 3$.

Figure 9. Graphs of the functions $\gamma_k(x)$, $k = -2,\dots,Q-1$, when $\beta = 3$.

From the properties of the standard B-splines, it immediately follows that within each bin $B_q$, the function $\gamma_k(x)$ is a polynomial of order $\beta-1$ with continuous derivatives up to order $\beta-2$. The knots are the endpoints of the bins. Under the assumption that $Q$ is greater than or equal to $\beta$, there is at least one full-sized spline $\gamma_k$ whose support contains all $\beta$ bins. For instance, $\gamma_0$ is such a spline.

The proof of the next lemma is postponed to the end of this chapter.

Lemma 11.9. The set of functions $\{\gamma_k(x),\ k = -\beta+1,\dots,Q-1\}$ forms a basis in the linear sub-space of the smooth piecewise polynomials of order $\beta-1$ that are defined in the bins $B_q$ and have continuous derivatives up to order $\beta-2$. That is, any $\gamma(x)$ in this space admits a unique representation
$$
(11.11)\qquad \gamma(x) = \sum_{k=-\beta+1}^{Q-1}\theta_k\,\gamma_k(x), \quad x\in[0,1],
$$
with some real coefficients $\theta_{-\beta+1},\dots,\theta_{Q-1}$.

Now we return to the regression observations $y_i = f(x_i) + \varepsilon_i$, $i = 1,\dots,n$, where $f(x)$ is a Hölder function in $\Theta(\beta,L,L_1)$. In this section, we pursue the modest goal of the asymptotic analysis of the discrete MISE for the spline approximation of the regression function. We want to prove a result similar to Theorem 10.11, and in particular, an analogue of inequality (10.22). Note that the proof of Theorem 10.11 is heavily based on the relation between the approximation error $\delta_n$ (the bias of the regressogram) in Proposition 10.10 and the bandwidth $h_n$. We need a similar result for approximation by splines.

In the spirit of the general approximation by the functions $\gamma_k(x)$ in the space of observations $\mathbb{R}^n$, we introduce the span-space $S$ as the linear sub-space generated by the vectors
$$
\boldsymbol{\gamma}_k = \big(\gamma_k(x_1),\dots,\gamma_k(x_n)\big)', \quad k = -\beta+1,\dots,Q-1.
$$
Following the agreement of Section 10.4, by $S$ we also denote the operator of the orthogonal projection onto this span-space.

Remark 11.10. Note that $S$ is a linear sub-space of $\mathbb{R}^n$ whose dimension does not exceed $K = Q+\beta-1$. For regular designs and sufficiently large $n$, this dimension equals $K$. But for a particular design, generally speaking, this dimension can be strictly less than $K$. $\Box$

The following lemma is a version of Proposition 10.10 with $\delta_n = O(h_n^\beta)$. Its proof can be found at the end of the present chapter.

Lemma 11.11. There exists a constant $C_0$ independent of $n$ such that for any regression function $f\in\Theta(\beta,L,L_1)$ and for any design $\mathcal{X} = \{x_1,\dots,x_n\}$, we can find $\mathbf{f}_n^* = \big(f_n^*(x_1),\dots,f_n^*(x_n)\big)'$ that belongs to $S$, for which at any design point $x_i$ the following inequality holds:
$$
(11.12)\qquad |f_n^*(x_i) - f(x_i)| \le C_0\,h_n^\beta, \quad i = 1,\dots,n.
$$

Remark 11.12. The vector $\mathbf{f}_n^*$, as any vector in $S$, admits the representation
$$
\mathbf{f}_n^* = \sum_{k=-\beta+1}^{Q-1}\theta_k\,\boldsymbol{\gamma}_k
$$
with some real coefficients $\theta_k$. Hence this vector can also be associated with the function defined by (11.11),
$$
f_n^*(x) = \sum_{k=-\beta+1}^{Q-1}\theta_k\,\gamma_k(x), \quad x\in[0,1].
$$
This representation, which is not necessarily unique, defines a spline approximation of the function $f(x)$. $\Box$

We are ready to extend the result stated in Theorem 10.11 for the regressogram to the approximation by splines.


Theorem 11.13. For any design $\mathcal{X}$, the projection
$$
\hat{\mathbf{f}}_n = S\mathbf{y} = \big(\hat{f}_n(x_1),\dots,\hat{f}_n(x_n)\big)'
$$
of the regression observations $\mathbf{y} = (y_1,\dots,y_n)'$ onto the span-space $S$ generated by the splines of order $\beta$ admits the upper bound of the discrete $L_2$-norm risk
$$
(11.13)\qquad \mathbb{E}_f\Big[\frac1n\sum_{i=1}^{n}\big(\hat{f}_n(x_i)-f(x_i)\big)^2 \,\Big|\, \mathcal{X}\Big]
\le C_1h_n^{2\beta} + \frac{\sigma^2(Q+\beta-1)}{n}.
$$
Moreover, under the optimal choice of the bandwidth $h_n = h_n^* = n^{-1/(2\beta+1)}$, the following upper bound holds:
$$
(11.14)\qquad \mathbb{E}_f\Big[\frac1n\sum_{i=1}^{n}\big(\hat{f}_n(x_i)-f(x_i)\big)^2 \,\Big|\, \mathcal{X}\Big]
\le r^*\,n^{-2\beta/(2\beta+1)}.
$$
In the above, the constants $C_1$ and $r^*$ are positive and independent of $n$ and $f\in\Theta(\beta,L,L_1)$.

Proof. The result follows immediately from the bound (11.12) on the approximation error by splines (cf. the proof of Theorem 10.11). $\Box$

Remark 11.14. With the splines $\gamma_k(x)$ of this section, we could introduce the design matrix with column vectors (10.16) as well as the system of normal equations (10.17). In the case of B-splines, however, the system of normal equations does not partition into sub-systems as was the case for the regressogram. This makes the asymptotic analysis of spline approximation technically more challenging than that of the regressogram. In particular, an analogue of Proposition 10.2 with explicit control over the bias and the stochastic terms goes beyond the scope of this book. $\Box$
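As a computational companion to Theorem 11.13 (our sketch, not the authors' code), the spline design matrix can be assembled from the scaled splines (11.10), using the closed-form expression for $S_\beta$ as the density of a sum of $\beta$ uniforms, and the projection $S\mathbf{y}$ obtained by least squares, exactly as for the regressogram in Section 10.4. All names below are ours.

```python
import numpy as np
from math import comb, factorial

def standard_bspline(u, m):
    """S_m(u), m >= 2: density of the sum of m iid Uniform[0,1] variables."""
    u = np.asarray(u, dtype=float)
    s = sum((-1) ** j * comb(m, j) * np.clip(u - j, 0.0, None) ** (m - 1)
            for j in range(m + 1))
    return s / factorial(m - 1)

def spline_design_matrix(x, beta, Q):
    """Columns gamma_k(x), k = -beta+1, ..., Q-1, as in (11.10)."""
    h = 1.0 / (2 * Q)
    cols = [h ** beta * standard_bspline((x - 2 * h * k) / (2 * h), beta)
            for k in range(-beta + 1, Q)]
    return np.column_stack(cols)          # shape (n, Q + beta - 1)

rng = np.random.default_rng(2)
n, beta, Q, sigma = 400, 3, 8, 0.3
x = np.sort(rng.uniform(0, 1, n))
f = np.cos(3 * np.pi * x)                 # toy smooth regression function
y = f + sigma * rng.normal(size=n)

G = spline_design_matrix(x, beta, Q)
theta_hat, *_ = np.linalg.lstsq(G, y, rcond=None)   # minimum-norm LS solution
f_hat = G @ theta_hat                               # projection S y
print("discrete MISE:", np.mean((f_hat - f) ** 2))
```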

11.5. Proofs of Technical Lemmas

Proof of Lemma 11.4. It is easy to show (see Exercise 11.77) that, according to (11.5), the $(m-1)$-st derivative of $LS\in\mathcal{L}_s$ is a piecewise constant function
$$
(11.15)\qquad LS^{(m-1)}(u) = \lambda_j, \quad\text{if } j\le u< j+1,
$$
where for $j = 0,\dots,m-2$,
$$
(11.16)\qquad \lambda_j = \sum_{i=0}^{j}a_i(-1)^{j-i}\binom{m-1}{j-i}
= a_0(-1)^j\binom{m-1}{j} + a_1(-1)^{j-1}\binom{m-1}{j-1} + \cdots + a_{j-1}(-1)\binom{m-1}{1} + a_j\binom{m-1}{0}.
$$
On the other hand, any power spline $LP\in\mathcal{L}_p$ also has a piecewise constant $(m-1)$-st derivative
$$
(11.17)\qquad LP^{(m-1)}(u) = \lambda_j, \quad\text{if } j\le u< j+1,
$$
with
$$
(11.18)\qquad \lambda_j = b_0 + \cdots + b_j, \quad j = 0,\dots,m-2.
$$
In (11.15) and (11.17), we have deliberately denoted the $(m-1)$-st derivatives by the same $\lambda_j$'s because we mean them to be identical. Introduce the vector $\boldsymbol{\lambda} = (\lambda_0,\dots,\lambda_{m-2})'$. If we look at (11.16) and (11.18) as systems of linear equations for $\mathbf{a}$ and $\mathbf{b}$, respectively, we find that the matrices of these systems are lower triangular with non-zero diagonal elements. Hence, these systems establish a linear one-to-one correspondence between $\mathbf{a}$ and $\boldsymbol{\lambda}$, on the one hand, and between $\boldsymbol{\lambda}$ and $\mathbf{b}$, on the other hand. Thus, there exists a linear one-to-one correspondence between $\mathbf{a}$ and $\mathbf{b}$. $\Box$

Proof of Lemma 11.5. Applying Lemma 11.4, we can find a linear combination of the power splines such that
$$
LS(u) = LP(u) = b_0P_0(u) + b_1P_1(u) + \cdots + b_{m-2}P_{m-2}(u)
$$
$$
= b_0\frac{u^{m-1}}{(m-1)!}\,\mathbb{I}_{[0,m-1)}(u) + b_1\frac{(u-1)^{m-1}}{(m-1)!}\,\mathbb{I}_{[1,m-1)}(u)
+ \cdots + b_{m-2}\frac{(u-(m-2))^{m-1}}{(m-1)!}\,\mathbb{I}_{[m-2,m-1)}(u).
$$
The derivatives of the latter combination, $\nu_j = LP^{(j)}(m-1)$, at the right-most point $u = m-1$ are computable explicitly (see Exercise 11.78),
$$
(11.19)\qquad \nu_j = b_0\frac{(m-1)^{m-j-1}}{(m-j-1)!} + b_1\frac{(m-2)^{m-j-1}}{(m-j-1)!} + \cdots + b_{m-2}\frac{1^{m-j-1}}{(m-j-1)!}.
$$
If we manage to restore the coefficients $\mathbf{b}$ from the derivatives $\boldsymbol{\nu}$, then by Lemma 11.4 we would prove the claim. Consider (11.19) as a system of linear equations. Then the matrix $M$ of this system is an $(m-1)\times(m-1)$ matrix with the elements
$$
(11.20)\qquad M_{j,k} = \frac{(m-k-1)^{m-j-1}}{(m-j-1)!}, \quad j,k = 0,\dots,m-2.
$$
The matrix $M$ is invertible because its determinant is non-zero (see Exercise 11.79). Thus, the lemma follows. $\Box$

Proof of Lemma 11.9. As shown above, the dimension of the space of smooth piecewise polynomials of order no greater than $\beta-1$ equals $Q+\beta-1$, which matches the number of functions $\gamma_k(x)$. Thus, the only question is about the linear independence of the functions $\gamma_k(x)$, $x\in[0,1]$. Consider the functions $\gamma_k(x)$ for $k = -\beta+1,\dots,Q-\beta$. In this set, each consecutive function has a support that contains a new bin not included in the union of all the previous supports. That is why the $\gamma_k(x)$'s are linearly independent for $k = -\beta+1,\dots,Q-\beta$. Hence, the linear combinations
$$
\mathcal{L}_1 = \big\{a_{-\beta+1}\gamma_{-\beta+1}(x) + \cdots + a_{Q-\beta}\gamma_{Q-\beta}(x)\ \big|\ a_{-\beta+1},\dots,a_{Q-\beta}\in\mathbb{R}\big\}
$$
form a linear space of functions of dimension $Q$. A similar argument shows that the linear combinations of the remaining splines
$$
\mathcal{L}_2 = \big\{a_{Q-\beta+1}\gamma_{Q-\beta+1}(x) + \cdots + a_{Q-1}\gamma_{Q-1}(x)\ \big|\ a_{Q-\beta+1},\dots,a_{Q-1}\in\mathbb{R}\big\}
$$
form a linear space of dimension $\beta-1$.

Since the supports of the functions from $\mathcal{L}_1$ cover the whole semi-open interval $[0,1)$, the "support" argument does not prove the independence of $\mathcal{L}_1$ and $\mathcal{L}_2$. We have to show that these spaces intersect only at the origin. Indeed, by the definition of the standard B-splines, the first $\beta-2$ derivatives of any function from $\mathcal{L}_1$ at $x = 1$ are zeros. On the other hand, by Lemma 11.7, a function from $\mathcal{L}_2$ has all its first $\beta-2$ derivatives equal to zero if and only if all its coefficients are zeros, $a_{Q-\beta+1} = \cdots = a_{Q-1} = 0$. Thus, the zero function is the only one that simultaneously belongs to $\mathcal{L}_1$ and $\mathcal{L}_2$. $\Box$

Proof of Lemma 11.11. Put $q(l) = 1+\beta l$. Consider all the bins $B_{q(l)}$ with $l = 0,\dots,(Q-1)/\beta$. Without loss of generality, we assume that $(Q-1)/\beta$ is an integer, so that the last bin $B_Q$ belongs to this subsequence. Note that the indices of the bins in the subsequence $B_{q(l)}$ are equal to 1 modulo $\beta$, and that any two consecutive bins $B_{q(l)}$ and $B_{q(l+1)}$ in this subsequence are separated by $\beta-1$ original bins.

Let $x_l = 2(q(l)-1)h_n$ denote the left endpoint of the bin $B_{q(l)}$, $l = 0,1,\dots$. For any regression function $f\in\Theta(\beta,L,L_1)$, introduce the Taylor expansion of $f(x)$ around $x = x_l$,
$$
(11.21)\qquad \pi_l(x) = f(x_l) + \frac{f^{(1)}(x_l)}{1!}(x-x_l) + \cdots + \frac{f^{(\beta-1)}(x_l)}{(\beta-1)!}(x-x_l)^{\beta-1}, \quad x\in B_{q(l)}.
$$
In accordance with Lemma 11.8, for any $l$, there exists a linear combination of the splines that coincides with $\pi_l(x)$ in $B_{q(l)}$. It implies that
$$
\pi_l(x) = a_{q(l)-\beta}\,\gamma_{q(l)-\beta}(x) + \cdots + a_{q(l)-1}\,\gamma_{q(l)-1}(x), \quad x\in B_{q(l)},
$$
with some uniquely defined real coefficients $a_k$, $k = q(l)-\beta,\dots,q(l)-1$. Note that as $l$ runs from 1 to $(Q-1)/\beta$, each of the splines $\gamma_k(x)$, $k = -\beta+1,\dots,Q-1$, participates exactly once in these linear combinations.


Consider the sum
$$
\gamma(x) = \sum_{1\le l\le (Q-1)/\beta}\big[a_{q(l)-\beta}\,\gamma_{q(l)-\beta}(x) + \cdots + a_{q(l)-1}\,\gamma_{q(l)-1}(x)\big]
= \sum_{k=-\beta+1}^{Q-1}a_k\,\gamma_k(x), \quad 0\le x\le 1.
$$
This function $\gamma(x)$ defines a piecewise polynomial of order at most $\beta-1$ that coincides with the Taylor polynomial (11.21) in all the bins $B_{q(l)}$ (see Figure 10). Hence, in the union of these bins, $\bigcup_l B_{q(l)}$, the function $\gamma(x)$ does not deviate from $f(x)$ by more than $O(h_n^\beta)$, this magnitude being preserved uniformly over $f\in\Theta(\beta,L,L_1)$.

Next, how close is $\gamma(x)$ to $f(x)$ in the rest of the unit interval? We want to show that the same magnitude holds for all $x\in[0,1]$, that is,
$$
(11.22)\qquad \max_{0\le x\le 1}\big|\gamma(x)-f(x)\big| \le C_1\,h_n^\beta
$$
with a constant $C_1$ independent of $f\in\Theta(\beta,L,L_1)$.

Figure 10. Schematic graphs of the functions $\gamma(x)$ and $\Delta\gamma_1(x)$ for $x$ lying in bins $B_1$ through $B_{1+\beta}$.

Clearly, it is sufficient to estimate the absolute value $|\gamma(x)-f(x)|$ in the gap between two consecutive bins $B_{q(l)}$ and $B_{q(l+1)}$. Consider the interval $[x_l,\ x_{l+1}+2h_n)$. It covers all the bins from $B_{q(l)}$ to $B_{q(l+1)}$, inclusively. The length of this interval is $2h_n(\beta+1)$. Hence, the regression function $f(x)$ does not deviate from its Taylor approximation $\pi_l(x)$ in this interval by more than $O(h_n^\beta)$ uniformly over the Hölder class. Thus, to verify (11.22), it is enough to check the magnitude of the difference $\Delta\gamma_l(x) = \gamma(x) - \pi_l(x)$, $x\in[x_l,\ x_{l+1}+2h_n)$. Note that this difference is a piecewise polynomial of order at most $\beta-1$ in the bins. In particular, it is the zero function for $x\in B_{q(l)}$, and is equal to $\Delta\pi_l(x) = \pi_{l+1}(x)-\pi_l(x)$ for $x\in B_{q(l+1)}$ (see Figure 10).

We want to rescale $\Delta\pi_l(x)$ to bring it to the scale of the integer bins of unit length. Put
$$
g(u) = h_n^{-\beta}\,\Delta\pi_l\big(x_l + 2h_n(u+1)\big) \quad\text{with } 0\le u\le \beta-1,
$$
so that $u = 0$ corresponds to the left endpoint of $B_{q(l)}$ and $u = \beta-1$ corresponds to the left endpoint of $B_{q(l+1)}$. Next, we compute the derivatives of $g(u)$ at $u = \beta-1$,
$$
\nu_j = \frac{d^j}{du^j}g(\beta-1) = h_n^{-\beta}(2h_n)^j\,\frac{d^j}{dx^j}\Delta\pi_l(x_{l+1})
$$
$$
= 2^j h_n^{j-\beta}\Big[f^{(j)}(x_{l+1}) - \Big(f^{(j)}(x_l) + \frac{f^{(j+1)}(x_l)}{1!}(x_{l+1}-x_l) + \cdots + \frac{f^{(\beta-1)}(x_l)}{(\beta-1-j)!}(x_{l+1}-x_l)^{\beta-1-j}\Big)\Big].
$$
Note that the expression in the brackets on the right-hand side is the remainder term of the Taylor expansion of the $j$-th derivative $f^{(j)}(x_{l+1})$ around $x_l$. If $f\in\Theta(\beta,L,L_1)$, then $f^{(j)}$ belongs to the Hölder class $\Theta(\beta-j,L,L_2)$ with some positive constant $L_2$ (see Exercise 11.81). Similarly to Lemma 10.2, this remainder term has the magnitude $O(|x_{l+1}-x_l|^{\beta-j}) = O(h_n^{\beta-j})$.

Thus, in the notation of Lemma 11.7, $\max\big[|\nu_0|,\dots,|\nu_{\beta-1}|\big]\le C_1$ where the constant $C_1$ depends on neither $n$ nor $l$. From Lemma 11.7, the unique spline of order $\beta$ with zero derivatives at $u = 0$ and the given derivatives $\nu_j$ at $u = \beta-1$ is uniformly bounded for $0\le u\le \beta-1$. Since this is true for any $l$, we can conclude that $|g(u)|\le C_2 = C(\beta)\,C_1$ at all $u$ where this function is defined. The constant $C(\beta)$ is introduced in Lemma 11.7 with $m = \beta$. So, we have proved that $\max_{0\le x\le 1}|\gamma(x)-f(x)| = O(h_n^\beta)$, which implies (11.12). $\Box$


Exercises

Exercise 11.75. Find explicitly the standard B-splines $S_2$ and $S_3$. Graph these functions.

Exercise 11.76. Prove (11.6).

Exercise 11.77. Prove (11.16).

Exercise 11.78. Prove (11.19).

Exercise 11.79. Show that the determinant $\det M$ of the matrix $M$ with the elements defined by (11.20) is non-zero. Hint: Show that this determinant is proportional to the determinant of the generalized Vandermonde matrix
$$
V_m = \begin{bmatrix} x_1 & x_1^2 & \dots & x_1^m\\ x_2 & x_2^2 & \dots & x_2^m\\ & & \dots & \\ x_m & x_m^2 & \dots & x_m^m\end{bmatrix}
$$
with distinct $x_1,\dots,x_m$. Look at $\det V_m$ as a function of $x_m$. If $x_m$ equals either $x_1$, or $x_2,\dots$, or $x_{m-1}$, then the determinant is zero. Consequently,
$$
\det V_m = v(x_1,x_2,\dots,x_m)\,(x_m-x_1)(x_m-x_2)\cdots(x_m-x_{m-1})
$$
for some function $v$. Now expand along the last row to see that the determinant is a polynomial in $x_m$ of order $m$ with the highest coefficient equal to $\det V_{m-1}$. Thus, the recursive formula holds:
$$
\det V_m = \det V_{m-1}\,x_m(x_m-x_1)(x_m-x_2)\cdots(x_m-x_{m-1}), \qquad \det V_1 = x_1.
$$
From here deduce that $\det V_m = x_1x_2\cdots x_m\prod_{i<j}(x_j-x_i) \ne 0$.

Exercise 11.80. Prove a statement similar to Lemma 11.8 for the power splines. Show that for any polynomial $g(u)$ of order $m-1$ in the interval $m-1\le u< m$, there exists a unique linear combination $LP^*(u)$ of power splines
$$
LP^*(u) = b_0P_0(u) + b_1P_1(u) + \cdots + b_{m-2}P_{m-2}(u) + b_{m-1}P_{m-1}(u)
$$
such that $LP^*(u) = g(u)$ if $m-1\le u< m$. Apply this result to represent the function $g(u) = 2-u^2$ in the interval $[2,3)$ by the power splines.

Exercise 11.81. Show that if $f\in\Theta(\beta,L,L_1)$, then $f^{(j)}$ belongs to the Hölder class $\Theta(\beta-j,L,L_2)$ with some positive constant $L_2$.


Chapter 12

Asymptotic Optimality in Global Norms

In Chapter 10, we studied the regression estimation problem for the integral $L_2$-norm and the sup-norm risks. The upper bounds in Theorems 10.3 and 10.6 guarantee the rates of convergence in these norms $n^{-\beta/(2\beta+1)}$ and $\big(n/(\ln n)\big)^{-\beta/(2\beta+1)}$, respectively. These rates hold for any function $f$ in the Hölder class $\Theta(\beta,L,L_1)$, and they are attained by the regressogram with properly chosen bandwidths.

The question that we address in this chapter is whether these rates can be improved by any other estimators. The answer turns out to be negative. We will prove the lower bounds that show the minimax optimality of the regressogram.

12.1. Lower Bound in the Sup-Norm

In this section we prove that the rate
$$
(12.1)\qquad \psi_n = \Big(\frac{\ln n}{n}\Big)^{\beta/(2\beta+1)}
$$
is the minimax rate of convergence in the Hölder class of regression functions $\Theta(\beta) = \Theta(\beta,L,L_1)$ if the losses are measured in the sup-norm. The theorem below is another example of a "lower bound" (see Theorems 9.16 and 9.17). As explained in Section 9.4, the lower bound may not hold for an arbitrary design. We use the definitions of regular deterministic and random designs as in Sections 7.4 and 9.3.


Theorem 12.1. Let the deterministic design $\mathcal{X}$ be defined by (7.17) with a continuous and strictly positive density $p(x)$ on $[0,1]$. Then for all large $n$ and for any estimator $\hat{f}_n$ of the regression function $f$, the following inequality holds:
$$
(12.2)\qquad \sup_{f\in\Theta(\beta)}\mathbb{E}_f\big[\psi_n^{-1}\|\hat{f}_n - f\|_\infty\big] \ge r_*
$$
where $\psi_n$ is defined in (12.1), and the positive constant $r_*$ is independent of $n$.

Proof. As in the proof of Theorem 9.16, take any "bump" function $\varphi(t)$, $t\in\mathbb{R}$, such that $\varphi(0) > 0$, $\varphi(t) = 0$ if $|t| > 1$, and $|\varphi^{(\beta)}(t)|\le L$. Clearly, this function has a finite $L_2$-norm, $\|\varphi\|_2 < \infty$. We want this norm to be small; therefore, below we make an appropriate choice of $\varphi(t)$. Take the bandwidth $h_n^* = \big((\ln n)/n\big)^{1/(2\beta+1)}$, and consider the bins
$$
B_q = \big[2h_n^*(q-1),\ 2h_n^*q\big), \quad q = 1,\dots,Q,
$$
where we assume without loss of generality that $Q = 1/(2h_n^*)$ is an integer.

Introduce the test functions $f_0(t) = 0$ and
$$
(12.3)\qquad f_q(t) = (h_n^*)^\beta\,\varphi\Big(\frac{t-c_q}{h_n^*}\Big), \quad t\in[0,1],\ q = 1,\dots,Q,
$$
where $c_q$ is the center of the bin $B_q$. Note that each function $f_q(t)$ takes non-zero values only within the respective bin $B_q$. For any small enough $h_n^*$, the function $f_q$ belongs to the Hölder class $\Theta(\beta,L,L_1)$. This fact was explained in the proof of Theorem 9.16.
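To make the construction (12.3) tangible, the following sketch (ours; the particular bump is one arbitrary choice satisfying the stated properties up to a rescaling to respect $L$) builds a compactly supported smooth bump $\varphi$ and the $Q$ test functions $f_q$ on a grid.

```python
import numpy as np

def bump(t):
    """Smooth bump: positive at 0, vanishing (with all derivatives) for |t| >= 1."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    inside = np.abs(t) < 1
    out[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
    return out

beta, n = 2, 10_000
h = (np.log(n) / n) ** (1.0 / (2 * beta + 1))   # bandwidth h_n^*
Q = int(1.0 / (2 * h))                          # number of bins (integer part)
t = np.linspace(0.0, 1.0, 2001)

centers = [(2 * q - 1) * h for q in range(1, Q + 1)]
test_functions = [h ** beta * bump((t - c) / h) for c in centers]
print(Q, float(max(test_functions[0])))         # each f_q lives in its own bin
```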

Recall that under the hypothesis $f = f_q$, the observations $y_i$ in the nonparametric regression model satisfy the equation
$$
y_i = f_q(x_i) + \varepsilon_i, \quad i = 1,\dots,n,
$$
where the $x_i$'s are the design points and the $\varepsilon_i$'s are independent $\mathcal{N}(0,\sigma^2)$-random variables. Put
$$
(12.4)\qquad d_0 = \frac12\,(h_n^*)^\beta\,\|\varphi\|_\infty > 0.
$$
Note that by definition,
$$
(12.5)\qquad \|f_l - f_q\|_\infty = 2d_0, \quad 1\le l< q\le Q,
$$
and
$$
(12.6)\qquad \|f_q\|_\infty = \|f_q - f_0\|_\infty = 2d_0, \quad q = 1,\dots,Q.
$$
Introduce the random events
$$
D_q = \big\{\|\hat{f}_n - f_q\|_\infty \ge d_0\big\}, \quad q = 0,\dots,Q.
$$


Observe that for any $q$, $1\le q\le Q$, the inclusion $\overline{D}_0\subseteq D_q$ takes place, where $\overline{D}_0$ denotes the complement of $D_0$. Indeed, by the triangle inequality, if $\hat{f}_n$ is closer to $f_0 = 0$ than $d_0$, that is, if $\|\hat{f}_n\|_\infty < d_0$, then it deviates from any $f_q$ by no less than $d_0$,
$$
\|\hat{f}_n - f_q\|_\infty \ge \|f_q\|_\infty - \|\hat{f}_n\|_\infty = 2d_0 - \|\hat{f}_n\|_\infty \ge d_0, \quad q = 1,\dots,Q.
$$

Further, we will need the following lemma. We postpone its proof to the end of the section.

Lemma 12.2. Under the assumptions of Theorem 12.1, for any small $\delta > 0$, there exists a constant $c_0 > 0$ such that if $\|\varphi\|_2^2\le c_0$, then for all large $n$,
$$
\max_{0\le q\le Q}\mathbb{P}_{f_q}\big(D_q\big) \ge \frac12\,(1-\delta).
$$

Now, we apply Lemma 12.2 to find that for all $n$ large enough, the following inequalities hold:
$$
\sup_{f\in\Theta(\beta)}\mathbb{E}_f\big[\|\hat{f}_n - f\|_\infty\big]
\ge \max_{0\le q\le Q}\mathbb{E}_{f_q}\big[\|\hat{f}_n - f_q\|_\infty\big]
\ge d_0\max_{0\le q\le Q}\mathbb{P}_{f_q}\big(\|\hat{f}_n - f_q\|_\infty\ge d_0\big)
= d_0\max_{0\le q\le Q}\mathbb{P}_{f_q}\big(D_q\big)
$$
$$
\ge \frac12\,d_0(1-\delta) = \frac14\,(h_n^*)^\beta\,\|\varphi\|_\infty(1-\delta),
$$
and we can choose $r_* = (1/4)\,\|\varphi\|_\infty(1-\delta)$. $\Box$

Remark 12.3. Contrast the proof of Theorem 12.1 with that of Theorem 9.16. The proof of the latter theorem was based on two hypotheses, $f = f_0$ or $f = f_1$, with a likelihood ratio that stayed finite as $n\to\infty$. In the sup-norm, however, the proof of the rate of convergence is complicated by the extra log-factor, which prohibits using the same idea. The likelihood ratios in the proof of Theorem 12.1 vanish as $n\to\infty$. To counterweigh that fact, a growing number of hypotheses is selected. Note that the number of hypotheses $Q+1\le n^{1/(2\beta+1)}$ has a polynomial rate of growth as $n$ goes to infinity. $\Box$

The next theorem handles the case of a random design. It shows that ifthe random design is regular, then the rate of convergence of the sup-normrisk is the same as that in the deterministic case. Since the random designcan be “very bad” with a positive probability, the conditional risk for givendesign points does not guarantee even the consistency of estimators. Thatis why we study the unconditional risks. The proof of the theorem below isleft as an exercise (see Exercise 12.83).

Theorem 12.4. Let X be a random design such that that design pointsxi are independent with a continuous and strictly positive density p(x) on

Page 181: Mathematical Statistics - 213.230.96.51:8090

170 12. Asymptotic Optimality in Global Norms

[0, 1]. Then for all sufficiently large n, and for any estimator fn(x) of theregression function f(x), the following inequality holds:

(12.7) supf ∈Θ(β)

Ef

[ψ−1n ‖ fn − f ‖∞

]≥ c0

with a positive constant c0 independent of n.

Proof of Lemma 12.2. From the inclusion D0 ⊆ Dq, which holds for anyq = 1, . . . , Q, we have that

max0≤ q≤Q

Pq

(Dq

)= max

(P0

(D0

), max1≤ q≤Q

Pq

(Dq

) )

≥ 1

2

(P0

(D0

)+ max

1≤ q≤QPq

(D0

) )

≥ 1

2

(P0

(D0

)+

1

Q

Q∑

q=1

E0

[I(D0

)exp

{Ln, q

} ] )

(12.8) =1

2

(P0

(D0

)+ E0

[I(D0

) 1

Q

Q∑

q=1

exp{Ln, q

} ] ).

In the above, by Ln, q we denoted the log-likelihood ratios

Ln, q = lndPq

dP0, q = 1, . . . , Q.

They admit the asymptotic representation

(12.9) Ln, q = σn, q Nn, q − 1

2σ2n, q

where for every q = 1, . . . , Q,

(12.10) σ2n, q = σ−2

n∑

i=1

f2q (xi) = n (h∗n)

2β+1 σ−2p(cq) ‖ϕ ‖22(1 + on(1)

)

with on(1) vanishing as n → ∞ uniformly in q. The random variables Nn, q

in (12.9) are standard normal and independent for different q.

Let p∗ denote the maximum of the density p(x), p∗ = max0≤x≤ 1 p(x).Recall that (h∗n)

2β+1 = (lnn)/n . Thus, if n is large enough, then

(12.11) σ2n, q ≤ 2σ−2 p∗ c0 lnn = c1 lnn

where the constant c1 = 2σ−2p∗c0 is small if c0 is small. Note that theconstant c1 is independent of q.

Put ξn = Q−1∑Q

q=1 exp{Ln, q

}. The first and the second moments of ξn

are easily computable. Indeed, since by definition, E0

[exp

{Ln, q

} ]= 1, we

Page 182: Mathematical Statistics - 213.230.96.51:8090

12.2. Bound in L2-Norm. Assouad’s Lemma 171

have that E0

[ξn]= 1. Applying the independence of the random variables

Nn,q for different q, we find that

E0

[ξ2n]= Q−2

Q∑

q=1

E0

[exp

{2Ln, q

} ]

= Q−2Q∑

q=1

E0

[exp

{2σn, q Nn, q − σn, q

} ]

= Q−2Q∑

q=1

exp{σ2n, q} ≤ Q−1ec1 lnn = Q−1 nc1 = 2h∗n n

c1 .

If we now take c0 so small that c1 = 2σ−2p∗c0 = 1/(2β+1)−ε for some small

ε > 0, then E0

[ξ2n]≤ 2h∗nn

c1 = 2(lnn)1/(2β+1)n−ε can be chosen arbitrarilysmall for sufficiently large n.

Next, by the Chebyshev inequality, we have

P0

(D0

)+ E0

[I(D0

)ξn]≥ (1− δ0)P0

(ξn ≥ 1− δ0

)

≥ (1− δ0)P0

(| ξn − 1 | ≤ δ0

)= (1− δ0)

(1 − P0

(| ξn − 1 | > δ0

) )

≥ (1− δ0)(1 − δ−2

0 E0

[ξ2n] )

→ 1− δ0 ≥ 1− δ if δ0 < δ.

Plugging this expression into (12.8), we obtain the result of the lemma. �

12.2. Bound in L2-Norm. Assouad’s Lemma

To prove the lower bound in the L2-norm, a more elaborate construction isrequired as compared to estimation at a point (Section 9.3) or in the sup-norm (Section 12.1). The method we use here is a modified version of whatis known in nonparametric statistics as Assouad’s Lemma. This method canbe relatively easily explained if we start with the definitions similar to thosegiven for the result at a fixed point in Theorem 9.16.

We will proceed under the assumptions that the design points are deter-ministic, regular and controlled by a density p(x) which is continuous andstrictly positive in [0, 1]. As in Section 9.3, take a function ϕ(u) ≥ 0, u ∈ R,that satisfies all the properties mentioned in the that section. The key prop-erties are that this function is smooth and its support is [−1, 1].

In the proof of Theorem 9.16, we defined the two test functions f0(t)and f1(t), t ∈ [0, 1]. To extend this definition to the L2-norm, consider Qbins B1, . . . , BQ, centered at cq, q = 1, . . . , Q, each of the length 2h∗n where

h∗n = n−1/(2β+1). Without loss of generality, Q = 1/(2h∗n) is an integer.Denote by ΩQ a set of Q-dimensional binary vectors

ΩQ ={ω : ω = (ω1, . . . , ωQ), ωq ∈ {0, 1} , q = 1, . . . , Q

}.

Page 183: Mathematical Statistics - 213.230.96.51:8090

172 12. Asymptotic Optimality in Global Norms

The number of elements in ΩQ is equal to 2Q. To study the lower bound inthe L2-norm, define 2Q test functions by

(12.12) f(t, ω) = ω1

(h∗n)β

ϕ( t− c1

h∗n

)+ · · ·+ ωQ

(h∗n)β

ϕ(t− cQ

h∗n

)

where the variable t belongs to the interval [0, 1] and ω ∈ ΩQ. A properchoice of ϕ(u) guarantees that each function f(t, ω) belongs to the Holderclass Θ(β, L, L1).

Before continuing we introduce more notation. Define by Yq the σ-algebra generated by the regression observations yi = f(xi) + εi with the

design points xi ∈ Bq, q = 1, . . . , Q. For any estimator fn of the regressionfunction, we define the conditional expectation

fn, q = Ef

[fn∣∣Yq

].

Note that fn, q = fn, q(t) depends only on the observations within the binBq.

For the sake of brevity, below we denote the conditional expectationEf(·,ω)[ · | X ] by Eω[ · ], suppressing dependence on the test function and thedesign.

By the definition of the L2-norm, we obtain that

[‖ fn(·)− f(·, ω) ‖22

]=

Q∑

q=1

[‖ fn(·)− f(·, ω) ‖22, Bq

].

Lemma 12.5. For any estimator fn of the regression function f(t, ω) thefollowing inequality holds:

[‖ fn(·)− f(·, ω) ‖22, Bq

]≥ Eω

[‖ fn, q(·)− f(·, ω) ‖22, Bq

]

(12.13) = Eω

[ ∫

Bq

(fn, q(t)− ωq

(h∗n)β

ϕ(t− cq

h∗n

))2dt].

Proof. First conditioning on the σ-algebra Yq, and then applying Jensen’sinequality to the convex quadratic function, we obtain that for any q =1, . . . , Q,

[‖ fn(·)− f(·, ω) ‖22, Bq

]= Eω

[Eω

[‖ fn(·)− f(·, ω) ‖22, Bq

∣∣Yq

] ]

≥ Eω

[‖Eω

[fn(·)

∣∣Yq

]− f(·, ω) ‖22, Bq

]

= Eω

[‖ fn, q(·)− f(·, ω) ‖22, Bq

](by definition of fn, q)

= Eω

[ ∫

Bq

(fn, q(t)− ωq

(h∗n)βϕ(t− cq

h∗n

))2dt].

Page 184: Mathematical Statistics - 213.230.96.51:8090

12.2. Bound in L2-Norm. Assouad’s Lemma 173

At the last step we used the definition of the function f(t, ω) in the binBq. �

In (12.13), the function fn, q(t) depends only on the regression observa-tions with the design points in Bq. We will denote the expectation relativeto these observations by Eωq . We know that Eωq is computed with respectto one of the two probability measures P{ωq=0} or P{ωq=1}. These measuresare controlled entirely by the performance of the test function f(·, ω) in thebin Bq.

Lemma 12.6. There exists a constant r0, which depends only on the designdensity p and the chosen function ϕ, such that for any q, 1 ≤ q ≤ Q, andfor any Yq-measurable estimator fn, q, the following inequality holds:

maxωq ∈{0, 1}

Eωq

[ ∫

Bq

(fn, q(t)− ωq

(h∗n)βϕ(t− cq

h∗n

))2dt]≥ r0/n.

Proof. We proceed as in the proof of Theorem 9.16. At any fixed t, t ∈ Bq,we obtain that

maxωq∈{0,1}

Eωq

[ ∫

Bq

(fn, q(t)− ωq

(h∗n)βϕ(t− cq

h∗n

))2dt]

≥ 1

2E{ωq=0}

[ ∫

Bq

f2n,q(t)dt

]+

1

2E{ωq=1}

[ ∫

Bq

(fn,q(t)−

(h∗n)βϕ(t− cq

h∗n

))2dt]

(12.14)

=1

2E{ωq =0}

[ ∫

Bq

f2n, q(t) dt+

dP{ωq =1}dP{ωq =0}

Bq

(fn, q−

(h∗n)βϕ(t− cq

h∗n

))2dt]

where

lndP{ωq=1}dP{ωq=0}

= σn, q Nq − 1

2σ2n, q

with a standard normal random variable Nq and

limn→∞

σ2n, q = σ−2 p(cq) ‖ϕ ‖22.

For all large n and any q = 1, . . . , Q , the standard deviation σn, q is separatedaway from zero and infinity. Hence,

P{ωq=0}(Nq > σn, q/2

)≥ p0

for a positive constant p0 independent of n and q. If the random event{Nq > σn, q/2 } holds, then we can estimate the likelihood ratio on theright-hand side of (12.14) from below by 1.

Next, note that for any functions fn and g, the inequality is true ‖fn‖22 +‖fn − g‖22 ≥ (1/2) ‖g‖22 . Applied to fn, q, it provides the lower bound

Bq

[f2n, q(t) +

(fn, q −

(h∗n)βϕ( t− cq

h∗n

))2 ]dt

Page 185: Mathematical Statistics - 213.230.96.51:8090

174 12. Asymptotic Optimality in Global Norms

≥ 1

2

(h∗n)2β∫

Bq

[ϕ( t− cq

h∗n

) ]2dt =

1

2

(h∗n)2β+1 ‖ϕ ‖22 =

1

2n‖ϕ ‖22.

Finally, combining these estimates, we obtain that

maxωq ∈{0, 1}

Eωq

[ ∫

Bq

(fn, q(t)−ωq

(h∗n)βϕ(t− cq

h∗n

))dt ≥ p0‖ϕ ‖22/(2n) = r0/n

with r0 = p0‖ϕ ‖22/2. �

After these technical preparations, we are ready to formulate the mini-max lower bound for estimation of the Holder class functions in the L2-norm.

Theorem 12.7. Let the deterministic design X be defined by (7.17) with acontinuous and strictly positive density p(x) in [0, 1]. There exists a positive

constant r∗ such that for any estimator fn(t), the following asymptotic lowerbound holds:

lim infn→∞

supf∈Θ(β,L)

n2β/(2β+1)Ef

[‖ fn − f ‖22

]≥ r∗.

Proof. We use the notation introduced in Lemmas 12.5 and 12.6. Applyingthe former lemma, we obtain the inequalities

supf∈Θ(β,L)

Ef

[‖ fn − f ‖22

]≥ max

ω ∈ΩQ

Eω ‖ fn(·)− f(· ,ω) ‖22

≥ maxω∈ΩQ

Q∑

q=1

Eωq

[ ∫

Bq

[fn , q(t) − ωq

(h∗n)β

ϕ( t− cq

h∗n

) ]dt].

Note that each term in the latter sum depends only on a single componentωq. This is true for the expectation and the integrand. That is why themaximum over the binary vector ω can be split into the sum of maxima. Inview of Lemma 12.6, we can write

Q∑

q=1

maxωq ∈{0, 1}

Eωq

[ ∫

Bq

(fn, q(t)− ωq

(h∗n)βϕ( t− cq

h∗n

))dt]

≥ r0Q/n = r0 / (2h∗n n ) = (r0/2)n

−2β/(2β+1),

and the theorem follows with r∗ = r0/2. �

12.3. General Lower Bound

The proof of the lower bound in the previous sections explored the charac-teristics of the sup-norm and the L2-norm, which do not extend very far.In particular, in the proof of the lower bound in the sup-norm, we relied onthe independence of the random variables Nq,n in (12.9). A similar indepen-dence does not hold for the test functions (12.12) since their supports areoverlapping. On the other hand, the idea of Assouad’s lemma fails if we try

Page 186: Mathematical Statistics - 213.230.96.51:8090

12.3. General Lower Bound 175

to apply it to the sup-norm because the sup-norm does not split into thesum of the sup-norms over the bins.

In this section, we will suggest a more general lower bound that coversboth of these norms as special cases. As above, we consider a nonparamet-ric regression function f(x), x ∈ [0, 1], of a given smoothness β ≥ 1. Weintroduce a norm ‖f‖ of functions in the interval [0, 1]. This norm will bespecified later in each particular case.

As in the sections above, we must care about two things: a properset of the test functions, and the asymptotic performance of the respectivelikelihood ratios.

Assume that there exists a positive number d0 and a set of M + 1 testfunctions f0(x), . . . , fM (x), x ∈ [0, 1], such that any two functions fl andfm are separated by at least 2d0, that is,

(12.15) ‖ fl − fm ‖ ≥ 2d0 for any l = m, l,m = 0, . . . ,M.

The constant d0 depends on n, decreases as n → 0, and controls the rateof convergence. The number M typically goes to infinity as n → 0. Forexample, in the case of the sup-norm, we had d0 = O

((h∗n)

β)in (12.4), and

M = Q = O(1/h∗n

)where h∗n =

((lnn)/n

)1/(2β+1).

In this section, we consider the regression with the regular deterministicdesign X . Denote by Pm( · ) = Pfm( · | X ) m = 0, . . . ,M , the probabilitydistributions corresponding to a fixed design, and by Em the respectiveexpectations associated with the test function fm, m = 0, . . . ,M.

Fix one of the test functions, for instance, f0. Consider all log-likelihoodratios for m = 1, . . . ,M ,

lndP0

dPm= − 1

2σ2

n∑

i=1

[y2i − (yi − fm(xi))

2]

=1

σ

n∑

i=1

fm(xi)(−εi/σ) − 1

2σ2

n∑

i=1

f2m(xi) = σm,nNm,n − 1

2σ2m,n

where

εi = yi − f(xi) and σ2m,n = σ−2

n∑

i=1

f2m(xi).

The random variables εi and Nm,n are standard normal with respect to thedistribution Pm.

We need assumptions on the likelihood ratios to guarantee that they arenot too small as n → 0. Introduce the random events

Am = {Nm,n > 0} with Pm

(Am

)= 1/2, m = 1, . . . ,M.

Page 187: Mathematical Statistics - 213.230.96.51:8090

176 12. Asymptotic Optimality in Global Norms

Assume that there exists a constant α, 0 < α < 1, such that all the variancesσ2m,n are bounded from above,

(12.16) max1≤m≤M

σ2m,n ≤ 2α lnM.

If the random event Am takes place and the inequality (12.16) holds, then

(12.17)dP0

dPm≥ exp

{− σ2

m,n/2}

≥ exp{−α lnM} = 1/Mα.

Let fn be an arbitrary estimator of the regression function f. Define therandom events

Dm = { ‖ fn − fm ‖ ≥ d0 }, m = 0, . . . ,M.

The following lemma plays the same fundamental role in the proof ofthe lower bound as Lemma 12.2 in the case of the sup-norm.

Lemma 12.8. If the conditions (12.15) and (12.16) are satisfied, then thefollowing lower bound is true:

max0≤m≤M

Pm

(Dm

)≥ 1/4.

Proof. To start with, note that

Pm

(Dm

)= Pm

(DmAm

)+ Pm

(DmAm

)

≤ Pm

(DmAm

)+ Pm

(Am

)= Pm

(DmAm

)+ 1/2,

which implies the inequality

(12.18) Pm

(DmAm

)≥ Pm

(Dm

)− 1/2.

Next, the following inclusion is true:

(12.19)

M⋃

m=1

Dm ⊆ D0

where the random events Dm are mutually exclusive. Indeed, if the normof the difference ‖fn − fm‖ is strictly less than d0 for some m, then by the

triangle inequality and (12.15), the norm ‖fn − fl‖ is not smaller than d0for any l = m. The inclusion (12.19) makes use of this fact for l = 0.

It immediately follows that

P0

(D0

)≥ P0

( M⋃

m=1

Dm

)=

M∑

m=1

P0

(Dm

)=

M∑

m=1

Em

[ dP0

dPmI(Dm

) ]

≥M∑

m=1

Em

[ dP0

dPmI(DmAm

) ]≥ 1

M∑

m=1

[Pm

(Dm

)− 1/2

].

In the latter inequality, we used (12.17).

Page 188: Mathematical Statistics - 213.230.96.51:8090

12.4. Examples and Extensions 177

The final step of the proof is straightforward. The maximum is estimatedfrom below by a mean value,

max0≤m≤M

Pm

(Dm

)≥ 1

2

[P0

(D0

)+

1

M

M∑

m=1

Pm

(Dm

) ]

≥ 1

2

[ 1

M∑

m=1

[Pm

(Dm

)− 1/2

]+

1

M

M∑

m=1

Pm

(Dm

) ]

≥ 1

2M

M∑

m=1

[Pm

(Dm

)+ Pm

(Dm

)− 1/2

]= 1/4. �

As a consequence of Lemma 12.8, we obtain a general lower bound.

Theorem 12.9. If the conditions (12.15) and (12.16) are satisfied, then

for any estimator fn and for all n large enough, the following lower boundholds:

(12.20) supf ∈Θ(β)

Ef

[‖ fn − f ‖

]≥ d0/4.

Proof. Applying Lemma 12.8, we obtain that

supf ∈Θ(β)

Ef

[‖ fn − f ‖

]≥ max

0≤m≤MEm

[‖ fn − fm ‖

]

≥ d0 max0≤m≤M

Pm

(Dm

)≥ d0/4. �

12.4. Examples and Extensions

Example 12.10. The sup-norm risk. In the case of the sup-norm, the testfunctions are defined by (12.3) with M = Q. The condition (12.15) followsfrom (12.5) and (12.6) with d0 = (1/2)(h∗n)

β‖ϕ‖∞. Note that for all large nthe following inequality holds:

lnQ =1

2β + 1

(lnn− ln lnn

)− ln 2 ≥ 1

2(2β + 1)lnn.

In view of (12.11), the variance σ2q, n in the expansion (12.9) of ln

(dPq/dP0

)

is bounded from above uniformly in q = 1, . . . , Q,

σ2q, n ≤ c1 lnn ≤ 2(2β + 1)c1 lnQ ≤ 2α lnQ = 2α lnM.

The latter inequality holds if the constant c1 = 2σ−2p∗c0 is so small that(2β + 1)c1 < α. Such a choice of c1 is guaranteed because c0 is however

Page 189: Mathematical Statistics - 213.230.96.51:8090

178 12. Asymptotic Optimality in Global Norms

small. Thus, the condition (12.16) is also fulfilled. Applying Theorem 12.9,we get the lower bound

supf ∈Θ(β)

Ef

[ψ−1n ‖ fn − f ‖∞

]≥ 1

8(h∗n)

β‖ϕ‖∞ = r∗ψn

with the constant r∗ = (1/8)‖ϕ‖∞, and the rate of convergence ψn definedin (12.1). �

Unlike the case of the upper bounds in Chapter 9, “bad” designs do notcreate a problem in obtaining the lower bound in the sup-norm. Intuitively itis understandable because when we concentrate more design points in somebins, we loose them in the other bins. This process reduces the precision ofthe uniform estimation of the regression function. In a sense, the uniformdesign is optimal if we estimate the regression in the sup-norm. We willprove some results in support of these considerations.

Let a design X be of any kind, not necessarily regular. Assume that thereexists a subset M = M(X ) ⊆ { 1, . . . ,M } such that for some α ∈ (0, 1) thefollowing inequality holds:

(12.21) maxm∈M

σ2m,n ≤ 2α lnM.

Let |M| denote the number of elements in M. It turns out that Lemma 12.8remains valid in the following modification.

Lemma 12.11. If the conditions (12.15) and (12.21) are satisfied, then thefollowing lower bound holds:

max0≤m≤M

Pm

(Dm

)≥ |M|

4M.

Proof. Repeating the proof of Lemma 12.8, we find that

P0

(D0

)≥ P0

( M⋃

m=1

Dm

)=

M∑

m=1

P0

(Dm

)=

M∑

m=1

Em

[ dP0

dPmI(Dm

) ]

≥∑

m∈MEm

[ dP0

dPmI(DmAm

) ]≥ 1

m∈M

(Pm

(Dm

)− 1/2

)

where we have used the inequality (12.17). Under (12.21), this inequalityapplies only to the indices m ∈ M. Continuing as in Lemma 12.8, we obtainthe bound

max0≤m≤M

Pm

(Dm

)≥ 1

2

[P0

(D0

)+

1

M

M∑

m=1

Pm

(Dm

) ]

≥ 1

2M

m∈M

[Pm

(Dm

)+ Pm

(Dm

)− 1/2

]=

|M|4M

. �

Page 190: Mathematical Statistics - 213.230.96.51:8090

12.4. Examples and Extensions 179

Example 12.12. The sup-norm risk (cont’d). For an arbitrary design X ,the bound (12.11) is no longer true. But it turns out (see Exercise 12.82)that for any design X and for any α ∈ (0, 1), there exists a “bump” functionϕ and a subset M = M(X ) ⊆ { 1, . . . , Q } such that

(12.22) |M| ≥ Q/2 and maxq∈M

σ2q, n ≤ 2α lnQ.

From (12.22) and Lemma 12.11, analogously to the proof of Theorem 12.9,we derive the lower bound for any design X ,

(12.23) supf ∈Θ(β)

Ef

[ψ−1n ‖ fn − f ‖∞

]≥ |M|

4Qd0 ≥ d0

8=

1

16(h∗n)

β‖ϕ‖∞. �

Next, we will study the case of the L2-norm risk.

Example 12.13. The L2-norm risk. Consider the test functions f(t, ω), ω ∈Ω, defined in (12.12). For any two functions f(t, ω′) and f(t, ω′′), the log-likelihood function has the representation

(12.24) lnPf(·,ω′)

Pf(·,ω′′)= σnNn − 1

2σ2n

where Nn = Nn(ω′ , ω′′) is a standard normal random variable with respect

to the distribution controlled by the test function f(·, ω′′), and

σ2n = σ2

n(ω′, ω′′) = σ−2

n∑

i=1

(f(xi, ω

′)− f(xi, ω′′))2

where the xi’s are the design points (see Exercise 12.84). From the definitionof the test functions, the variance σ2

n can be bounded from above by

σ2n = σ−2 (h∗n)

2βQ∑

q=1

|ω′q − ω′′

q |∑

xi ∈Bq

ϕ2(xi − cq

h∗n

)

= σ−2‖ϕ‖22Q∑

q=1

|ω′q − ω′′

q |p(cq)(1 + oq, n(1))

(12.25) ≤ σ−2‖ϕ‖22Q(1 + on(1)

)≤ 2σ−2‖ϕ‖22Q.

In the above, oq, n(1) → 0 as n → ∞ uniformly in q, 1 ≤ q ≤ Q. Also, webounded |ω′

q − ω′′q | by 1, and used the fact that the Riemann sum of the

design density approximates the integral

Q−1Q∑

q=1

p(cq) =

∫ 1

0p(x) dx + on(1) = 1 + on(1).

Page 191: Mathematical Statistics - 213.230.96.51:8090

180 12. Asymptotic Optimality in Global Norms

Next, we have to discuss the separation condition (12.15). For any testfunctions, the L2-norm of the difference is easy to find,

(12.26) ‖ f(xi, ω′)− f(xi, ω′′) ‖22 =

1

n‖ϕ‖22

Q∑

q=1

|ω′q − ω′′

q |.

At this point, we need a result which will be proved at the end of this section.

Lemma 12.14. (Warshamov-Gilbert) For all Q large enough, there ex-

ists a subset Ω0, Ω0 ⊂ Ω, with the number of elements no less than 1 + eQ/8

and such that for any ω′ ,ω′′ ∈ Ω0, the following inequality holds:

Q∑

q=1

|ω′q − ω′′

q | ≥ Q/8.

Continuing with the example, let M = eQ/8. From Lemma 12.14 and(12.26), we see that there exist M + 1 test functions such that for any twoof them,

‖ f(xi, ω′)− f(xi, ω′′) ‖22 =

Q

8n‖ϕ‖22 = (2d0)

2

where

d0 =1

2‖ϕ‖2

√Q

8n=

1

2‖ϕ‖2

1√16h∗nn

=1

8‖ϕ‖2(h∗n)β.

Hence the condition (12.15) is fulfilled with this d0. We arbitrarily choosef0 = f(t, ω0) for some ω0 ∈ Ω0, and take M as a set of the rest of thefunctions with ω ∈ Ω0. In this case, |M| = M = eQ/8.

Finally, we have to verify the condition (12.16). If we choose a “bump”function ϕ such that ‖ϕ‖22 = σ2α/8 where α is any number, 0 < α < 1, thenit follows from (12.25) that

σ2n ≤ 2σ−2‖ϕ‖22Q = 2α ln(eQ/8) = 2α lnM.

Theorem 12.9 applies, and the lower bound of the L2-norm risk follows forall large n,

supf ∈Θ(β)

Ef

[‖fn − f‖2

]≥ 1

4d0 =

1

32‖ϕ‖2(h∗n)β = r∗ n

−β/(2β+1) . �

Proof of Lemma 12.14. Define the binary vectors

ωm =(ω1,m, . . . , ωQ,m

), m = 0, . . . ,M,

with the independent Bernoulli(1/2) random components ωq,m. Note thatfor any l = m, the random variables ξq = |ωq, l−ωq,m| are also Bernoulli(1/2),and are independent for different q.

Page 192: Mathematical Statistics - 213.230.96.51:8090

12.4. Examples and Extensions 181

Next, the elementary inequalities and the choice of M yield that

P

( ⋂

0≤ l <m≤M

{ Q∑

q=1

|ωq, l − ωq,m|}

≥ Q/8)

= 1 − P

( ⋃

0≤ l <m≤M

{ Q∑

q=1

|ωq, l − ωq,m|}

< Q/8)

≥ 1 − M(M + 1)

2P

( Q∑

q=1

ξq < Q/8)

≥ 1 − eQ/4P

( Q∑

q=1

ξq < Q/8)

≥ 1 − eQ/4P

( Q∑

q=1

ξ′q > (3/8)Q)

where we denoted by ξ′q = 1/2−ξq the random variables that take on values±1/2 with equal probabilities.

Further, Chernoff’s inequality P(X ≥ a) ≤ e−zaE[ezX], a, z > 0, en-

sures that for any positive z,

P

( Q∑

q=1

ξ′q > (3/8)Q)≤(E[exp

{zξ′q}] )Q

exp{− (3/8)zQ

}.

The moment generating function of ξ′q satisfies the inequality (see Exercise12.85)

(12.27) E[exp

{zξ′q}]

=1

2

(exp{z/2} + exp{−z/2}

)≤ exp{z2/8}.

Take z = 3/2. Then

P

( Q∑

q=1

ξ′q > (3/8)Q)

≤ exp{− (9/32)Q

}.

Hence,

P

( ⋂

0≤ l <m≤M

{ Q∑

q=1

|ωq, l−ωq,m|}

≥ Q/8)

≥ 1 − exp{(1

4− 9

32

)Q}

> 0.

This proves the lemma, because what happens with a positive probabilityexists. �

Page 193: Mathematical Statistics - 213.230.96.51:8090

182 12. Asymptotic Optimality in Global Norms

Exercises

Exercise 12.82. Prove (12.22).

Exercise 12.83. Use (12.22) to prove Theorem 12.4.

Exercise 12.84. Verify (12.24).

Exercise 12.85. Prove (12.27).

Exercise 12.86. Let the design X be equidistant, that is, with the designpoints xi = i/n. Show by giving an example that the following lower boundis false. For any large c there exists a positive p0 independent of n such thatfor all large n, the following inequality holds:

inffn

supf ∈Θ(β)

Pf

(‖fn − f‖2 ≥ cn−β/(2β+1)

∣∣∣X)

≥ p0.

Hint: Consider the case β = 1, and let f∗n be a piecewise constant estimator

in the bins. Show that the above probability goes to zero as n increases.

Page 194: Mathematical Statistics - 213.230.96.51:8090

Part 3

Estimation inNonparametric Models

Page 195: Mathematical Statistics - 213.230.96.51:8090
Page 196: Mathematical Statistics - 213.230.96.51:8090

Chapter 13

Estimation ofFunctionals

13.1. Linear Integral Functionals

As in the previous chapters, here we consider the observations of a regres-sion function f in the presence of the Gaussian random noise. To ease thepresentation, we concentrate on the case of the equidistant design,

(13.1) yi = f(i/n) + εi , εi ∼ N (0, σ2), i = 1, . . . , n.

So far we have studied the estimation problem of the regression function.We found that the typical parametric

√n-rate of convergence is not attain-

able in nonparametric setup. The typical minimax rate under smoothnessparameter β equals ψn = n−β/(2β+1). Note that the exponent β/(2β+1) ap-proaches 1/2 as β goes to infinity. Thus, for a very smooth nonparametricregression, the rate of convergence is close to the typical parametric rate.

In this section we focus on estimating an integral functional of the re-

gression function, for example, Ψ(f) =∫ 10 f(x)dx. We address the question:

What is the minimax rate of convergence in this estimation problem? Wewill show that

√n is a very common rate of convergence.

We start with the easiest problem of a linear integral functional

(13.2) Ψ(f) =

∫ 1

0w(x)f(x) dx

where w(x) is a given Lipschitz function, called the weight function, and f =f(x) is an unknown regression observed with noise as in (13.1). Along with

185

Page 197: Mathematical Statistics - 213.230.96.51:8090

186 13. Estimation of Functionals

the integral notation, we will use the dot product notation, Ψ(f) = (w, f)

and ‖w‖22 =∫ 10 w2(x) dx.

Note that Ψ(f) defined by (13.2) is a linear functional, that is, for anyf1 and f2, and any constants k1 and k2, the following identity holds:

Ψ(k1f1 + k2f2) =

∫ 1

0w(x)

(k1f1(x) + k2f2(x)

)dx

(13.3) = k1

∫ 1

0w(x)f1(x) dx + k2

∫ 1

0w(x)f2(x) dx = k1Ψ(f1) + k2Ψ(f2).

Define an estimator of Ψ(f) by

(13.4) Ψn =1

n

n∑

i=1

w(i/n)yi.

Example 13.1. If w(t) = 1, then Ψ(f) =∫ 10 f(x) dx. Assume that f ∈

Θ(β, L, L1) with some β ≥ 1, which yields that f is a Lipschitz function.In this case, the trivial estimator, the sample mean, turns out to be

√n-

consistent,

Ψn =(y1 + · · ·+ yn

)/n =

∫ 1

0f(x)dx + O(n−1) + σZ0/

√n

where Z0 is a standard normal random variable, and O(n−1) representsthe deterministic error of the Riemann sum approximation. Note that thisdeterministic error is uniform over f ∈ Θ(β, L, L1). �

Next, we state a proposition the proof of which is straightforward (seeExercise 13.87).

Proposition 13.2. For all β ≥ 1 and any f ∈ Θ(β, L, L1), the bias and thevariance of the estimator (13.4) are respectively equal to

bn = Ef [Ψn] − Ψ(f) = O(n−1)

and

Varf [Ψn] =σ2

n

∫ 1

0w2(x) dx + O(n−2).

Corollary 13.3. It immediately follows from Proposition 13.2 that for anyf ∈ Θ(β, L, L1), the following limit exists:

limn→∞

Ef

[√n(Ψn −Ψ(f)

)]2= σ2

∫ 1

0w2(x) dx = σ2‖w‖22.

A legitimate question is whether it is possible to improve the result ofProposition 13.2 and to find another estimator with an asymptotic variancesmaller than σ2 ‖w‖22. As we could anticipate, the answer to this question isnegative. To prove the lower bound, we need the following auxiliary result.

Page 198: Mathematical Statistics - 213.230.96.51:8090

13.1. Linear Integral Functionals 187

Lemma 13.4. Let the Yi’s be independent observations of a location param-eter θ in the non-homogeneous Gaussian model

Yi = θ μi + εi, εi ∼ N (0, σ2i ), i = 1, . . . , n,

with some constant μi’s. Assume that there exists a strictly positive limit

I∞ = limn→∞ n−1In > 0 where In =∑n

i=1

(μi/σi

)2is the Fisher informa-

tion. Then for any estimator θn of the location parameter θ, the followinglower bound holds:

lim infn→∞

supθ ∈R

[√nI∞(θn − θ)

]2= lim inf

n→∞supθ ∈R

[√In(θn − θ)

]2 ≥ 1.

Proof. The statement of the lemma is the Hajek-LeCam lower bound. Notethat the Fisher information in the non-homogeneous Gaussian model is

equal to In =∑n

i=1

(μi/σi

)2, and the log-likelihood ratio is normal non-

asymptotically,

Ln(θ) = Ln(0) =n∑

i=1

ln(f(Xi, θ)/f(Xi, 0)

)=√

InZ0,1θ − 1

2Inθ

2

where Z0,1 is a standard normal random variable with respect to the truedistribution P0. Thus, as in the Hajek-LeCam case, we have the lower bound

lim infn→∞

supθ ∈R

[√In(θn − θ)

]2 ≥ 1. �

Now we return to the functional estimation problem. Consider a one-parameter family of regression functions

Θ ={f(x, θ) = θw(x)/‖w‖22, θ ∈ R, ‖w‖22 > 0

}

where w(x) is the weight function specified in (13.2). For this family ofregression functions, the functional Ψ

(f( · , θ)

)coincides with θ identically,

Ψ(f( · , θ)

)=

∫ 1

0w(x)f(x, θ) dx =

∫ 1

0w(x) θ

w(x)

‖w‖22dx = θ.

Hence for this family of regression functions, the estimation of Ψ(f) is equiv-alent to estimation of θ from the following observations:

(13.5) yi = θ w(i/n)/‖w‖22 + εi , εi ∼ N (0 σ2), i = 1, . . . , n.

Theorem 13.5. For any estimator Ψn from the observations (13.5), thefollowing asymptotic lower bound holds:

(13.6) lim infn→∞

supf(·, θ)∈Θ

Ef(·, θ)[√

n(Ψn −Ψ

(f( · , θ)

) ) ]2≥ σ2‖w‖22.

Page 199: Mathematical Statistics - 213.230.96.51:8090

188 13. Estimation of Functionals

Proof. Applying Lemma 13.4 with μi = w(i/n)/‖w||22 and Yi = yi, we findthat the Fisher information in this case is expressed as

In =n∑

i=1

(μi/σi

)2=

n∑

i=1

w2(i/n)

σ2‖w‖42

=n

σ2‖w‖221

n

n∑

i=1

w2(i/n)

‖w‖22=

n(1 + on(1))

σ2‖w‖22.

Here we used the fact that the latter sum is the Riemann sum for the integral∫ 10 w2(x)/‖w‖22 dx = 1. Thus, from Lemma 13.4, the lower bound follows

lim infn→∞

supf(·, θ)∈Θ

Ef(·, θ)[√

In

(Ψn −Ψ

(f( · , θ)

) ) ]2

= lim infn→∞

supθ∈R

[√

n(1 + on(1))

σ2‖w‖22(Ψn − θ

) ]2≥ 1,

which is equivalent to (13.6). �

13.2. Non-Linear Functionals

As an example, suppose we want to estimate the square of the L2-norm ofthe regression function f , that is, we want to estimate the integral quadraticfunctional

(13.7) Ψ(f) = ‖f‖22 =∫ 1

0f2(x) dx .

Clearly, this is a non-linear functional of f since it does not satisfy thelinearity property (13.3), though it is very smooth. Can we estimate it√n-consistently? The answer is positive. The efficient estimator of the

functional (13.7) is discussed in Example 13.6 below. Now we turn to generalsmooth functionals.

A functional Ψ(f) is called differentiable on a set of functions Θ, if forany fixed function f0 ∈ Θ and for any other function f in Θ, the followingapproximation holds:

(13.8) Ψ(f) = Ψ(f0) + Ψ′f0(f − f0) + ρ(f, f0)

where Ψ′f0(f−f0) is the first derivative of Ψ applied to the difference f−f0.

The functional Ψ′f0

is a linear functional that depends on f0. Moreover, we

assume that for any function g = g(x),

Ψ′f0(g) =

∫ 1

0w(x, f0) g(x) dx,

Page 200: Mathematical Statistics - 213.230.96.51:8090

13.2. Non-Linear Functionals 189

where w(x, f0) is a Lipschitz function of x and a continuous functional off0, that is,

|w(x1, f0)− w(x2, f0)| ≤ L|x1 − x2|with a Lipschitz constant L independent of f0, and

‖w( · , f)− w( · , f0)‖2 → 0 as ‖f − f0‖2 → 0.

The remainder term ρ(f, f0) in (13.8) satisfies the inequality

(13.9) ρ(f, f0) ≤ Cρ‖f − f0‖22with some positive constant Cρ independent of f and f0. Since the functionalΨ(f) is known, the weight function of its derivative w( · , f0) is also knownfor all f0.

Example 13.6. Consider the quadratic functional (13.7). From the identity

‖f‖22 = ‖f0‖22 + 2(f0, f − f0) + ‖f − f0‖22,we have the explicit formula for the derivative

Ψ′f0(g) = 2(f0, g) = 2

∫ 1

0f0(x)g(x) dx.

This formula implies that the weight function w(x, f0) = 2f0(x). The weightfunction is a Lipschitz function if f0 belongs to a class of Lipschitz functions.The remainder term in this example, ρ(f0, f) = ‖f − f0‖22, meets the con-dition (13.9) with Cρ = 1. �

The next proposition claims that a differentiable functional can be esti-mated

√n-consistently and describes the asymptotic distribution.

Theorem 13.7. Assume that the regression function f ∈ Θ(β, L, L1) withsome β ≥ 1. Let Ψ(f) be a differentiable functional on Θ(β, L, L1). Thereexists an estimator Ψ∗

n such that the Pf -distribution of its normalized erroris asymptotically normal as n → ∞,

√n(Ψ∗

n −Ψ(f))→ N

(0, σ2‖w( · , f)‖22

).

Proof. Split n observations into the two sub-samples of size m and n−m,respectively, with m = nα , 5/6 ≤ α < 1. Assume that n/m is an inte-ger. Define the first sub-sample by the equidistant design points, J1 ={ 1/m, 2/m, . . . ,m/m}. Let the second sub-sample J2 be composed of therest of the design points. Note that J2 is not necessarily regular and itcontains almost all the points, |J2| = n(1 − nα−1) = n

(1 + on(1)

). From

Theorem 10.3, even in the case of the smallest smoothing parameter, β = 1,we can choose an estimator f∗

n of f so that uniformly in f,

Ef

[‖f∗

n − f‖22]≤ r∗m−2/3 = r∗n−2α/3 ≤ r∗n− 5/9 = o

(1/

√n).

Page 201: Mathematical Statistics - 213.230.96.51:8090

190 13. Estimation of Functionals

If f∗n is not a Lipschitz function, we replace it by the projection onto the

set Θ(β, L, L1), which is a convex set. So that we may assume that f∗n is a

Lipschitz function and

(13.10) limn→∞

√nEf

[‖f∗

n − f‖22]= 0.

Introduce an estimator of the functional Ψ by

(13.11) Ψ∗n = Ψ(f∗

n) +1

n

i∈ J2

w(i/n, f∗n) yi −

∫ 1

0w(x, f∗

n)f∗n(x) dx.

Note that Ψ∗n is computable from the data. The smaller portion consisting

of the m observations is used in the preliminary approximation of f by f∗n,

while the larger portion of the n −m observations is used in estimation ofthe derivative Ψ′

f∗n, which is a linear functional. This linear functional is

estimated similarly to (13.4) by n−1∑

i∈ J2w(i/n, f∗

n) yi. From (13.11), bythe definition of a differentiable functional, we obtain that

√n[Ψ∗

n−Ψ(f)]=

√n[Ψ(f∗

n) +1

n

i∈ J2

w(i/n, f∗n) yi −

∫ 1

0w(x, f∗

n)f∗n(x) dx

−(Ψ(f∗

n) +

∫ 1

0w(x, f∗

n

)(f(x)− f∗

n(x))dx + ρ(f, f∗

n)) ]

(13.12) =√n[ 1n

i∈ J2

w(i/n, f∗n) yi −

∫ 1

0w(x, f∗

n)f(x) dx]+√nρ(f, f∗

n).

In view of (13.10), the remainder term in (13.12) is vanishing as n → ∞,√nEf

[|ρ(f, f∗

n)|]≤ Cρ

√nEf

[‖f∗

n − f‖22]→ 0.

The normalized difference of the sum and the integral in (13.12) is nor-mal with the expectation going to zero as n → ∞, and the variance that,conditionally on f∗

n, is equal to

σ2

n

i∈ J2

w2(i/n, f∗n) =

σ2

n

n∑

i=1

w2(i/n, f∗n) + O(m/n)

= σ2

∫ 1

0w2(x, f∗

n) dx + on(1).

Here we used the fact that m/n = nα−1 = on(1) → 0 as n → 0. By theassumption, the weight function w( · , f0) is continuous in f0, and

∫ 1

0w2(x, f∗

n) dx →∫ 1

0w2(x, f) dx = ‖w( · , f)‖22

as ‖f∗n − f‖2 → 0. Hence, the result of the theorem follows. �

Page 202: Mathematical Statistics - 213.230.96.51:8090

Exercises 191

Example 13.8. Consider again the integral quadratic functional Ψ definedby (13.7). We apply (13.11) to get an explicit expression for the estimatorΨ∗

n of this functional. From Example 13.6, the weight function w(i/n, f∗n) =

2f∗n(i/n), therefore,

Ψ∗n = ‖f∗

n‖22 +2

n

i∈J2

f∗n(i/n) yi − 2

∫ 1

0

(f∗n(x)

)2dx

=2

n

i∈ J2

f∗n(i/n) yi − ‖f∗

n‖22. �

Remark 13.9. If the preliminary estimator f∗n in (13.11) satisfies the con-

ditionnEf

[‖f∗

n − f‖44]= nEf

[ ∫ 1

0(f∗

n(x)− f(x))4 dx]→ 0

as n → ∞, then the estimator Ψ∗n converges in the mean square sense as

well (see Exercise 13.90),

(13.13) limn→∞

Ef

[ (√n(Ψ∗

n −Ψ(f)) )2 ]

= σ2‖w( · , f)‖22. �

Exercises

Exercise 13.87. Verify the statement of Proposition 13.2.

Exercise 13.88. Let a regression function f ∈ Θ(β, L, L1), β ≥ 1, serve asthe right-hand side of the differential equation

Ψ′(x) + Ψ(x) = f(x), x ≥ 0,

with the initial condition Ψ(0) = 0. Assume that the observations yi, i =1, . . . , n, of the regression function f satisfy (13.1). Estimate the solutionΨ(x) at x = 1. Find the asymptotics of the estimate’s bias and variance asn → ∞ . Hint: Check that Ψ(x) = e−x

∫ x0 etf(t) dt.

Exercise 13.89. Show that Ψ(f) =∫ 10 f4(x) dx is a differentiable functional

of the regression function f ∈ Θ(β, L, L1), β ≥ 1.

Exercise 13.90. Prove (13.13).

Exercise 13.91. Consider the observations yi = f(xi) + εi, i = 1, . . . , n,with a regular design governed by a density p(x). Show that the sample meany = (y1 + · · · + yn)/n is the minimax efficient estimator of the functional

Ψ(f) =∫ 10 f(x)p(x) dx.

Page 203: Mathematical Statistics - 213.230.96.51:8090
Page 204: Mathematical Statistics - 213.230.96.51:8090

Chapter 14

Dimension andStructure inNonparametricRegression

14.1. Multiple Regression Model

In this chapter, we revise the material of Chapters 8 and 9, and extend it tothemultiple regression model. As in (8.1), our starting point is the regressionequation Y = f(X) + ε, ε ∼ N (0, σ2). The difference is that this time the

explanatory variable is a d-dimensional vector, X =(X(1), . . . , X(d))′ ∈ R

d .We use the upper index to label the components of this vector. A set ofn observations has the form { (x1, y1), . . . , (xn, yn) } where each regressor

xi =(x(1)i , . . . , x

(d)i

)′is d-dimensional. The regression equation looks similar

to (8.2),

(14.1) yi = f(xi) + εi, i = 1, . . . , n,

where εi∼N (0, σ2) are independent normal random errors. The regressionfunction f : R

d → R is a real-valued function of d variables. This functionis unknown and has to be estimated from the observations (14.1). Assumethat the regressors belong to the unit cube in R

d, xi ∈ [0, 1]d , i = 1, . . . , n.The design X = {x1, . . . ,xn } can be deterministic or stochastic, though inthis chapter we prefer to deal with the regular deterministic designs.

Our principal objective is to explain the influence of the dimension don the asymptotically minimax rate of convergence. We restrict ourselves

193

Page 205: Mathematical Statistics - 213.230.96.51:8090

194 14. Dimension and Structure in Nonparametric Regression

to the estimation problem of f(x0) at a fixed point x0 = (x(1)0 , . . . , x

(d)0 )′

located strictly inside the unit cube [0, 1]d. The asymptotically minimax rateof convergence ψn, defined by (8.4) and (8.5), is attached to a Holder classΘ(β, L, L1) of regression functions. Thus, we have to extend the definitionof the Holder class to the multivariate case. The direct extension via thederivatives as in the one-dimensional case is less convenient since we wouldhave to deal with all mixed derivatives of f up to a certain order. A morefruitful approach is to use the formula (8.14) from Lemma 8.5 as a guideline.

Let β be an integer, β ≥ 1, and let ‖ · ‖ denote the Euclidean normin R

d. A function f(x), x ∈ [0, 1]d, is said to belong to a Holder class offunctions Θ(β) = Θ(β, L, L1) if: (i) there exists a constant L1 > 0 such thatmaxx∈[0,1]d |f(x)| ≤ L1, and (ii) for any x0 ∈ [0, 1]d there exists a polynomial

p(x) = p(x,x0, f) of degree β − 1 such that∣∣ f(x) − p(x,x0 , f)

∣∣ ≤ L‖x − x0‖β

for any x ∈ [0, 1]d, with a constant L independent of f and x0.

To estimate the regression function at a given point x0 we can applyany of the methods developed in the previous chapters. Let us consider thelocal polynomial approximation first. Take a hypercube Hn centered at x0,

Hn ={x = (x(1), . . . , x(d))′ : |x(1) − x

(1)0 | ≤ hn , . . . , |x(d) − x

(d)0 | ≤ hn

}

where the bandwidth hn → 0 as n → ∞. If x0 belongs to an open cube(0, 1)d, then for any large n, the hypercube Hn is a subset of [0, 1]d.

Consider a polynomial π(x), x ∈ [0, 1]d, of degree β−1. Note that in the

case of d predictor variables, there are exactly(i+d−1

i

)monomials of degree

i ≥ 0 (see Exercise 14.92). That is why there are k = k(β, d) coefficientsthat define the polynomial π(x) where

k = k(β, d) =

β−1∑

i=0

(i+ d− 1

i

)=

1

(d− 1)!

β−1∑

i=0

(i+ 1) . . . (i+ d− 1).

We denote the vector of these coefficients by θ ∈ Rk, and explicitly mention

it in the notation for the polynomial π(x) = π(x, θ).

Example 14.1. Let d = 2 and β = 3. A polynomial of degree β − 1 = 2has a general form

π(x, θ) = θ0 + θ1x(1) + θ2x

(2) + θ3(x(1))2 + θ4x

(1) x(2) + θ5(x(2))2

with the vector of unknown coefficients θ = (θ0 , θ1 , θ2 , θ3 , θ4 , θ5)′. To ver-

ify the dimension of θ, compute

k = k(3 , 2) =2∑

i=0

(i+ 1

i

)=

(1

0

)+

(2

1

)+

(3

2

)= 1 + 2 + 3 = 6. �

Page 206: Mathematical Statistics - 213.230.96.51:8090

14.1. Multiple Regression Model 195

For any x in the hypercubeHn centered at x0, we rescale this polynomialto get π

((x−x0)/hn, θ

). Suppose there are N pairs of observations (xi , yi)

such that the design points belong to Hn. Without loss of generality, we mayassume that these are the first N observations, x1, . . . ,xN ∈ Hn. The vectorθ can be estimated by the method of least squares. The estimator θ is thesolution of the minimization problem (cf. (9.1)),

(14.2)N∑

i=1

[yi − π

((x− x0)/hn , θ

) ]2→ min

θ.

As in Section 9.1, we can define a system of normal equations (9.2)where G is the design matrix with dimensions N ×k. The columns of G arecomposed of the monomials in the polynomial π

((x− x0)/hn, θ

)evaluated

at the N design points.

Let Assumption 9.2 hold, that is, we assume that the elements of the

matrix(G′G

)−1are bounded from above by γ0N

−1 with a constant γ0independent of n. Clearly, this assumption is a restriction on the design X .The next proposition is a simplified version of Proposition 9.4.

Proposition 14.2. Suppose Assumption 9.2 holds. Let θ0 = π(0, θ) be the

estimate of the intercept, that is, the first component of the solution θ of(14.2). Then uniformly in f ∈ Θ(β, L, L1) , we have that

θ0 − f(x0) = b0 + N0

where |b0| ≤ Cbhβn, and N0 is a zero-mean normal random variable with

the variance Varf[N0 | X

]≤ Cv/N . The positive constants Cb and Cv are

independent of n.

Proof. From the definition of the Holder class Θ(β, L, L1), we obtain (cf.(9.3)),

yi = p(xi,x0, f) + ρ(xi,x0, f) + εi , i = 1, . . . , n,

where the remainder term

|ρ(xi,x0, f)| ≤ ‖xi − x0‖β ≤(√

dhn)β.

Repeating the proofs of Lemmas 9.1 and 9.3, we find that the least-squares estimator θ actually estimates the vector of coefficients of the poly-nomial p(x,x0, f) in the above approximation. The deterministic error here

that does not exceed Cbhβn, and the zero-mean normal stochastic term has

the variance that is not larger than Cv/N. By definition, the zero-order termof the approximation polynomial θ0 is equal to f(x0). Hence, the estimateof the intercept

θ0 = θ0 + b0 + N0 = f(x0) + b0 + N0

Page 207: Mathematical Statistics - 213.230.96.51:8090

196 14. Dimension and Structure in Nonparametric Regression

satisfies the claim of the proposition. �

Finally, we have arrived at the point where the influence of the higherdimension shows up. In Section 9.1, to obtain the minimax rate of con-vergence, we needed Assumption 9.5 which helped to control the stochasticterm. This assumption required that the number N of the design points inthe hn-neighborhood of the given point x0 is proportional to nhn. Clearly,this assumption was meant to meet the needs of regular designs, determin-istic or random. So, the question arises: How many design points can weanticipate in the regular cases in the d-dimensional Hn-neighborhood of x0?Simple geometric considerations show that at best we can rely on a numberproportional to the volume of Hn.

Assumption 14.3. There exists a positive constant γ1, independent of n,such that for all large enough n, the inequality N ≥ γ1nh

dn holds. �

Now we are in the position to formulate the upper bound result.

Theorem 14.4. Suppose that the design X satisfies the conditions of As-sumptions 9.2 and 14.3 with hn = h∗n = n−1/(2β+d). Given X , the quadratic

risk of the local polynomial estimator θ0 = π(0, θ) described in Proposition14.2 admits the upper bound

supf ∈Θ(β,L,L1)

Ef

[ (π( 0, θ) − f(x0)

)2 ∣∣∣X]≤ r∗n−2β/(2β+d)

where a positive constant r∗ is independent of n.

Proof. Analogously to the proof of Theorem 9.6, from Proposition 14.2, theupper bound of the quadratic risk holds uniformly in f ∈ Θ(β, L, L1),

Ef

[ (π(0, θ) − f(x0)

)2 ∣∣∣X]≤ C2

b (hn)2β +

Cv

N≤ C2

b (hn)2β +

Cv

γ1nhdn.

The balance equation in the d-dimensional case has the form (hn)2β =

1/(nhdn). The optimal choice of the bandwidth is hn = h∗n = n−1/(2β+d), and

the respective rate of convergence is (h∗n)β = n−β/(2β+d). �

14.2. Additive regression

The minimax rate of convergence ψn = n−β/(2β+d) in a d-dimensional Holderregression rapidly slows down as d increases. One way to overcome this“curse of dimensionality” is to assume that the smoothness also grows withd. Indeed, if β = dβ1, then the exponent in the rate of convergence β/(2β+d) = β1/(2β1 + 1) matches the one-dimensional case with the smoothnessparameter β1. However, this assumption is very restrictive.

Page 208: Mathematical Statistics - 213.230.96.51:8090

14.2. Additive regression 197

Another approach is to impose some constraints on the structure ofthe regression model. Here we consider one example, a so-called additiveregression model. To understand the role of a higher dimension, it sufficesto look at the case d = 2 and a very basic regular design. Suppose thatin the two-dimensional regression model, the design X is equidistant. Letm =

√n be an integer. The design points, responses, and random errors are

all labeled by two integer indices i and j, where i, j = 1, . . . ,m. We denotethe design points by

(14.3) xij = (i/m, j/m).

Thus, the regression relation takes the form

(14.4) yij = f(i/m, j/m) + εij

where εij are independent N (0, σ2) random variables.

In the additive regression model, the regression function is the sum oftwo functions, both of which depend on one variable,

(14.5) f(x) = f(x(1), x(2)) = f1(x(1)) + f2(x

(2))

where f1 and f2 are the Holder functions of a single variable, f1, f2 ∈Θ(β, L, L1).

This definition of the additive model is not complete, since we can alwaysadd a constant to one term and subtract it from the other one. To makethe terms identifiable, we impose the following conditions:

(14.6)

∫ 1

0f1(t) dt =

∫ 1

0f2(t) dt = 0.

Let x0 be a fixed point strictly inside the unit square. Without loss ofgenerality, we will assume that this point coincides with one of the designknots, x0 = (i0/m, j0/m). Clearly, we could treat the model of observations(14.4) as a two-dimensional regression. The value of the regression function

f(x0) at x0 can be estimated with the rate n−β/(2β+2) suggested by Theorem14.4 for d = 2. A legitimate question at this point is whether it is possibleto estimate f(x0) with a faster rate exploiting the specific structure of themodel. In particular, is it possible to attain the one-dimensional rate ofconvergence n−β/(2β+1)? As the following proposition shows, the answer tothis question is affirmative.

Proposition 14.5. In the additive regression model (14.4)-(14.6) at any

point x0 = (i0/m, j0/m), there exists an estimator fn(x0) such that

supf1,f2 ∈Θ(β,L,L1)

Ef

[ (fn(x0) − f(x0)

)2 ]≤ r∗ n−2β/(2β+1)

Page 209: Mathematical Statistics - 213.230.96.51:8090

198 14. Dimension and Structure in Nonparametric Regression

for all large enough n. Here a constant r∗ is independent of n.

Proof. Select the bandwidth h∗n = n−1/(2β+1) as if the model were one-dimensional. Consider the set of indices

In = In(i0/m) ={i : |i/m− i0/m| ≤ h∗n

}.

The number N of indices in the set In is equal to N = |In| = 2�mh∗n�+ 1 .Note that mh∗n =

√nn−1/(2β+1) → ∞ , and hence N ∼ 2mh∗n as n → ∞.

To estimate f1 at i0/m, consider the means

yi · =1

m

m∑

j=1

yij =1

m

m∑

j=1

[f1(i/m) + f2(j/m)

]+

1

m

m∑

j=1

εij

(14.7) = f1(i/m) + δn +1√m

εi, i ∈ In,

where the deterministic error

δn =1

m

m∑

j=1

f2(j/m)

is the Riemann sum for the integral∫ 10 f2(t) dt = 0, and the random variables

εi =1√m

m∑

j=1

εij ∼ N (0, σ2)

are independent for different i ∈ In.

Applying (14.6), we find that

|δn| ≤ L0/m = L0/√n = o

((h∗n)

β)

as n → ∞with a Lipschitz constant L0 = max |f ′

2| of any function f2 ∈ Θ(β, L, L1).Thus, we have a one-dimensional regression problem with N observationsin the bin centered at i0/m. Applying the one-dimensional local polynomialapproximation (see Lemma 9.3) to the means (14.7), we can estimate f1 atthe point i0/m with the deterministic error not exceeding

Cb(h∗n)

β + |δn| = Cb(h∗n)

β(1 + o(1)

).

The stochastic error of this estimator is normal with the zero expectationand the variance which is not larger than

Cv

N

( σ√m

)2∼ Cv

2mh∗n

( σ√m

)2=

Cvσ2

2nh∗n.

The constants Cb and Cv are independent of n. Here (σ/√m)2 = σ2/m =

σ2/√n represents the variance of the stochastic error of the means (14.7). So,

the one-dimensional balance between the deterministic and stochastic errorsholds, and the one-dimensional rate of convergence (h∗n)

β is guaranteed.

Page 210: Mathematical Statistics - 213.230.96.51:8090

14.3. Single-Index Model 199

Similarly, we can estimate f2 at j0/m with the same one-dimensionalrate. �

Remark 14.6. In this section, we considered the simplest version of theadditive regression model. In more general settings, the design may be de-terministic or random, and the dimension may be any positive integer. Theeffect, however, is still the same: the one-dimensional rate of convergenceis attainable. Clearly, this rate is minimax since in any higher dimension,this rate cannot be improved for the subset of one-dimensional regressionfunctions. �

Remark 14.7. Traditionally the additive regression model includes a con-stant intercept f0. That is, the regression function has the form

(14.8) f(x(1), x(2)) = f0 + f1(x(1)) + f2(x

(2)).

For simplicity, it was omitted in the model considered in this section. Toestimate f0, we could split the sample of observations into two sub-samplesof sizes n/2 each, but this would destroy the regularity of the design. Itis more convenient to consider the model with two independent repeatedobservations yij and yij at the design knots (i/m , j/m). Then we can usethe second set of observations to estimate the intercept,

(14.9) f0 =1

n

m∑

i,j=1

yij .

Now we can replace the observations yij in (14.4) by yij − f0, and use theseshifted observations to estimate f1 and f2 as done above. Then the statementof Proposition 14.5 would stay unchanged (see Exercise 14.93). �

14.3. Single-Index Model

14.3.1. Definition. The additive regression model of the previous sectionprovides an example of a specific structure of the nonparametric regressionfunction. In this section we give another example, known as a single-indexmodel. This name unites a variety of models. We present here a versionthat is less technical.

Consider a two-dimensional regression model with the equidistant designin the unit square [0, 1]2. It is convenient to study the model with twoindependent repeated observations at every design knot xij = (i/m, j/m),

(14.10) yij = f(i/m, j/m) + εij and yij = f(i/m, j/m) + εij

where 1 ≤ i, j ≤ m, and m =√n is assumed to be an integer (cf. Remark

14.7). The random variables εij and εij are independent N (0, σ2).

Page 211: Mathematical Statistics - 213.230.96.51:8090

200 14. Dimension and Structure in Nonparametric Regression

The structure of the regression function f is imposed by the assumptionthat there exist a Holder function g = g(t) and an angle α such that

(14.11) f(i/m, j/m) = g((i/m) cosα+ (j/m) sinα

), 1 ≤ i, j ≤ m.

We will suppose that 0 ≤ α ≤ π/4. The restrictions on the function g aremore elaborate and are discussed below.

Let β ≥ 2 be an integer, and let g∗ be a positive number. Assumethat Θ(β, L, L1) is the class of Holder functions g = g(t) in the interval0 ≤ t ≤

√2, and let

Θ(β, L, L1, g∗) = Θ(β, L, L1) ∩ { g′(t) ≥ g∗ }

be a sub-class of functions the first derivative of which exceeds g∗.

Introduce a class of functions H = H(β, L, L1, g∗) in the unit square

0 ≤ x(1), x(2) ≤ 1 by

H ={f = f(x(1) , x(2)) = g

(x(1) cosα+ x(2) sinα

),

0 ≤ α ≤ π/4 , g ∈ Θ(β, L, L1, g∗)}.

This class is well defined because the variable t = x(1) cosα + x(2) sinαbelongs to the interval 0 ≤ t ≤

√2. The functions in H, if rotated at a

proper angle, depend on a single variable, and are monotonically increasingin the corresponding direction. The point

(14.12) tij = (i/m) cosα+ (j/m) sinα

is the projection (show!) of xij = (i/m, j/m) onto the straight line pass-ing through the origin at the angle α (see Figure 11). If we knew α, wecould compute the projections tij , and the problem would become one-dimensional.

x(1)

x(2)

0

α

xij

1

1

tij

Figure 11. Projection of the design knot xij on the line passing throughthe origin at the angle α.

Page 212: Mathematical Statistics - 213.230.96.51:8090

14.3. Single-Index Model 201

Let x0 = (i0/m, j0/m) be a fixed point. Our objective is to estimatethe value f(x0) of the regression function at this point. Clearly, we canlook at the observations (14.10) as the observations of the two-dimensionalregression. The results of Section 14.1 would guarantee the minimax rate ofestimation n−2β/(2β+2). Can this rate be improved to the one-dimensionalrate n−2β/(2β+1)? The answer is positive, and the algorithm is simple. First,estimate α by αn, and then plug αn into the projection formula (14.12) forthe one-dimensional design points,

tij = (i/m) cos αn + (j/m) sin αn.

The two-sample model of observations (14.10) is convenient, because thefirst sample is used to estimate α, while the second one serves to estimatethe regression function itself. We could work with one sample of size n,and split it into two independent sub-samples, but this would result in lessregular designs.

14.3.2. Estimation of Angle. To estimate α, note that at any point(x(1), x(2)) ∈ [0, 1]2, the partial derivatives of the regression function areproportional to cosα and sinα, respectively,

(14.13)∂f(x(1), x(2))

∂x(1)= g′(x(1) cosα+ x(2) sinα) cosα

and

(14.14)∂f(x(1), x(2))

∂x(2)= g′(x(1) cosα+ x(2) sinα) sinα.

If we integrate the left-hand sides of (14.13) and (14.14) over the square[0, 1]2, we obtain the integral functionals that we may try to estimate. Un-fortunately, these are functionals of partial derivatives of f , not of f itself.However, we can turn these functionals into the functionals of f if we inte-grate by parts.

Choose any function ϕ = ϕ(x(1), x(2)) , (x(1), x(2)) ∈ [0, 1]2. Assumethat ϕ is non-negative and very smooth, for example, infinitely differentiable.Assume also that it is equal to zero identically on the boundary of the unitsquare,

(14.15) ϕ(x(1), x(2)) = 0 for (x(1), x(2)) ∈ ∂[0, 1]2.

Multiplying the left-hand sides of (14.13) and (14.14) by ϕ and integratingby parts over [0, 1]2, we obtain an integral functional of f ,

(14.16) Φl(f) =

∫ 1

0

∫ 1

0ϕ(x(1), x(2))

∂f(x(1), x(2))

∂x(l)dx(1) dx(2)

(14.17) =

∫ 1

0

∫ 1

0wl(x

(1), x(2))f(x(1), x(2)) dx(1) dx(2)

Page 213: Mathematical Statistics - 213.230.96.51:8090

202 14. Dimension and Structure in Nonparametric Regression

where wl are the weight functions

wl(x(1), x(2)) = − ∂ϕ(x(1), x(2))

∂x(l), l = 1 or 2.

The outside-of-the-integral term in (14.17) vanishes due to the boundarycondition (14.15).

Thus, (14.13) and (14.14) along with (14.17) yield the equations

(14.18) Φ1 = Φ1(f) = Φ0 cosα and Φ2 = Φ2(f) = Φ0 sinα

with

Φ0 = Φ0(f) =

∫ 1

0

∫ 1

0ϕ(x(1), x(2))g′(x(1) cosα+ x(2) sinα) dx(1) dx(2).

Under our assumptions, uniformly in f ∈ H, the values of Φ0(f) are sepa-rated away from zero by some strictly positive constant,

(14.19) Φ0(f) ≥ Φ∗ > 0.

Now, given the values of the functionals Φ1 and Φ2, we can restore theangle α from the equation α = arctan

(Φ2/Φ1

). Define the estimators of

these functionals by

(14.20) Φ(l)n =

1

n

m∑

i,j=1

wl(i/m, j/m) yij , l = 1 or 2.

Then we can estimate the angle α by

αn = arctan(Φ(2)n /Φ(1)

n

).

Note that the ratio Φ(2)n /Φ

(1)n can be however large, positive or negative.

Thus, the range of αn runs from−π/2 to π/2, whereas the range of the true αis [0, π/4]. Next, we want to show that the values of αn outside of the interval[0, π/4] are possible only due to the large deviations, and the probability ofthis event is negligible if n is large. As the following proposition shows, theestimator αn is

√n-consistent with rapidly decreasing probabilities of large

deviations. The proof of this proposition is postponed to the next section.

Proposition 14.8. There exist positive constants a0, c0, and c1, indepen-dent of f and n, such that for any x, c0 ≤ x ≤ c1

√n, the following inequality

holds:

Pf

( ∣∣αn − α∣∣ > x/

√n)≤ 4 exp{−a0x

2}.

Page 214: Mathematical Statistics - 213.230.96.51:8090

14.3. Single-Index Model 203

14.3.3. Estimation of Regression Function. We use the second sampleyij of the observations in (14.10) to estimate the regression function f(x0) atthe given knot x0 = (i0/m, j0/m). Recall that tij , as introduced in (14.12),is the projection of (i/m, j/m) onto the line determined by the true angle α .Denote by tij the projection of (i/m, j/m) onto the line determined by theestimated angle αn, and let uij be the projection in the orthogonal directiongiven by the angle αn + π/2, that is,

tij = (i/m) cos αn + (j/m) sin αn,

and

uij = −(i/m) sin αn + (j/m) cos αn.

Let the respective projections of the fixed point x0 = (i0/m, j0/m) be de-noted by t0 and u0. Introduce T , a rectangle in the new coordinates (seeFigure 12),

T ={(t, u) : |t− t0| ≤ h∗n, |u− u0| ≤ H

}

where h∗n = n−1/(2β+1) and H is a constant independent of n and so smallthat T ⊂ [0, 1]2.

��

x(1)

x(2)

0αn

1

1

t0u0

x0 2H

2h∗n T

t

u

Figure 12. Rectangle T in the coordinate system rotated by the angle αn.

Proposition 14.9. For any design knot xij = (i/m, j/m) ∈ T , the obser-vation yij in (14.10) admits the representation

yij = g(tij) + ρij + εij , 1 ≤ i, j ≤ m,

with the remainder term ρij being independent of the random variable εij,and satisfying the inequality

max1≤i,j≤m

|ρij| ≤ 2L0|αn − α|

where L0 = max |g′| is the Lipschitz constant of any g ∈ Θ(β, L, L1, g∗).

Page 215: Mathematical Statistics - 213.230.96.51:8090

204 14. Dimension and Structure in Nonparametric Regression

Proof. Put ρij = g(tij)− g(tij). By definition, ρij depends only on the firstsub-sample yij of observations in (14.10), and hence is independent of εij .

We have yij = g(tij) + ρij + εij . For any knot (i/m, j/m), we obtain

|ρij| = |g(tij)− g(tij)|=∣∣ g((i/m) cosα+ (j/m) sinα

)− g

((i/m) cos αn + (j/m) sin αn

) ∣∣

≤ L0

((i/m)

∣∣ cos αn − cosα∣∣ + (j/m)

∣∣ sin αn − sinα∣∣ )

≤ L0

(i/m + j/m

) ∣∣ αn − α∣∣ ≤ 2L0

∣∣ αn − α∣∣. �

Further, let f(x0) be the local polynomial approximation obtained fromthe observations yij at the design points tij where (i/m, j/m) ∈ T . It means

that f(x0) = θ0 where θ0 is the least-squares estimator of the intercept. Itcan be obtained as a partial solution to the minimization problem

(i/m,j/m)∈T

(yij −

[θ0 + θ1

tij − t0h∗n

+ · · ·+ θβ−1

( tij − t0h∗n

)β−1])2→ min

θ0,...,θβ−1

.

To analyze this minimization problem, we have to verify the Assump-tions 9.2 and 9.5 on the system of normal equations associated with it.Denote by N(T ) the number of design knots in T , and let G′G be thematrix of the system of normal equations with the elements

(G′G

)k, k

=∑

(i/m, j/m)∈T

( tij − t0h∗n

)k+l, k, l = 0, . . . , β − 1.

Note that the number of design knots N(T ), and the elements of the matrixG′G are random because they depend on the estimator αn.

Lemma 14.10. (i) Uniformly in αn, the number of design knots N(T ) sat-isfies

limn→∞

N(T )

4Hnh∗n= 1.

(ii) The normalized elements of the matrix G′G have the limits

limn→∞

1

N(T )

(G′G

)k, l

=1− (−1)k+l+1

2(k + l + 1),

and the limiting matrix is invertible.

Proof. See the next section. �

Theorem 14.11. The estimator f(x0) = θ0 has the one-dimensional rateof convergence on H,

supf∈H

n2β/(2β+1)Ef

[ (f(x0) − f(x0)

)2 ] ≤ r∗

for all sufficiently large n with a constant r∗ independent of n.

Page 216: Mathematical Statistics - 213.230.96.51:8090

14.3. Single-Index Model 205

Proof. Proposition 14.9 and Lemma 14.10 allow us to apply the expansionsimilar to Proposition 9.4. In the case under consideration, this expansiontakes the form

(14.21) θ0 = f(x0) + b0 + N0.

Conditionally on the first sub-sample of observations in (14.10), the biasterm b0 admits the bound

| b0 | ≤ Cb(h∗n)

β + Ca max1≤i, j≤m

|ρij | ≤ Cb(h∗n)

β + 2L0Ca|αn − α|

where the constants Ca and Cb are independent of n. The stochastic term N0

on the right-hand side of (14.21) is a zero-mean normal random variable withthe conditional variance that does not exceed Cv/N(T ) ≤ Cv/(2Hnh∗n).Hence, uniformly in f ∈ H, we have that

Ef

[ (f(x0) − f(x0)

)2 ]= Ef

[ (b0 + N0

)2 ]

≤ 2Ef

[b20]+ 2Ef

[N 2

0

]

≤ 2Ef

[ (Cb(h

∗n)

β + 2L0Ca|αn − α|)2 ]

+ 2Ef

[N 2

0

]

≤ 4C2b (h

∗n)

2β + 16L20C

2a Ef

[(αn − α)2

]+ 2Cv/(2Hnh∗n).

Note that with probability 1, |αn − α| ≤ π. From Proposition 14.8, we canestimate the latter expectation by

Ef

[(αn − α)2

]=

∫ ∞

0z2 dPf

(|αn − α| ≤ z

)

≤(c0/

√n)2

+

∫ c1

c0/√nz2 dPf

(|αn − α| ≤ z

)+

∫ π

c1

z2 dPf

(|αn − α| ≤ z

)

≤ c20/n + c20/n + 2

∫ c1

c0/√nz dPf

(|αn − α| > z

)+ π2

Pf

(|αn − α| > c1

)

≤ 2c20/n + 4

∫ ∞

0exp{−a0nz

2} d(z2) + 4π2 exp{−a0nc21}

≤ 2c20/n + 4/(a0n) + 4π2 exp{−a0nc21} ≤ C1/n

for some positive constant C1 and for all large enough n. Thus,

Ef

[ (f(x0)− f(x0)

)2 ]

≤ 4C2b (h

∗n)

2β + C1/n + 2Cv/(2Hnh∗n) = O((h∗n)

2β).

Here we used the facts that (h∗n)2β = (nh∗n)

−1, and C1/n = o((h∗n)

2β)as

n → ∞. �

Page 217: Mathematical Statistics - 213.230.96.51:8090

206 14. Dimension and Structure in Nonparametric Regression

14.4. Proofs of Technical Results

To prove Proposition 14.8, we need two lemmas.

Lemma 14.12. For any n, the estimator Φ(l)n given by (14.20) of the func-

tionals Φl(f) defined in (14.16) admits the representation

Φ(l)n = Φl(f) + ρ(l)n (f)/

√n + η(l)n /

√n , l = 1 or 2,

where the deterministic remainder term is bounded by a constant∣∣ ρ(l)n (f)

∣∣ ≤

Cρ, and the random variables η(l)n are zero-mean normal with the variances

bounded from above, Varf[η(l)n

]≤ Cv. The constants Cρ and Cv are inde-

pendent of n and f.

Proof. Recall that m =√n is assumed to be an integer. Note that

Φ(l)n =

1

n

m∑

i, j=1

wl(i/m, j/m) f(i/m, j/m) +1

n

m∑

i, j=1

wl(i/m, j/m) εij

=m∑

i, j=1

∫ i/m

(i−1)/m

∫ j/m

(j−1)/mwl(x1, x2) f(x1, x2) dx2 dx1 +

ρ(l)n√n

+η(l)n√n

where

ρ(l)n =√n

m∑

i, j=1

∫ i/m

(i−1)/m

∫ j/m

(j−1)/m

[wl(i/m, j/m) f(i/m, j/m)

−wl(x1, x2) f(x1, x2)]dx2 dx1

and

η(l)n =1√n

m∑

i, j=1

wl(i/m, j/m) εij .

The variance of the normal random variable η(l)n is equal to

Var[η(l)n

]=

σ2

n

m∑

i, j=1

w2l (i/m, j/m)

−→n→∞

σ2

∫ 1

0

∫ 1

0w2l (x1, x2) dx2 dx1 < C2

v < ∞.

The deterministic remainder term ρ(l)n admits the upper bound

∣∣ ρ(l)n

∣∣ ≤ L0m

m∑

i, j=1

∫ i/m

(i−1)/m

∫ j/m

(j−1)/m

( ∣∣x1 − i/m∣∣ +

∣∣x2 − j/m

∣∣ ) dx2 dx1

= L0mm∑

i, j=1

1

m3= L0

Page 218: Mathematical Statistics - 213.230.96.51:8090

14.4. Proofs of Technical Results 207

where L0 = max |(wl f)′| is the Lipschitz constant of the product wl f. �

Lemma 14.13. Let Φ∗, Φ(1)n and Φ

(2)n be as defined in (14.19) and (14.20).

If y satisfies the conditions

max(Cρ, Cv) ≤ y ≤ (4√2)−1Φ∗

√n,

then for all sufficiently large n, and any f ∈ H , the following inequalityholds:

Pf

( ∣∣ Φ(2)n /Φ(1)

n − tanα∣∣ ≤ 12 y

Φ∗√n

)≥ 1 − 4 exp

{− y2

2C2v

}.

Proof. From Lemma 14.12, the random variable η(l)n , l = 1 or 2, is a zero-

mean normal with the variance satisfying Varf[η(l)n

]≤ Cv. Therefore, if

y ≥ Cv, then uniformly in f ∈ H, we have

Pf

(|η(l)n | > y

)≤ 2 exp

{− y2/(2C2

v )}, l = 1 or 2.

Hence, with the probability at least 1− 4 exp{−y2/(2C2

v )}, we can assume

that |η(l)n | ≤ y for both l = 1 and 2 simultaneously. Under these conditions,in view of (14.18) and Lemma 14.12, we obtain that

∣∣ Φ(2)

n /Φ(1)n − tan α

∣∣ =

∣∣∣∣cosα(ρ

(2)n + η

(2)n )/

√n − sinα(ρ

(1)n + η

(1)n )/

√n

cosα(Φ0(f) cosα+ (ρ

(1)n + η

(1)n )/

√n)

∣∣∣∣

≤∣∣∣∣

√2(cosα+ sinα)(Cρ + y)

Φ∗√

n/2− (Cρ + y)

∣∣∣∣ ≤∣∣∣∣

2(Cρ + y)

Φ∗√

n/2− (Cρ + y)

∣∣∣∣

where we used the facts that cosα ≥ 1/√2 since 0 ≤ α ≤ π/4, and, at the

last step, that sinα+ cosα ≤√2.

Further, by our assumption, Cρ ≤ y and 2y ≤ (1/2)Φ∗√n/2, therefore,

∣∣ Φ(2)

n /Φ(1)n − tanα

∣∣ ≤∣∣∣∣

4y

Φ∗√

n/2− 2y

∣∣∣∣ ≤

8y

Φ∗√

n/2≤ 12 y

Φ∗√n. �

Proof of Proposition 14.8. In Lemma 14.13, put y = (Φ∗/12)x. Thenthe restrictions on y in this lemma turn into the bounds for x, c0 ≤ x ≤c1√n, where c0 = max(Cρ, Cv)(12/Φ∗) and c1 = 2

√2. If we take a0 =

(Φ∗/12)2/(2C2v ), then

Pf

( ∣∣ Φ(2)

n /Φ(1)n − tanα

∣∣ > x/√n)

≤ 4 exp{− a0x

2}.

Note that if |αn − α| > x/√n, then

| tan αn − tanα| = (cosα∗)−2 |αn − α| ≥ |αn − α| > x/

√n

Page 219: Mathematical Statistics - 213.230.96.51:8090

208 14. Dimension and Structure in Nonparametric Regression

where we applied the mean value theorem with some α∗ between αn and α.Thus,

Pf

(|αn − α| > x/

√n)

≤ Pf

( ∣∣ Φ(2)n /Φ(1)

n − tanα∣∣ > x/

√n)

≤ 4 exp{− a0x

2}. �

Proof of Lemma 14.10. (i) For every design knot (i/m, j/m), considera square, which we call pixel, [(i − 1)/m, i/m] × [(j − 1)/m, j/m]. Let T∗be the union of the pixels that lie strictly inside T , and let T ∗ be theminimum union of the pixels that contain T , that is, the union of pixelswhich intersections with T are non-empty. The diameter of each pixel is√2/m =

√2/n, and its area is equal to 1/n. That is why the number N(T∗)

of the pixels in T∗ is no less than 4n(H−√

2/n)(h∗n−√

2/n), and the number

N(T ∗) of the pixels in T ∗ does not exceed 4n(H+√

2/n)(h∗n+√

2/n). Since1/

√n = o(h∗n), we find that

1 ≤ lim infn→∞

N(T∗)4Hnh∗n

≤ lim supn→∞

N(T ∗)

4Hnh∗n≤ 1.

Due to the inequalities

N(T∗) ≤ N(T ) ≤ N(T ∗),

we can apply the squeezing theorem to conclude that the variable N(T ) alsohas the same limit,

limn→∞

N(T )

4Hnh∗n= 1.

(ii) As n → ∞ , for any k, l = 0, . . . , β − 1, we have that

1

N(T )

(G′G

)k, l

∼ 1

4Hnh∗n

(i/m, j/m)∈T

( ti j − t0h∗n

)k+l

∼ 1

4Hh∗n

∫ H

−H

∫ t0+h∗n

t0−h∗n

( t− t0h∗n

)k+ldt du =

1

2

∫ 1

−1zk+l dz =

1− (−1)k+l+1

2(k + l + 1).

The respective limiting matrix is invertible (see Exercise 9.66). �

Page 220: Mathematical Statistics - 213.230.96.51:8090

Exercises 209

Exercises

Exercise 14.92. Prove that the number of monomials of degree i in d-variables is equal to

(i+d−1i

).

Exercise 14.93. Show that in the additive regression model with the in-tercept (14.8), the preliminary estimator (14.9) and the shifted observations

yij − f0 allow us to prove the one-dimensional rate of convergence of thenonparametric components f1 and f2 as in Proposition 14.5.

Exercise 14.94. Let β1 and β2 be two positive integers, β1 = β2. A functionin two variables f(x), x = (x(1), x(2)), is said to belong to the anisotropicHolder class of functions Θ(β1, β2) = Θ(β1, β2, L, L1), if f is bounded by

L1, and if for any point x0 = (x(1)0 , x

(2)0 ), there exists a polynomial p(x) =

p(x,x0, f) such that

|f(x) − p(x, x0, f)| ≤ L(|x(1) − x

(1)0 |β1 + |x(2) − x

(2)0 |β2

)

where we assume that x and x0 belong to the unit square.

Suppose we want to estimate the value of the regression function f at agiven design knot (i0/m, j0/m) from the observations

yij = f(i/m, j/m) + εij , i, j = 1, . . . ,m, m =√n.

Show that if the regression function belongs to the anisotropic class

Θ(β1, β2), then there exists an estimator with the convergence rate n−β/(2β+1)

where β−1 = β−11 +β−1

2 . Hint: Consider a local polynomial estimator in the

bin with the sides h1 and h2 satisfying hβ11 = hβ2

2 . Show that the bias is

O(hβ11 ), and the variance is O

((nh1h2)

−1). Now use the balance equation

to find the rate of convergence.

Page 221: Mathematical Statistics - 213.230.96.51:8090
Page 222: Mathematical Statistics - 213.230.96.51:8090

Chapter 15

Adaptive Estimation

In Chapters 8-11, we studied a variety of nonparametric regression estima-tion problems and found the minimax rates of convergence for different lossfunctions. These rates of convergence depend essentially on the parameterof smoothness β. This parameter determines the choice of the optimal band-width. An estimator which is minimax optimal for one smoothness does notpreserve this property for another smoothness. The problem of adaptationconsists of finding,if possible, an adaptive estimator which is independent ofa particular β and is simultaneously minimax optimal over different non-parametric classes.

In this chapter, we will give examples of problems where the adaptive es-timators exist in the sense that over each class of smoothness, the regressionfunction can be estimated as if the smoothness parameter were known. Westart, however, with a counterexample of an estimation problem in whichthe minimax rates are not attainable.

15.1. Adaptive Rate at a Point. Lower Bound

Consider regression observations on [0, 1],

yi = f(i/n) + εi, i = 1, . . . , n,

where εi are standard normal random variables. Since the design is not thefocus of our current interest, we work with the simplest equidistant design.

We assume that the smoothness parameter can take on only two valuesβ1 and β2 such that 1 ≤ β1 < β2. Thus, we assume that the regression func-tion f belongs to one of the two Holder classes, either Θ(β1) = Θ(β1, L, L1)or Θ(β2) = Θ(β2, L, L1).

211

Page 223: Mathematical Statistics - 213.230.96.51:8090

212 15. Adaptive Estimation

Let x0 = i0/n be a given point in (0, 1) which coincides with a design

knot. We want to estimate f(x0) by a single estimator fn with the property

that if f ∈ Θ(β1), then the rate of convergence is n−β1/(2β1+1), while if

f ∈ Θ(β2), then the rate of convergence is n−β2/(2β2+1). Whether the truevalue of the smoothness parameter is β1 or β2 is unknown. The estimatorfn may depend on both β1 and β2 but the knowledge of the true β cannotbe assumed.

Formally, we introduce an adaptive risk of an estimator fn by

(15.1) AR(fn) = maxβ∈{β1, β2}

supf ∈Θ(β)

Ef

[n2β/(2β+1)

(fn − f(x0)

)2 ].

The question we want to answer is whether there exists an estimator fnsuch that AR(fn) ≤ r∗ for some constant r∗ < ∞ independent of n. Theobjective of this section is to demonstrate that such an estimator does notexist.

First, we define a class of estimators. For two given constants, A > 0and a such that β1/(2β1 + 1) < a ≤ β2/(2β2 + 1), we introduce a classof estimators that are minimax optimal or sub-optimal over the regressionfunctions of the higher smoothness,

F = F(A, a) ={fn : sup

f ∈Θ(β2)Ef

[n2a(fn − f(x0)

)2 ] ≤ A}.

As the following proposition claims, the estimators that belong to F cannotattain the minimax rate of convergence on the lesser smoothness of regressionfunctions.

Proposition 15.1. There exists a constant r∗ = r∗(A, a) independent of n

such that for any estimator fn ∈ F(A, a), the following lower bound holds:

(15.2) supf ∈Θ(β1)

Ef

[ (n/ lnn

)2β1/(2β1+1)(fn − f(x0)

)2 ] ≥ r∗ > 0.

Two important corollaries follow immediately from this result.

Corollary 15.2. The adaptive risk AR(fn) in (15.1) is unbounded for any

estimator fn. Indeed, take a = β2/(2β2 + 1) in the definition of the classF(A, a). Then we have that

supf ∈Θ(β2)

Ef

[n2β2/(2β2+1)

(fn − f(x0)

)2 ] ≤ A.

From Proposition 15.1, however, for all large n,

supf ∈Θ(β1)

Ef

[n2β1/(2β1+1)

(fn − f(x0)

)2 ] ≥ r∗(lnn)2β1/(2β1+1)

with the right-hand side growing unboundedly as n → ∞. Thus, the adaptiverisk, being the maximum of the two supremums, is unbounded as well.

Page 224: Mathematical Statistics - 213.230.96.51:8090

15.1. Adaptive Rate at a Point. Lower Bound 213

Corollary 15.3. The contrapositive statement of Proposition 15.1 is valid.It can be formulated as follows. Assume that there exists an estimator fn thatguarantees the minimax rate over the Holder class of the lesser smoothness,

supf ∈Θ(β1)

Ef

[n2β1/(2β1+1)

(fn − f(x0)

)2 ] ≤ r∗

with a constant r∗ independent of n. Then this estimator does not belong toF(A, a) for any a and A. As a consequence, from the definition of F(A, a)with a = β2/(2β2 + 1), we find that

supf ∈Θ(β2)

Ef

[n2β2/(2β2+1)

(fn − f(x0)

)2 ] → ∞ as n → ∞.

As Corollaries 15.2 and 15.3 explain, there is no adaptive estimator of aregression at a point. By this we mean that we cannot estimate a regressionfunction at a point as if its smoothness were known.

Define a sequence

(15.3) ψn = ψn(f) =

{(n/(lnn)

)−β1/(2β1+1)if f ∈ Θ(β1),

n− β2/(2β2+1) if f ∈ Θ(β2).

The next question we ask about the adaptive estimation of f(x0) iswhether it can be estimated with the rate ψn(f). The answer to this questionis positive. We leave it as an exercise (see Exercise 15.95).

The rest of this section is devoted to the proof of Proposition 15.1.

Proof of Proposition 15.1. Define two test functions f0(x) = 0 and

f1(x) = hβ1n ϕ( x− x0

hn

)with hn =

( c lnn

n

)1/(2β1+1).

The choice of constant c will be made below. This definition is explainedin detail in the proof of Theorem 9.16. In particular, f1 ∈ Θ(β1, L, L1) forsome constants L and L1.

Choose a constant a0 such that β1/(2β1+1) < a0 < a. Define a sequence

un = n2a0h2β1n = n2a0

( c lnn

n

)2β1/(2β1+1)

(15.4) =(c lnn

)2β1/(2β1+1)n2[a0−β1/(2β1+1)] ≥ n2[a0−β1/(2β1+1)]

for any fixed c and all large enough n, so that un → ∞ at the power rateas n → ∞. Take an arbitrarily small δ such that δ < ϕ2(0)/4. Note that if

fn ∈ F(A, a), then for all sufficiently large n, we have

unEf0

[h−2β1n f2

n

]= unEf0

[u−1n n2a0 f2

n

]

= n−2(a−a0)Ef0

[n2af2

n

]≤ n−2(a−a0)A < δ.

Page 225: Mathematical Statistics - 213.230.96.51:8090

214 15. Adaptive Estimation

Thus, unEf0

[h−2β1n f2

n

]− δ < 0 . Put Tn = h−β1

n fn. We obtain

supf ∈Θ(β1)

Ef

[h−2β1n

(fn − f(x0)

)2 ] ≥ Ef1

[h−2β1n

(fn − f1(x0)

)2 ]

≥ Ef1

[h−2β1n

(fn − f1(x0)

)2 ]+ unEf0

[h−2β1n f2

n

]− δ

= Ef1

[(Tn − ϕ(0))2

]+ unEf0

[T 2n

]− δ.

Finally, we want to show that the right-hand side is separated away fromzero by a positive constant independent of n. Introduce the likelihood ratio

Λn =dPf0

dPf1

= exp{−

n∑

i=1

f1(i/n)ξi −1

2

n∑

i=1

f21 (i/n)

}

where ξi = yi − f1(i/n), i = 1, . . . , n, are independent standard normal ran-dom variables with respect to Pf1-distribution. As in the proof of Theorem9.16, we get

σ2n =

n∑

i=1

f21 (i/n) = ‖ϕ‖22 nh2β1+1

n

(1 + on(1)

)= ‖ϕ‖22 (c lnn)

(1 + on(1)

)

where on(1) → 0 as n → ∞. Introduce a standard normal random variableZn = σ−1

n

∑ni=1 f1(i/n) ξi. Then the likelihood ratio takes the form

Λn = exp{− σn Zn − 1

2‖ϕ‖22 (c lnn)

(1 + on(1)

)}.

Note that if the random event {Zn ≤ 0 } holds, then

Λn ≥ exp{− 1

2‖ϕ‖22 (c lnn)

(1 + on(1)

)}≥ n−c1 ,

for all large n, where c1 = c ‖ϕ‖22. From the definition of the likelihood ratio,we obtain the lower bound

supf ∈Θ(β1)

Ef

[h−2β1n

(fn − f(x0)

)2 ] ≥ Ef1

[(Tn − ϕ(0))2 + un Λn T

2n

]− δ

≥ Ef1

[(Tn − ϕ(0))2 + un n

−c1 T 2n I(Zn ≤ 0)

]− δ.

Now we choose c so small that c1 = c ‖ϕ‖22 < 2(a0 − β1/(2β1 + 1)

). Then,

by (15.4), un n−c1 increases and exceeds 1 if n is sufficiently large. Hence,

supf ∈Θ(β1)

Ef

[h−2β1n

(fn−f(x0)

)2 ] ≥ Ef1

[(Tn−ϕ(0))2 + T 2

n I(Zn ≤ 0)]− δ

≥ Ef1

[I(Zn ≤ 0)

((Tn − ϕ(0))2 + T 2

n

) ]− δ

≥ Ef1

[I(Zn ≤ 0)

((−ϕ(0)/2)2 + (ϕ(0)/2)2

) ]− δ

≥ 1

2ϕ2(0)Pf1

(Zn ≤ 0

)− δ =

1

4ϕ2(0) − δ = r∗

Page 226: Mathematical Statistics - 213.230.96.51:8090

15.2. Adaptive Estimator in the Sup-Norm 215

where r∗ is strictly positive because, under our choice, δ < ϕ2(0) / 4. �

15.2. Adaptive Estimator in the Sup-Norm

In this section we present a result on adaptive estimation in the sup-norm.We will show that for the sup-norm losses, the adaptation is possible in thestraightforward sense: the minimax rates are attainable as if the smoothnessparameter were known.

First, we modify the definition (15.1) of the adaptive risk to reflect thesup-norm loss function,

(15.5) AR∞(fn) = maxβ ∈{β1, β2}

supf ∈Θ(β)

( n

lnn

)β/(2β+1)Ef

[‖fn − f ‖∞

].

Since the log-factor is intrinsic to the sup-norm rates of convergence,there is no need to prove the lower bound. All we have to do is to definean adaptive estimator. As in the previous section, we proceed with theequidistant design and the standard normal errors in the regression model,

yi = f(i/n) + εi, εiiid∼ N (0, 1), i = 1, . . . , n.

Define f∗n, β1

and f∗n, β2

as the rate-optimal estimators in the sup-norm

over the classes Θ(β1) and Θ(β2), respectively. Each estimator is based

on the regressogram with the bandwidth h∗n, β =((lnn)/n

)1/(2β+1), β ∈

{β1, β2} (see Section 10.3.)

Now we introduce an adaptive estimator,

(15.6) fn =

{f∗n, β1

if ‖f∗n, β1

− f∗n, β2

‖∞ ≥ C(h∗n, β1

)β1 ,

f∗n, β2

otherwise,

where C is a sufficiently large constant which depends only on β1 and β2.The final choice of C is made below.

Our starting point is the inequality (10.10). Since the notations of thecurrent section are a little bit different, we rewrite this inequality in theform convenient for reference,

(15.7)∥∥ f∗

n, β − f∥∥∞ ≤ Ab (h

∗n, β)

β + Av

(nh∗n, β

)−1/2Z∗β, β ∈ {β1, β2},

where Ab and Av are constants independent of n. Using the notation definedin Section 10.3, we show that the distribution of Z∗

β has fast-decreasing tailprobabilities,

P(Z∗β ≥ y

√2β2 lnn

)≤ P

( Q⋃

q=1

β−1⋃

m=,0

∣∣Zm, q

∣∣ ≥ y√2 lnn

)

≤ QβP(|Z | ≥ y

√2 lnn

)

Page 227: Mathematical Statistics - 213.230.96.51:8090

216 15. Adaptive Estimation

where Z ∼ N (0, 1). Now, since P(|Z| ≥ x) ≤ exp{−x2/2} for any x ≥ 1, wearrive at the upper bound

(15.8) P(Z∗β ≥ y

√2β2 lnn

)≤ Qβn−y2 ≤ β n−(y2−1).

Here we have used the fact that the number of bins Q = 1/(2h∗n, β) ≤ n forall large enough n.

Theorem 15.4. There exists a constant C in the definition (15.6) of the

adaptive estimator fn such that the adaptive risk AR∞(fn) specified by (15.5)satisfies the upper bound

(15.9) AR∞(fn) ≤ r∗

with a constant r∗ independent of n.

Proof. Denote the random event in the definition of the adaptive estimatorfn by

C ={‖ f∗

n, β1− f∗

n, β2‖∞ ≥ C

(h∗n, β1

)β1}.

If f ∈ Θ(β1), then(n/(lnn)

)β1/(2β1+1)Ef

[‖fn − f‖∞

]

(15.10)

≤(h∗n,β1

)−β1Ef

[‖f∗

n,β1− f‖∞I(C)

]+(h∗n,β1

)−β1Ef

[‖f∗

n,β2− f‖∞I(C)

]

where C is the complementary random event to C. The first term on theright-hand side of (15.10) is bounded from above uniformly in f ∈ Θ(β1)because f∗

n, β1is the minimax rate-optimal estimator over this class. If the

complementary random event C holds, then by the triangle inequality, thesecond term does not exceed

(15.11)(h∗n, β1

)−β1[Ef

[‖f∗

n, β1− f‖∞

]+ C

(h∗n, β1

)β1]

which is also bounded from above by a constant.

Next, we turn to the case f ∈ Θ(β2). As above,

(n/(lnn)

)β2/(2β2+1)Ef

[‖fn − f‖∞

]

≤(h∗n, β2

)−β2Ef

[‖f∗

n, β2− f‖∞ I(C)

]+(h∗n, β2

)−β2Ef

[‖f∗

n, β1− f‖∞ I(C)

].

Once again, it suffices to study the case when the estimator does not matchthe true class of functions,

(h∗n, β2

)−β2Ef

[‖f∗

n, β1− f‖∞ I(C)

]

(15.12) ≤ vn

[ (h∗n, β1

)−2β1Ef

[‖f∗

n, β1− f‖2∞

] ]1/2 [Pf

(C) ]1/2

Page 228: Mathematical Statistics - 213.230.96.51:8090

15.2. Adaptive Estimator in the Sup-Norm 217

where the Cauchy-Schwarz inequality was applied. The deterministic se-quence vn is defined by

vn =(h∗n, β2

)−β2(h∗n, β1

)β1 =( n

lnn

)γ, γ =

β22β2 + 1

− β12β1 + 1

> 0.

The normalized expected value on the right-hand side of (15.12) isbounded from above uniformly over f ∈ Θ(β2). Indeed, over a smootherclass of functions Θ(β2), a coarser estimator f∗

n, β1preserves its slower rate

of convergence. Formally, this bound does not follow from Theorem 10.6because of the squared sup-norm which is not covered by this theorem.However, the proper extension is elementary if we use (15.7) directly (seeExercise 15.96.)

Hence, it remains to show that the probability Pf

(C)in (15.12) vanishes

fast enough to compensate the growth of vn. From the definition of therandom event C and the triangle inequality, we have

C ⊆{‖f∗

n, β1− f‖∞ ≥ 1

2C(h∗n, β1

)β1}∪{‖f∗

n, β2− f‖∞ ≥ 1

2C(h∗n, β1

)β1}.

Now, note that the bias terms in (15.7) are relatively small,

Ab (h∗n, β2

)β2 < Ab (h∗n, β1

)β1 <1

4C (h∗n, β1

)β1

if the constant C in the definition of the adaptive estimator fn satisfies thecondition C > 4Ab. Under this choice of C, the random event C may occuronly due to the large deviations of the stochastic terms. It implies thatC ⊆ A1 ∪ A2 with

A1 ={Av

(nh∗n, β1

)−1/2Z∗β1

≥ 1

4C (h∗n, β1

)β1}

={Z∗β1

≥ C

4Av

√lnn

}

and

A2 ={Av

(nh∗n, β2

)−1/2Z∗β2

≥ 1

4C (h∗n , β1

)β1}

⊆{Av

(nh∗n, β1

)−1/2Z∗β2

≥ 1

4C (h∗n, β1

)β1}

={Z∗β2

≥ C

4Av

√lnn

}.

From the inequality (15.8), it follows that for a large C, the probabili-ties of the random events A1 and A2 decrease faster than any power of n.Choosing C > 4Av

√2β2

2(1 + 2γ), we can guarantee that the right-handside of (15.12) vanishes as n → ∞. �

Remark 15.5. The definition of the adaptive estimator fn and the proofof Proposition 15.4 contain a few ideas common to selection of adaptive es-timators in different nonparametric models. First, we choose an estimatorfrom minimax optimal estimators over each class of functions. Second, wefocus on the performance of the chosen adaptive estimator over the alienclass, provided the choice has been made incorrectly. Third, we use the fact

Page 229: Mathematical Statistics - 213.230.96.51:8090

218 15. Adaptive Estimation

that this performance is always controlled by the probabilities of large devi-ations that vanish faster than their normalization factors that are growingat a power rate. �

15.3. Adaptation in the Sequence Space

Another relatively less technical example of adaptation concerns the adap-tive estimation problem in the sequence space. Recall that the sequencespace, as defined in Section 10.5, is the n-dimensional space of the Fouriercoefficients of regression function. We assume that each regression func-tion f(x), 0 ≤ x ≤ 1, is observed at the equidistant design points x = i/n.This function is defined in terms of its Fourier coefficients ck and the basisfunctions ϕk by the formula

f(i/n) =n−1∑

k=0

ckϕk(i/n), i = 1, . . . , n.

The transition from the original observations of the regression function f tothe sequence space is explained in Lemma 10.16 (see formula (10.31)).

To ease the presentation, we will consider the following model of obser-vations directly in the sequence space,

(15.13) zk = ck + ξk/√n and zk = ck + ξk/

√n, k = 0, . . . , n− 1,

where ξk and ξk are independent standard normal random variables. Thatis, we assume that each observation of the Fourier coefficient ck is observedtwice and the repeated observations are independent.

By Lemma 10.15, for any estimator c = (c0, . . . , cn−1) of the Fouriercoefficients c = (c0, . . . , cn−1), the quadratic risk Rn(c, c) in the sequencespace is equivalent to the quadratic risk of regression, that is,

(15.14) Rn(c, c) = Ec

[ n−1∑

k=0

(ck − ck)2]= Ef

[‖

n−1∑

k=0

ck ϕk − f‖22, n]

where Ec refers to the expectation for the true Fourier coefficients c =(c0, . . . , cn−1), and the discrete L2-norm is defined as

‖f‖22, n = n−1n∑

i=1

f2(i/n).

Next, we take two integers β1 and β2 such that 1 ≤ β1 < β2, and considertwo sets in the sequence space

Θ2,n(β) = Θ2,n(β, L) ={(c0, . . . , cn−1) :

n−1∑

k=0

c2k k2β ≤ L

}, β ∈ {β1, β2 }.

Page 230: Mathematical Statistics - 213.230.96.51:8090

15.3. Adaptation in the Sequence Space 219

We associate Θ2,n(β) with the smoothness parameter β because the decreaserate of Fourier coefficients controls the smoothness of the original regressionfunction (cf. Lemma 10.13.)

As shown in Theorem 10.17, for a known β, uniformly in c ∈ Θ2,n(β),

the risk Rn(c, c) = O(n−2β/(2β+1)) as n → ∞. The rate-optimal estimatoris the projection estimator which can be defined as

c =(z0, . . . , zM , 0, . . . , 0

)

where M = Mβ = n1/(2β+1). In other words, ck = zk if k = 0, . . . ,M, andck = 0 for k ≥ M + 1.

Now, suppose that we do not know the true smoothness of the regressionfunction, or, equivalently, suppose that the true Fourier coefficients may be-long to either class, Θ2,n(β1) or Θ2,n(β2). Can we estimate these coefficientsso that the optimal rate would be preserved over either class of smoothness?To make this statement more precise, we redefine the adaptive risk for se-quence space. For any estimator c = (c0, . . . , cn−1) introduce the adaptivequadratic risk by

(15.15) AR(c) = maxβ ∈{β1, β2}

supc∈Θ2, n(β)

(Mβ)2β

Ec

[ n−1∑

k=0

(ck − ck)2]

where Mβ = n1/(2β+1). The objective is to find an adaptive estimator c thatkeeps the risk AR(c) bounded from above by a constant independent of n.To this end, introduce two estimators, each optimal over its own class,

cβ =(c0, β, . . . , cn−1, β

)=(z0, . . . , zMβ

, 0, . . . , 0), β ∈ {β1, β2}.

Further, define two statistics designed to mimic the quadratic risks of therespective estimators cβ,

Rβ =n−1∑

k=0

(zk − ck, β

)2, β ∈ {β1, β2}.

These statistics are based on the second set of the repeated observations zkin (15.13). From the definition of the quadratic risk (15.14), we have

Ec

[Rβ

]= Ec

[ n−1∑

k=0

(ck + ξk/

√n− ck, β

)2 ]= Rn(cβ, c) + 1.

Next, we give a natural definition of the adaptive estimator in our set-ting. The adaptive estimator is the estimator cβ that minimizes the riskRβ, that is,

c =

{cβ1 if Rβ1 ≤ Rβ2 ,

cβ2 if Rβ1 > Rβ2 .

Page 231: Mathematical Statistics - 213.230.96.51:8090

220 15. Adaptive Estimation

We give the proof of the following theorem at the end of the presentsection after we formulate some important auxiliary results.

Theorem 15.6. There exists a constant r∗ independent of n and such thatthe adaptive risk (15.15) is bounded from above,

AR(c) ≤ r∗.

We have to emphasize that Remark 15.5 stays valid in this case as well.We have to understand the performance of the adaptive estimator if thecorrect selection fails. As we will show, this performance is governed by thelarge deviations probabilities of the stochastic terms. Before we prove thetheorem, let us analyze the structure of the difference ΔR = Rβ1 −Rβ2 thatcontrols the choice of the adaptive estimator. Put

M ={k : Mβ2 + 1 ≤ k ≤ Mβ1

}and ΔM = Mβ1 −Mβ2 = Mβ1

(1 + on(1)

).

The following technical lemmas are proved in the next section.

Lemma 15.7. The difference of the risk estimates satisfies the equation

ΔR = Rβ1 −Rβ2 = −Sn + Mβ1

(1 + on(1)

)/n + U (1)

n /n − 2U (2)n /

√n

with Sn = Sn(c) =∑

k∈M c2k, and the random variables

U (1)n =

k∈M

(ξ2k − 1

)and U (2)

n =∑

k∈Mzkξk.

The following random events help to control the adaptive risk

A1 ={U (1)n ≥ Mβ1

}, A2 =

{U (2)n ≤ −

√nSn/8

},

A3 ={U (1)n ≤ −Mβ1/4

}, and A4 =

{U (2)n ≥ Mβ1/(8

√n)}.

Lemma 15.8. (i) If the inequality Sn > 4Mβ1/n holds, then

Pc(Ai ) ≤ exp{− AiMβ1

}, i = 1 or 2,

where A1 and A2 are positive constants independent of n.

(ii) If Sn = o(Mβ1/n) as n → ∞, then

Pc(Ai ) ≤ exp{− AiMβ1

}, i = 3 or 4,

with some positive constants A3 and A4.

Proof of Theorem 15.6. As explained in the proof of Proposition 15.1, wehave to understand what happens with the risk, if the adaptive estimator ischosen incorrectly, that is, if it does not coincide with the optimal estimatorover the respective class. Let us start with the case c ∈ Θ2,n(β1), while

Page 232: Mathematical Statistics - 213.230.96.51:8090

15.3. Adaptation in the Sequence Space 221

c = cβ2 . The contribution of this instance to the adaptive risk (15.15) isequal to

(Mβ1)2β1 Ec

[I(ΔR > 0)

n−1∑

k=0

(ck, β2 − ck

)2 ]

= (Mβ1)2β1 Ec

[I(ΔR > 0)

( Mβ2∑

k=0

(zk − ck)2 +

n−1∑

k=Mβ2+1

c2k

) ]

= (Mβ1)2β1 Ec

[I(ΔR > 0)

( 1

n

Mβ2∑

k=0

ξ2k + Sn +n−1∑

k=Mβ1+1

c2k

) ]

where Sn is defined in Lemma 15.7. Note that

Ec

[ 1n

Mβ2∑

k=0

ξ2k

]=

Mβ2

n� Mβ1

n=(Mβ1

)−2β1 ,

and since c ∈ Θ2,n(β1),

n−1∑

k=Mβ1+1

c2k ≤ L(Mβ1)−2β1 .

Thus, even multiplied by (Mβ1)2β1 , the respective terms in the risk stay

bounded as n → ∞.

It remains to verify that the term Sn (Mβ1)2β1 Pc(ΔR > 0) also stays

finite as n → ∞. It suffices to study the case when Sn > 4 (Mβ1)−2β1 =

4Mβ1/n, because otherwise this term would be bounded by 4. From Lemma15.7,

{ΔR > 0} ={− Sn + Mβ1

(1 + on(1)

)/n + U (1)

n /n − 2U (2)n /

√n > 0

}

can occur if at least one of the two random eventsA1 ={U

(1)n /n ≥ Mβ1/n

}

or A2 ={− 2U

(2)n /

√n ≥ Sn/4

}occurs. Indeed, otherwise we would have

the inequality

ΔR < −(3/4)Sn + 2Mβ1 (1 + on(1))/n < 0,

since by our assumption, Sn > 4Mβ1/n.

Lemma 15.8 part (i) claims that the probabilities of the random events

A1 and A2 decrease faster than exp{− An1/(2β1+1)

}as n → ∞, which

implies that Sn (Mβ1)2β1 Pc(ΔR > 0) is finite.

The other case, when c ∈ Θ2, n(β2) and c = cβ1 , is treated in a similarfashion, though some calculations change. We write

(Mβ2)2β2 Ec

[I(ΔR ≤ 0)

n−1∑

k=0

(ck, β1 − ck)2]

Page 233: Mathematical Statistics - 213.230.96.51:8090

222 15. Adaptive Estimation

(15.16) = (Mβ2)2β2 Ec

[I(ΔR ≤ 0)

( 1

n

Mβ1∑

k=0

ξ2k +n−1∑

k=Mβ1+1

c2k

) ].

Since c ∈ Θ2,n(β2),

(Mβ2)2β2

n−1∑

k=Mβ1+1

c2k ≤ L(Mβ2/Mβ1

)2β2 → 0, as n → ∞.

It remains to verify that the first term in (15.16) is bounded. We obtain

(Mβ2)2β2 Ec

[I(ΔR ≤ 0)

( 1

n

Mβ1∑

k=0

ξ2k

) ]

≤ (Mβ2)2β2 E

1/2c

[ ( 1

n

Mβ1∑

k=0

ξ2k

)2 ]P1/2c (ΔR ≤ 0)

≤ (Mβ2)2β2

( 2Mβ1

n

)P1/2c (ΔR ≤ 0) = 2nγ

P1/2c (ΔR ≤ 0).

Here we applied the Cauchy-Schwartz inequality, and the elementary calcu-lations Ec

[ξ4k]= 3, hence,

Ec

[ ( Mβ1∑

k=0

ξ2k

)2 ]=

Mβ1∑

k=0

Ec

[ξ4k]+

Mβ1∑

k, l=0k =l

Ec

[ξ2k ξ

2l

]

= 3Mβ1 + Mβ1 (Mβ1 − 1) ≤ 4M2β1.

The constant γ in the exponent above is equal to

γ =2β2

2β2 + 1+

2

2β1 + 1− 1 =

2β22β2 + 1

− 2β12β1 + 1

> 0.

If c ∈ Θ2,n(β2), then Sn ≤ L (Mβ2)−2β2 = LMβ2/n = o

(Mβ1/n

)as n →

∞. Note that the random event

{ΔR ≤ 0} ={− Sn + Mβ1 (1 + on(1))/n + U (1)

n /n − 2U (2)n /

√n ≤ 0

}

={− U (1)

n /n + 2U (2)n /

√n ≥ Mβ1 (1 + on(1))/n

}

occurs if either A3 ={− U

(1)n ≥ Mβ1/4

}or A4 =

{U

(2)n ≥ Mβ1/(8

√n)}

occurs. Again, as Lemma 15.8 (ii) shows, the probabilities of these random

events decrease faster than n2γ , so that nγP1/2c (ΔR ≤ 0) → 0 as n → ∞,

and the statement of the theorem follows. �

Page 234: Mathematical Statistics - 213.230.96.51:8090

15.4. Proofs of Lemmas 223

15.4. Proofs of Lemmas

Proof of Lemma 15.7. By straightforward calculations, we obtain

ΔR = Rβ1 − Rβ2 =∑

k∈M(zk − zk)

2 −∑

k∈Mz2k

=1

n

k∈M

(ξk − ξk

)2 −[ ∑

k∈Mc2k +

2√n

k∈Mck ξk +

1

n

k∈Mξ2k

]

= − 2

n

k∈Mξk ξk +

1

n

k∈Mξ2k −

k∈Mc2k − 2√

n

k∈Mck ξk

= −∑

k∈Mc2k +

1

nΔM +

1

n

k∈M(ξ2k − 1) − 2√

n

k∈M

(ck +

ξk√n

)ξk

where ΔM = Mβ2 − Mβ1 = Mβ1

(1 + on(1)

)is the number of elements in

M. So, the lemma follows. �To prove Lemma 15.8 we need the following result.

Proposition 15.9. The moment generating functions of the random vari-

ables U(1)n and U

(2)n are respectively equal to

G1(t) = E

[exp

{t U (1)

n

} ]= exp

{− Mβ1

(1+on(1)

) (t+(1/2) ln(1−2t)

)}

and

G2(t) = E

[exp

{t U (2)

n

} ]= exp

{ nt2

2(n− t2)Sn − ΔM

2ln(1− t2/n)

}.

Proof. Note that E[exp{tξ2}

]= (1− 2t)−1/2, t ≤ 1/2, where ξ ∼ N (0, 1).

Therefore,

E[exp{t (ξ2 − 1)}

]= exp

{− t− (1/2) ln(1− 2t)

},

and the expression for G1(t) follows from independence of the random vari-ables ξ2k.

Next, the moment generating function of U(2)n can be expressed as

G2(t) = E

[exp

{t U (2)

n

} ]= E

[exp

{t∑

k∈M

(ck +

ξk√n

)ξk} ]

= E

[E

[exp

{t∑

k∈M

(ck +

ξk√n

)ξk} ∣∣∣ ξk, k ∈ M

] ]

=∏

k∈ME

[exp

{(t2/2n)( ck

√n+ ξk)

2} ]

.

Now, for any real a < 1 and any b, we have the formula

E[exp

{(a/2) (b+ ξ)2

} ]= (1− a)−1/2 exp

{ab2/(2(1− a))

}.

Page 235: Mathematical Statistics - 213.230.96.51:8090

224 15. Adaptive Estimation

Applying this formula with a = t2/n and b = ck√n, we obtain

G2(t) =∏

k∈Mexp

{ n t2

2(n− t2)c2k − 1

2ln(1− t2/n)

}

which completes the proof because Sn =∑

k∈M c2k. �

Proof of Lemma 15.8. All the inequalities in this lemma follow from theexponential Chebyshev inequality (also known as Chernoff’s inequality),

P(U ≥ x) ≤ G(t) exp{−t x}where G(t) = E

[exp{t U}

]is the moment generation function of a random

variable U.

It is essential that the moment generating functions of the random

variables U(1)n and U

(2)n in Proposition 15.9 are quadratic at the origin,

Gi(t) = O(t2), i = 1, 2, as t → 0. A choice of a sufficiently small t wouldguarantee the desired bounds. In the four stated inequalities, the choices oft differ.

We start with the random event A1 ={U

(1)n ≥ Mβ1

},

Pc(A1 ) ≤ G1(t) exp{− tMβ1

}

= exp{− Mβ1

(1 + on(1)

)(t+ (1/2) ln(1− 2t)

)− tMβ1

}.

We choose t = 1/4 and obtain

Pc(A1 ) ≤ exp{− (1/2)(1− ln 2)Mβ1

(1 + on(1)

)}≤ exp

{− 0.15Mβ1

}.

Similarly, if we apply Chernoff’s inequality to the random variable −U(2)n

with t =√n/10, and use the fact that ΔM < Mβ1 ≤ nSn/4, we get

Pc(A2 ) = Pc

(− U (2)

n ≥√nSn/8

)

≤ exp{ n t2

2(n− t2)Sn − ΔM

2ln(1− t2/n) − t

√nSn/8

}

= exp{nSn

198− ΔM

2ln(99/100) − nSn

80

}

≤ exp{nSn

198− nSn

8ln(99/100) − nSn

80

}

≤ exp{−AnSn

}≤ exp

{− 4AMβ1

}

where A = −1/198 + (1/8) ln(99/100) + 1/80 > 0.

Page 236: Mathematical Statistics - 213.230.96.51:8090

Exercises 225

To prove the upper bound for the probability of A3, take t = 1/8. Then

Pc(A3 ) = Pc

(− U (1)

n ≥ Mβ1/4)

≤ exp{−Mβ1

(1 + on(1)

)(− t+ (1/2) ln(1 + 2t)

)− tMβ1/4

}

= exp{−AMβ1

(1 + on(1)

)}

where A = −1/8 + (1/2) ln(5/4) + 1/32 > 0.

Finally, if nSn = o(Mβ1), then

G2(t) = exp{− (1/2)Mβ1

(1 + on(1)

)ln(1− t2/n)

}.

Put t =√n/8. Then

Pc(A4 ) = Pc

(U (2)n ≥ Mβ1/(8

√n))

≤ exp{− (1/2)Mβ1

(1 + on(1)

)ln(1− t2/n)− tMβ1/(8

√n)}

= exp{−AMβ1

(1 + on(1)

)}

where A = (1/2) ln(63/64) + 1/64 > 0. �

Exercises

Exercise 15.95. Let ψn = ψn(f) be the rate defined by (15.3). Show that

there exists an estimator fn and a constant r∗ independent of n such that

maxβ ∈{β1, β2}

supf ∈Θ(β)

Ef

[ ∣∣∣fn − f(x0)

ψn(f)

∣∣∣]≤ r∗.

Exercise 15.96. Use (15.7) to prove that the expectation in (15.12) isbounded, that is, show that uniformly in f ∈ Θ(β2), the following inequalityholds:

(h∗n, β1)−2β1Ef

[‖f∗

n, β1− f‖2∞

]≤ r∗

where a constant r∗ is independent of n.

Page 237: Mathematical Statistics - 213.230.96.51:8090
Page 238: Mathematical Statistics - 213.230.96.51:8090

Chapter 16

Testing ofNonparametricHypotheses

16.1. Basic Definitions

16.1.1. Parametric Case. First of all, we introduce the notion of para-metric hypotheses testing. Suppose that in a classical statistical model withobservations X1, . . . , Xn that obey a probability density p(x, θ), θ ∈ Θ ⊆ R,we have to choose between two values of the parameter θ. That is, we wantto decide whether θ = θ0 or θ1, where θ0 and θ1 are known. For simplicitywe assume that θ0 = 0 and θ1 = 0. Our primary hypothesis, called the nullhypothesis, is written as

H0 : θ = 0,

while the simple alternative hypothesis has the form

H1 : θ = θ1.

In testing the null hypothesis against the alternative, we do not estimatethe parameter θ . A substitution for an estimator is a decision rule Δn =Δn(X1, . . . , Xn) that takes on only two values, for example, 0 or 1. The caseΔn = 0 is interpreted as acceptance of the null hypothesis, whereas the caseΔn = 1 means rejection of the null hypothesis in favor of the alternative.

The appropriate substitution for the risk function is the error probability.Actually, there are probabilities of two types of errors. Type I error iscommitted when a true null hypothesis is rejected, whereas acceptance of a

227

Page 239: Mathematical Statistics - 213.230.96.51:8090

228 16. Testing of Nonparametric Hypotheses

false null results in type II error. The respective probabilities are denotedby P0

(Δn = 1

)and Pθ1

(Δn = 0

).

The classical optimization problem in hypotheses testing consists of find-ing a decision rule that minimizes the type II error, provided the type I errordoes not exceed a given positive number α,

Pθ1

(Δn = 0

)→ inf

Δn

subject to P0

(Δn = 1

)≤ α.

If n is large, then a reasonable anticipation is that α can be chosen small,that is, α → 0 as n → ∞. This criterion of optimality is popular because ofits elegant solution suggested by the fundamental Neyman-Pearson lemma(see Exercise 16.97).

A more sophisticated problem is to test the null hypothesis against acomposite alternative,

H1 : θ ∈ Λn

where Λn is a known set of the parameter values that does not include theorigin, that is, 0 ∈ Λn. In our asymptotic studies, different criteria for findingthe decision rule are possible. One reasonable criterion that we choose isminimization of the sum of the type I error probability and the supremumover θ ∈ Λn of the type II error probability,

rn(Δn) = P0

(Δn = 1

)+ sup

θ∈Λn

(Δn = 0

)→ inf

Δn

.

The key question in asymptotic studies is: How distant should Λn befrom the origin, so that it is still possible to separate H0 from the alternativeH1 with a high probability? By separation between hypotheses we mean thatthere exists a decision rule Δ∗

n such that the sum of the error probabilitiesrn(Δ

∗n) is vanishing, limn→∞ rn(Δ

∗n) = 0.

16.1.2. Nonparametric Case. Our objective here is to extend the para-metric hypotheses testing to the nonparametric setup. We replace the pa-rameter θ by a regression function f from a Holder class Θ(β) = Θ(β, L, L1),and consider the model of observations,

yi = f(i/n) + εi where εiiid∼ N (0, σ2).

Suppose that we want to test H0 : f = 0 against the composite alternativeH1 : f ∈ Λn where the set of regression functions Λn is specified.

A general approach to the nonparametric hypotheses testing is as follows.Assume that a norm ‖f‖ of the regression function is chosen. Let ψn be adeterministic sequence, ψn → 0 as n → ∞, which plays the same role asthe rate of convergence in estimation problems. Define the set of alternativehypotheses

(16.1) Λn = Λn(β,C, ψn) ={f : f ∈ Θ(β) and ‖f‖ ≥ C ψn

}

Page 240: Mathematical Statistics - 213.230.96.51:8090

16.2. Separation Rate in the Sup-Norm 229

with a positive constant C. Denote the corresponding sum of the error prob-abilities by

rn(Δn, β, C, ψn) = P0

(Δn = 1

)+ sup

f ∈Λn(β,C,ψn)Pf

(Δn = 0

).

We call the sequence ψn a minimax separation rate if (i) for any smallpositive γ, there exist a constant C∗ and a decision rule Δ∗

n such that

(16.2) lim supn→∞

rn(Δ∗n, β, C

∗, ψn) ≤ γ,

and (ii) there exist positive constants C∗ and r∗, independent of n and suchthat for any decision rule Δn,

(16.3) lim infn→∞

rn(Δn, β, C∗, ψn) ≥ r∗.

The meaning of this definition is transparent. The regression functionswith the norm satisfying ‖f‖ ≥ C∗ ψn can be tested against the zero re-gression function with however small prescribed error probabilities. On theother hand, the reduction of the constant below C∗ holds the sum of errorprobabilities above r∗ for any sample size n.

16.2. Separation Rate in the Sup-Norm

In general, estimation of regression function and testing of hypotheses (in thesame norm) are two different problems. The minimax rate of convergence isnot necessarily equal to the minimax separation rate. We will demonstratethis fact in the next section. For some norms, however, they are the same.In particular, it happens in the sup-norm.

The following result is not difficult to prove because all the preliminarywork is already done in Section 12.1.

Theorem 16.1. Assume that the norm in the definition of Λn is the sup-norm,

Λn = Λn(β,C, ψn) ={f : f ∈ Θ(β) and ‖f‖∞ ≥ C ψn

}.

Then the minimax separation rate coincides with the minimax rate of con-

vergence in the sup-norm ψn =((lnn)/n

)β/(2β+1).

Proof. First, we prove the existence of the decision rule Δ∗n with the claimed

separation rate such that (16.2) holds. Let f∗n be the regressogram with the

rate-optimal bandwidth h∗n =((lnn)/n

)1/(2β+1). Our starting point is the

inequalities (15.7) and (15.8). For any C > Ab + 2βAv, uniformly overf ∈ Θ(β), these inequalities yield

Pf

( ∥∥f∗n − f

∥∥∞ ≥ C ψn

)≤ Pf

(Ab(h

∗n)

β +Av(nh∗n)

−1/2Z∗β ≥ C (h∗n)

β)

Page 241: Mathematical Statistics - 213.230.96.51:8090

230 16. Testing of Nonparametric Hypotheses

(16.4) = Pf

(Ab +Av Z∗

β/√lnn ≥ C

)≤ Pf

(Z∗β ≥ 2β

√lnn

)≤ β n−1

where we have applied (15.8) with y =√2. Put C∗ = 2C, and define the

set of alternatives by

Λn(β,C∗, ψn) =

{f : f ∈ Θ(β) and ‖f‖∞ ≥ C∗ ψn

}.

Introduce a rate-optimal decision rule

Δ∗n =

{0, if ‖f∗

n‖∞ < 12 C

∗ ψn,

1, otherwise.

Then, from (16.4), we obtain that as n → ∞,

P0

(Δ∗

n = 1)= P0

(‖f∗

n‖∞ ≥ 1

2C∗ ψn

)= P0

(‖f∗

n‖∞ ≥ C ψn

)→ 0.

Next, for any f ∈ Λn(β,C∗, ψn), by the triangle inequality, as n → ∞,

Pf

(Δ∗

n = 0)= Pf

(‖f∗

n‖∞ <1

2C∗ ψn

)≤ Pf

(‖f∗

n − f‖∞ ≥ C ψn

)→ 0.

Hence (16.2) is fulfilled for any γ > 0.

The proof of the lower bound (16.3) is similar to that in Lemma 12.2. Werepeat the construction of the Q test functions fq, q = 1, . . . , Q, in (12.3)based on a common “bump” function ϕ . For any test Δn, introduce therandom event D = {Δn = 1 }. Then for any δ > 0,

P0

(D)+ max

1≤ q≤QPfq

(D)≥ P0

(D)+ E0

[I(D)ξn]

≥ P0

(D)+ (1− δ)P0

(D ∩ {ξn > 1− δ}

)≥ (1− δ)P0

(ξn > 1− δ

)

where

ξn =1

Q

Q∑

q=1

exp{ln(dPfq/dP0

) }.

As shown in Lemma 12.2, the random variable ξn converges to 1 as n → ∞.Hence,

lim infn→∞

[P0

(Δn = 1

)+ max

1≤ q≤QPfq

(Δn = 0

) ]≥ 1− δ.

Note that if C∗ < ‖ϕ‖∞, then all the test functions fq, q = 1, . . . , Q, belongto the set of alternatives

Λn(β,C∗, ψn) ={f : f ∈ Θ(β) and ‖f‖ ≥ C∗ ψn

}.

Thus the lower bound (16.3) holds with r∗ = 1− δ however close to 1. �

Page 242: Mathematical Statistics - 213.230.96.51:8090

16.3. Sequence Space. Separation Rate in the L2-Norm 231

16.3. Sequence Space. Separation Rate in the L2-Norm

Analyzing the proof of Theorem 16.1, we find the two remarkable propertiesof the hypotheses testing in the sup-norm. First, the separation rate ψn

coincides with the minimax rate of estimation, and the minimax optimaldecision rule if very simple: the null hypothesis is accepted if the sup-normof the estimator is small enough. Second, the choice of a sufficiently largeconstant C∗ in the definition of the alternative hypothesis Λn(β,C

∗, ψn)guarantees the upper bound for arbitrarily small error probability γ. More-over, C∗ does not depend on the value of γ. It happens because the sup-normis a very special type of norm. The distribution of ‖f∗

n‖∞ is degenerate. IfC∗ is large enough, then Pf

(‖f∗

n‖∞ ≥ C∗ ψn

)→ 0 as n → ∞.

In this section, we turn to the quadratic norm. To ease the presentation,we consider the problem of hypotheses testing in the sequence space. Allthe necessary definitions are given in Section 10.5. The observations zk inthe sequence space are

zk = ck + σ ξk/√n, k = 0, . . . , n− 1,

where the ck’s are the Fourier coefficients of the regression function f , thatis, f =

∑n−1k=0 ckϕk. Here the ϕk’s form an orthonormal basis in the discrete

L2-norm. The errors ξk are independent standard normal random variables,and σ > 0 represents the standard deviation of the observations in theoriginal regression model.

We use c =(c0, . . . , cn−1

)to denote the whole set of the Fourier coeffi-

cients. As in Section 15.3, it is convenient to work directly in the sequencespace of the Fourier coefficients. For ease of reference, we repeat here thedefinition of the following class:

Θ2, n(β) = Θ2, n(β, L) ={ (

c0, . . . , cn−1

):

n−1∑

k=0

c2k k2β ≤ L

}.

We want to test the null hypothesis that all the Fourier coefficients are

equal to zero versus the alternative that their L2-norm ‖c‖2 =(∑

c2k)1/2

is larger than a constant that may depend on n. Our goal is to find theminimax separation rate ψn. Formally, we study the problem of testing H0 :c = 0 against the composite alternative

H1 : c ∈ Λn = Λn(β,C, ψn)

where

(16.5) Λn ={c : c ∈ Θ2,n(β) and ‖c‖2 ≥ C ψn

}.

In Section 13.1, it was shown that the squared L2-norm of regressionin [0, 1] can be estimated with the minimax rate 1/

√n. This is true in the

Page 243: Mathematical Statistics - 213.230.96.51:8090

232 16. Testing of Nonparametric Hypotheses

sequence space as well. The proof in the sequence space is especially simple.Indeed, the sum of the centered z2k ’s admits the representation

n−1∑

k=0

(z2k − σ2

n

)= ‖c‖22 − 2σ√

n

n−1∑

k=0

ck ξk +σ2

n

n−1∑

k=0

(ξ2k − 1)

(16.6) = ‖c‖22 − 2σ√nN +

σ2

√nYn

where N denotes the zero-mean normal random variable with the variance‖c‖22. The variable Yn is a centered chi-squared random variable that isasymptotically normal,

Yn =n−1∑

k=0

(ξ2k − 1)/√n → N (0, 2).

The convergence rate 1/√n in estimation of ‖c‖22 follows immediately from

(16.6).

Now we continue with testing the null hypothesis against the compositealternative (16.5). We will show that the separation rate of testing in thequadratic norm is equal to ψn = n−2β/(4β+1). Note that this separation rateis faster than the minimax estimation rate in the L2-norm n−β/(2β+1). Theproof of this fact is split between the upper and lower bounds in the theoremsbelow.

We introduce the rate-optimal decision rule, proceeding similar to (16.6).

We take Mn = n2/(4β+1), so that the separation rate ψn = M−βn , and esti-

mate the norm of the Fourier coefficients by

Sn =

Mn∑

k=0

(z2k − σ2/n

).

Consider a class of decision rules Δn that depends on a constant b,

(16.7) Δn = Δn(β, b) =

{0, if Sn < bψ2

n = b n−4β/(4β+1),

1, otherwise.

The following theorem claims that by choosing properly the constants inthe definitions of the set of alternatives and the decision rule, we can makethe error probabilities less than any prescribed number in the sense of theupper bound (16.2).

Theorem 16.2. For any small positive γ, there exist a constant C = C∗ =C∗(γ) in the definition (16.5) of the set of alternatives Λn = Λn(β,C

∗, ψn),

Page 244: Mathematical Statistics - 213.230.96.51:8090

16.3. Sequence Space. Separation Rate in the L2-Norm 233

and a constant b = b(γ) in the definition (16.7) of decision rule Δn = Δ∗n =

Δ∗n(β, b) such that

lim supn→∞

rn(Δ∗n) ≤ γ

where

rn(Δ∗n) = P0

(Δ∗

n = 1)+ sup

c∈Λn(β,C∗,ψn)Pc

(Δ∗

n = 0).

Proof. It suffices to show that for all sufficiently large n, probabilities oftype I and II errors are bounded from above by γ/2, that is, it suffices toshow that

(16.8) lim supn→∞

P0

(Δ∗

n = 1)≤ γ/2

and

(16.9) lim supn→∞

supc∈Λn(β,C∗,ψn)

Pc

(Δ∗

n = 0)≤ γ/2.

Starting with the first inequality, we write

P0

(Δ∗

n = 1)= P0

(Sn ≥ b ψ2

n

)= P0

( Mn∑

k=0

(z2k − σ2/n

)≥ b ψ2

n

)

= P0

( σ2

n

Mn∑

k=0

(ξ2k − 1) > bψ2n

)= P0

(σ2 n−1

√2(Mn + 1)Yn > bψ2

n

)

where Yn =∑Mn

k=0 (ξ2k−1)/

√2(Mn + 1) is asymptotically standard normal

random variable. Under our choice of Mn , we have that as n → ∞,

n−1√

Mn + 1 ∼ n1/(4β+1)−1 = n−4β/(4β+1) = ψ2n.

Consequently,

P0

(Δ∗

n = 1)= P0

(√2σ2 Yn ≥ b

(1 + on(1)

) )→ 1 − Φ

( b√2σ2

),

as n → ∞, where Φ denotes the cumulative distribution function of a stan-dard normal random variable. If we choose b =

√2σ2 q1−γ/2 with q1−γ/2

standing for the (1 − γ/2)-quantile of the standard normal distribution,then the inequality (16.8) follows.

To verify the inequality (16.9), note that

Pc

(Δ∗

n = 0)= Pc

(Sn ≤ b ψ2

n

)= Pc

( Mn∑

k=0

(z2k − σ2/n

)≤ b ψ2

n

)

= Pc

(‖c‖22 −

n−1∑

k=Mn+1

c2k − 2σ√n

Mn∑

k=0

ck ξk +σ2

n

Mn∑

k=0

(ξ2k − 1) ≤ b ψ2n

).

Page 245: Mathematical Statistics - 213.230.96.51:8090

234 16. Testing of Nonparametric Hypotheses

Observe that for any c ∈ Λn(β,C∗, ψn), the variance of the following nor-

malized random sum vanishes as n → ∞,

Varc

[2σ√n‖c‖22

Mn∑

k=0

ckξk

]≤ 4σ2

n‖c‖22≤ 4σ2

n(C∗ψn)2=( 2σC∗

)2n−1/(4β+1) → 0,

which implies that

‖c‖22 − 2σ√n

Mn∑

k=0

ck ξk = ‖c‖22(1 + on(1)

)as n → ∞,

where on(1) → 0 in Pc-probability. Thus,

Pc

(Δ∗

n = 0)= Pc

( σ2

n

Mn∑

k=0

(ξ2k−1) ≤ −‖c‖22(1+on(1)

)+

n−1∑

k=Mn+1

c2k+b ψ2n

).

Put Yn =∑Mn

k=0(ξ2k − 1)/

√2(Mn + 1). Note that

n−1∑

k=Mn+1

c2k <n−1∑

k=Mn+1

( k

Mn

)2βc2k ≤ M−2β

n L.

Therefore,

Pc

(Δ∗

n = 0)≤ Pc

(σ2

n

√2(Mn + 1)Yn

≤ −(C∗ψn)2(1 + on(1)

)+M−2β

n L+ bψ2n

)

where Yn is asymptotically standard normal. Note that here every term has

the magnitude ψ2n = M−2β

n . If we cancel ψ2n, the latter probability becomes

Pc

(√2σ2 Yn ≤ (−C∗ + L+ b)

(1 + on(1)

) )→ Φ

( −C∗ + L+ b√2σ2

),

as n → ∞. Choose C∗ = 2b+L and recall that b =√2σ2q1−γ/2. We obtain

−C∗ + L+ b√2σ2

=−b√2σ2

= − q1−γ/2 = qγ/2

where qγ/2 denotes the γ/2-quantile of Φ. Thus, the inequality (16.9) isvalid. �

Remark 16.3. In the case of the sup-norm, we can find a single constantC∗ to guarantee the upper bound for any γ. In the case of the L2-norm, itis not possible. Every γ requires its own constants C∗ and b. �

As the next theorem shows, the separation rate ψn = n−2β/(4β+1) can-not be improved. If the constant C in the definition (16.5) of the set ofalternatives Λn(β,C, ψn) is small, then there is no decision rule that wouldguarantee arbitrarily small error probabilities.

Page 246: Mathematical Statistics - 213.230.96.51:8090

16.3. Sequence Space. Separation Rate in the L2-Norm 235

Theorem 16.4. For any constant r∗, 0 < r∗ < 1, there exists C = C∗ > 0 inthe definition (16.5) of the set of alternatives Λn such that for any decisionrule Δn, the sum of the error probabilities

rn(Δn) = P0

(Δn = 1

)+ sup

c∈Λn(β,C∗,ψn)Pc

(Δn = 0

)

satisfies the inequality lim infn→∞ rn(Δn) ≥ r∗.

Proof. Let Mn = n2/(4β+1) = ψ−1/βn . Introduce a set of 2Mn binary se-

quences

Ωn ={ω =

(ω1, . . . , ωMn

), ωk ∈ {−1, 1}, k = 1, . . . ,Mn

}.

Define a set of alternatives Λ(0)n with the same number of elements 2Mn ,

Λ(0)n =

{c = c(ω) : ck = C∗ ψn ωk /

√Mn if k = 1, . . . ,Mn,

and ck = 0 otherwise, ω ∈ Ωn

}

where a positive constant C∗ will be chosen later. Note that if C∗ is smallenough, C2

∗ < (2β + 1)L, then

(16.10) Λ(0)n ⊂ Λn(β,C∗, ψn).

Indeed, if c ∈ Λ(0)n , then

n−1∑

k=0

c2k k2β =

(C∗ψn)2

Mn

Mn∑

k=1

k2β ∼ (C∗ψn)2

Mn

M2β+1n

2β + 1=

C2∗

2β + 1< L.

Thus, every c ∈ Λ(0)n belongs to Θ2,n(β, L). Also, the following identity takes

place:

‖c‖2 =[ Mn∑

k=1

(C∗ψnωk)2/Mn

]1/2= C∗ψn,

which implies (16.10).

Next, we want to show that for any decision rule Δn, the followinginequality holds:

(16.11) lim infn→∞

[P0

(Δn = 1

)+ max

ω ∈Ωn

Pc(ω)

(Δn = 0

) ]≥ r∗.

Put

α2n =

( C∗ψn

σ

)2 n

Mn=( C∗

σ

)2n−1/(4β+1) → 0, as n → ∞.

Further, we substitute the maximum by the mean value to obtain

rn(Δn) ≥ P0

(Δn = 1

)+ max

ω ∈Ωn

Pc(ω)

(Δn = 0

)

Page 247: Mathematical Statistics - 213.230.96.51:8090

236 16. Testing of Nonparametric Hypotheses

≥ P0

(Δn = 1

)+ 2−Mn

ω ∈Ωn

Pc(ω)

(Δn = 0

)

= E0

[I(Δn = 1

)+ I(Δn = 0

)2−Mn

ω ∈Ωn

exp{Ln(ω)

} ]

= E0

[I(Δn = 1

)+ I(Δn = 0

)ηn

]

where

Ln(ω) = lndPc(ω)

dP0and ηn = 2−Mn

ω ∈Ωn

exp{Ln(ω)

}.

Now, the log-likelihood ratio

Ln(ω) =n

σ2

Mn∑

k=1

(ck zk − c2k/2

)=

Mn∑

k=1

(αn ωk ξk − α2

n/2).

Here we have used the fact that, under P0, zk = σ ξk/√n. In addition, the

identities ω2k = 1 and

√n ck/σ = (C∗ψn/σ)

√n/Mn ωk = αn ωk

were employed. The random variable ηn admits the representation, whichwill be derived below,

(16.12) ηn = exp{− 1

2α2nMn

} Mn∏

k=1

[ 12eαn ξk +

1

2e−αn ξk

].

Even though this expression is purely deterministic and can be shown alge-braically, the easiest way to prove it is by looking at the ωk’s as independentrandom variables such that

P(ω)(ωk = ± 1

)= 1/2.

Using this definition, the random variable ηn can be computed as the ex-pected value, denoted by E

(ω), with respect to the distribution P(ω),

ηn = E(ω)[exp

{Ln(ω)

} ]=

Mn∏

k=1

E(ω)[exp

{αnξkωk −

1

2α2n

} ]

= exp{− α2

nMn/2} Mn∏

k=1

E(ω)[exp

{αn ξk ωk

} ]

so that the representation (16.12) for ηn follows.

Recall that ξk are independent standard normal random variables withrespect to the P0-distribution, hence, E0[ ηn ] = 1. To compute the secondmoment of ηn, we write

E0[ η2n ] = exp

{− α2

nMn

}(E0

[ 14e2αnξ1 +

1

2+

1

4e−2αnξ1

] )Mn

Page 248: Mathematical Statistics - 213.230.96.51:8090

Exercises 237

= exp{−α2

nMn

}(12e2α

2n+

1

2

)Mn

= exp{−α2

nMn

}(1+α2

n+α4n+o(α4

n))Mn

= exp{− α2

nMn+(α2n+α4

n/2+ o(α4n))Mn

}= exp

{α4nMn/2+ o(α4

nMn)}.

From the definition of αn, we have

α4nMn =

((C∗/σ)

2 n−1/(4β+1))2

Mn = (C∗/σ)4.

Thus, as n → ∞, we find that o(α4nMn) → 0 and E0[ η

2n ] → exp

{C4∗/(2σ

4)}.

Then the variance Var0[ ηn ] ∼ exp{C4∗/(2σ

4)}

− 1 for large n. For anyδ > 0, by the Chebyshev inequality,

lim infn→∞

P0

(ηn ≥ 1− δ

)≥ 1 − δ−2

(exp

{C4∗/(2σ

4)}− 1

).

The right-hand side can be made arbitrarily close to 1 if we choose suffi-ciently small C∗. Finally, we obtain that

lim infn→∞

rn(Δn) ≥ lim infn→∞

E0

[I(Δn = 1

)+ I(Δn = 0

)ηn

]

≥(1−δ

)lim infn→∞

P0

(ηn ≥ 1−δ

)≥(1−δ

)[1− δ−2

(exp

{C4∗/(2σ

4)}− 1) ]

.

By choosing a small positive δ and then a sufficiently small C∗, we can makethe right-hand side larger than any r∗ < 1, which proves the lower bound(16.11). �

Exercises

Exercise 16.97. (Fundamental Neyman-Pearson Lemma) Assume that fora given α > 0 , there exists a constant c > 0 such that P0(Ln ≥ c) = α

where Ln =∑n

i=1 ln(p(Xi, θ1)/p(Xi, 0)

). Put Δ∗

n = I(Ln ≥ c). Let Δn be

a decision rule Δn which probability of type I error P0(Δn = 1) ≤ α. Showthat the probability of type II error of Δn is larger than type II error of Δ∗

n,that is, Pθ1(Δn = 0) ≥ Pθ1(Δ

∗n = 0).

Page 249: Mathematical Statistics - 213.230.96.51:8090
Page 250: Mathematical Statistics - 213.230.96.51:8090

Bibliography

[Bor99] A.A. Borovkov, Mathematical statistics, CRC, 1999.

[Efr99] S. Efromovich, Nonparametric curve estimation: Methods, theory and appli-cations, Springer, 1999.

[Eub99] R. L. Eubank, Nonparametric regression and spline smoothing, 2nd ed., CRC,1999.

[Har97] J. Hart, Nonparametric smoothing and lack-of-fit tests, Springer, 1997.

[HMSW04] W. Hardle, M. Muller, S. Sperlich, and A. Werwatz, Nonparametric and semi-parametric models, Springer, 2004.

[IH81] I.A. Ibragimov and R.Z. Has’minski, Statistical estimation. Asymptotic theory,Springer, 1981.

[Mas07] P. Massart, Concentration inequalities and model selection, Springer, 2010.

[Pet75] V.V. Petrov, Sums of independent random variables, Berlin, New York:Springer-Verlag, 1975.

[Sim98] J.S. Simonoff, Smoothing methods in statistics, Springer, 1996.

[Tsy08] A.B. Tsybakov, Introduction to nonparametric estimation, Springer, 2009.

[Was04] L. Wasserman, All of statistics: A concise course in statistical inference,Springer, 2003.

[Was07] , All of nonparametric statistics, Springer, 2005.

239

Page 251: Mathematical Statistics - 213.230.96.51:8090
Page 252: Mathematical Statistics - 213.230.96.51:8090

Index of Notation

(D−1∞ )l,m, 120

AR(fn), 212

Bq, 132, 158

Cb, 117

Cn, 14

Cn (X1, . . . , Xn), 14

Cv, 117

H, 78

H(θ0, θ1), 31

I(θ), 6

In(θ), 7

K(u), 105

K±, 58

Ksgn(i), 58

LP (u), 156

LS(u), 156

L′n(θ), 6

Ln(θ), 4

Ln(θ |X1 , . . . , Xn), 4

N , 115

N(x, hn), 106

Nq, 133

Pk(u), 156

Q, 131

Rβ , 219

Rn(θ, θn, w), 11

Rn(f, fn), 102

Sm(u), 153

Tn, 46

W (j), 52

X1, . . . , Xn, 3

Z1(θ0, θ1), 32

Zn(θ0, θ1), 32, 45

ΔH, 78

ΔLn, 24

ΔLn(θ0 , θ1), 24

Δθ, 45

Δn, 227

Φl(f), 201

Ψ′(f), 188

Ψ(f), 185, 188

Ψ∗n, 190

Θ, 3

Θ(β), 102

Θ(β, L, L1), 101

Θ(β, L, L1, g∗), 200

Θα, 51, 75

Θ2,n, 145

Θ2,n(β,L), 145

Xn, 5

βn(θn, w, π), 13

θ, 88

θ, 87

ε, 87

η, 78

γ0, 117

γ1, 118

γk(x), 139

γm,q(x), 138

fn(t), 118

fn(x), 90, 102, 104

Φ(l)n , 202

Ψn, 186

τn, 69


θ, 4

θτ , 78

θn, 4

θn(X1, . . . , Xn), 4

θm,q, 133
λ(θ), 46
λ0, 48
[θn], 5
Ef [ · ], 102
Ef [ · | X ], 102
Eθ[ · | X ], 89
I(·), 11
Varθ[θn], 7
Varθ[ · | X ], 89
D, 90
D∞, 95
G, 87
H0, 227
H1, 227
In, 87
y, 87
gj, 87
r, 92
y, 87
F, 66
F(A, a), 212
Fτ, 67
Ft, 65
H, 200
Lp, 156
Ls, 156
Nm, 118
Nm,q, 133
S, 87
T, 69
Tγ, 69
X, 86
π(θ), 13
ψn, 23, 103
ρ(xi, x), 106
τ, 66
τ∗n, 70
θ∗n, 16, 76
θ∗n(X1, . . . , Xn), 16

Cn, 14

f(θ |X1, . . . , Xn), 14
υn,i(x), 104
υn,i, 78
εi, 52
ξ, 56
ξn(x), 102
ξn(x,X ), 103
ak, 142
bk, 142
bm, 118
b′n(θ), 7
bn(θ), 5
bn(θ, θn), 5
bn(x), 102
bn(x,X ), 103, 104
bm,q, 133
c, 51
cq, 132
f(θ |X1, . . . , Xn), 14
f∗n(x), 109

f∗n(t), 118

hn, 105
h∗n, 109
l′′(Xi, θ), 28
l′(Xi, θ), 6
l(Xi, θ), 4
li, 57
p(x, θ), 3
p0(x− θ), 4
r∗, 23
rDn, 69
r∗, 23
rn(Δn), 228
rn(θn, w), 16
tn, 13
tn(X1, . . . , Xn), 13
w(u), 11

wl(x(1), x(2)), 202

zn(θ), 29

Index

B-spline

standard, 153

B-splines, 152

shifted, 155

Ft-measurable event, 65

σ-algebra, 65

absolute loss function, 11

acceptance of null hypothesis, 227

adaptation, 211

adaptive estimator, 211

adaptive risk, 212

additive regression model, 197

anisotropic Hölder class of functions, 209
asymptotically exponential statistical experiment, 46
asymptotically Fisher efficient estimator, 22

asymptotically minimax

estimator, 103

lower bound, 23

rate of convergence, 23, 103

estimator, 33

asymptotically sharp minimax bounds, 23
asymptotically unbiased estimator, 21

autoregression coefficient, 75

autoregression, see autoregressive model, 75

coefficient, 75

autoregressive model, 75

balance equation, 108
bandwidth, 105
optimal, 108
basis
complete, 142
orthonormal, 141
trigonometric, 142
Bayes estimator, 13
Bayes risk, 13
bi-square kernel function, 105
bias, 5
bin, 132
bounded loss function, 11

change point, 51
change-point problem, 51
complete basis, 142
composite alternative hypothesis, 228
conjugate family of distributions, 15
conjugate prior distribution, 15
covariance matrix, 90
limiting, 94
Cramer-Rao inequality, 7
Cramer-Rao lower bound, 7
decision rule, 227
design, 86
regular deterministic, 93
regular random, 95
uniform, 94
design matrix, 87
detection, see on-line detection problem, 69


detector, see on-line detector, 69
deterministic regular design, 93
differentiable functional, 188
efficient estimator, 8
Epanechnikov kernel function, 105
estimator, 4
asymptotically unbiased, 21
maximum likelihood (MLE), 4, 33
projection, 147
adaptive, 211
asymptotically Fisher efficient, 22
Bayes, 13
efficient, 8
Fisher efficient, see efficient, 8
global linear, 105
linear, 104
local linear, 105
minimax, 16
more efficient, 12
on-line, 78
orthogonal series, 147
sequential, 69, 78
smoothing kernel, 107
superefficient, 22
unbiased, 5
expected detection delay, 69
explanatory variable, 85
false alarm probability, 69
filter, 66
first-order autoregressive model, 75
Fisher efficient estimator, see efficient estimator, 8
Fisher information, 6, 7
Fisher score function, 6
total, 6
fitted response vector, 87
functional
differentiable, 188
integral quadratic, 188
linear, 186
linear integral, 185
non-linear, 188
non-linear integral, 188
global linear estimator of regression function, 105
Hölder class of functions, 101
anisotropic, 209
Lipschitz condition, 102
smoothness, 101
Hellinger distance, 31
Hodges’ example, 22
hypotheses testing
parametric, 227
acceptance of null hypothesis, 227
composite alternative hypothesis, 228
decision rule, 227
minimax separation rate, 229
nonparametric, 228
null hypothesis, 227
rejection of null hypothesis, 227
separation between hypotheses, 228
simple alternative hypothesis, 227
type I error, 227
type II error, 228
hypothesis
simple alternative, 227
composite alternative, 228
null, 227
integral functional, 185
integral quadratic functional, 188
irregular statistical experiment, 43
kernel estimator
Nadaraya-Watson, 106
optimal smoothing, 109
smoothing, 107
kernel function, 105
Epanechnikov, 105
bi-square, 105
tri-cube, 112
triangular, 105
uniform, 105
kernel, see kernel function, 105
Kullback-Leibler information number, 58
LAN, see local asymptotic normality condition, 29
least-squares estimator
of regression coefficient, 88
of regression function, 90
of vector of regression coefficients, 89
likelihood ratio, 45
limiting covariance matrix, 94
linear estimator, 104
linear functional, 186
linear integral functional, 185
linear parametric regression model, 86
linear span-space, 87
Lipschitz condition, 102
Lipschitz function, 102
local asymptotic normality (LAN) condition, 29
local linear estimator of regression function, 105
local polynomial approximation, 115
local polynomial estimator, 118
location parameter, 4
log-likelihood function, 4
log-likelihood ratio, 24
loss function, 11
absolute, 11
bounded, 11
quadratic, 11
sup-norm, 102
lower bound
asymptotically minimax, 23
Cramer-Rao, 7
minimax, 18
Markov stopping time, see stopping time, 66
maximum likelihood estimator (MLE), 4, 33
maximum normalized risk, 16, 103
mean integrated squared error (MISE), 90
mean squared error (MSE), 90
mean squared risk at a point, see mean squared error (MSE), 90
measurable event, see Ft-measurable event, 65
up to random time, 68
minimax estimator, 16
minimax lower bound, 18
minimax risk, 16
of detection, 69
minimax risk of detection, 69
minimax separation rate, 229
more efficient estimator, 12
multiple regression model, 193
Nadaraya-Watson kernel estimator, 106
non-linear functional, 188
non-linear integral functional, 188
nonparametric hypotheses testing, 228
nonparametric regression model, 101
normal equations, 88
normalized quadratic risk, 12
normalized risk, 11
maximum, 16
normalized risk function, see normalized risk, 11
null hypothesis, 227
on-line detection problem, 69
on-line detector, 69
on-line estimation, 78
on-line estimator, 78
optimal bandwidth, 108, 118
optimal smoothing kernel estimator, 109
orthogonal series, see projection estimator, 147
orthonormal basis, 141
parametric hypotheses testing, 227
parametric regression model, 85
linear, 86
random error in, 86
partition of unity, 153
pixel, 208
point estimator, see estimator, 4
polynomial regression, 86
posterior density, 14
weighted, 14
posterior mean, 14
non-weighted, 14
weighted, 14
power spline, 156
predicted, see fitted response vector, 87
predictor variable, see explanatory variable, 85
prior density, 13
prior distribution, 13
conjugate, 15
projection, see orthogonal series estimator, 147
quadratic loss function, 11
random error, 85
random regular design, 95
random time, 68
random walk, two-sided Gaussian, 52
rate of convergence, 23
regression coefficient, 85
least-squares estimator of, 88
regression equation, 85, 101


regression function, 85
global linear estimator of, 105
least-squares estimator of, 90
linear estimator of, 104
local linear estimator of, 105
regression model
simple linear, 96
simple linear through the origin, 96
additive, 197
multiple, 193
nonparametric, 101
parametric, 85
simple, 85
single-index, 199
regressogram, 133
regular deterministic design, 93
regular random design, 95
regular statistical experiment, 7
rejection of null hypothesis, 227
residual, 92
response variable, 85
response, see response variable, 85
risk, 11
risk function, 11
normalized quadratic, 12
normalized, 11
sample mean, 5
scaled spline, 158
scatter plot, 86
separation between hypotheses, 228
sequence space, 146
sequential estimation, 65, 69, 78
sequential estimator, 69, 78
shifted B-splines, 155
sigma-algebra, see σ-algebra, 65
signal-to-noise ratio, 51
simple alternative hypothesis, 227
simple linear regression model, 96
simple linear regression through the origin, 96
simple regression model, 85
single-index regression model, 199
smoothing kernel, 107
smoothing kernel estimator, 107
optimal, 109
smoothness of Hölder class of functions, 101
spline
B-spline, 152
power, 156
scaled, 158
shifted B-spline, 155
standard B-spline, 153
standard B-spline, 153
statistical experiment, 3
regular, 7
asymptotically exponential, 46
irregular, 43
stopping time, 66
sup-norm loss function, 102
superefficient estimator, 22
superefficient point, 22
test function, 123, 168
time, 65
random, 68
total Fisher score function, 6
tri-cube kernel function, 112
triangular kernel function, 105
trigonometric basis, 142
two-sided Gaussian random walk, 52
type I error, 227
type II error, 228
unbiased estimator, 5
uniform design, 94
uniform kernel function, 105
vector of regression coefficients, 87
vector of regression coefficients
least-squares estimator of, 89
Wald’s first identity, 66
Wald’s second identity, 83
weight function, 185
weighted posterior density, 14
weighted posterior mean, 14


GSM/119


This book is designed to bridge the gap between traditional textbooks in statistics and more advanced books that include sophisticated nonparametric techniques. It covers topics in parametric and nonparametric large-sample estimation theory. The exposition is based on a collection of relatively simple statistical models, each given a thorough mathematical analysis with rigorous proofs and explanations. The book also includes a number of helpful exercises.

Prerequisites for the book include senior undergraduate/beginning graduate-level courses in probability and statistics.
