TRANSCRIPT
Statistical Properties of Numerical Derivatives
Han Hong, Aprajit Mahajan, and Denis Nekipelov
Stanford University and UC Berkeley
November 2010
1 / 63
Introduction
Motivation
Many models have objective functions that are computationally intensive to evaluate.
Sample objective functions might not have well-behaved analytic derivatives.
Common-sense solution: use numerical derivatives (Judd (1998)).
Statistical problem:
Extra nuisance parameter (the step size of numerical differentiation).
Not clear whether results for estimators with nuisance parameters (e.g. a bandwidth) translate directly to the case of numerical derivatives.
Can we obtain good estimators by solving numerical first-order conditions?
Can we provide relatively weak conditions on the step size if we are only interested in the value of the derivative?
2 / 63
Introduction
Motivation
Example (see Judd (1998)): need to compute the gradient of g, approximated with step size h.
First-order (one-sided) approximation:
  L^h_{1,1} g(x) = \frac{g(x+h) - g(x)}{h} = g'(x) + O(h)
or second-order (central) approximation:
  L^h_{1,2} g(x) = \frac{g(x+h) - g(x-h)}{2h} = g'(x) + O(h^2)
More generally, higher-order differentiation formulas.
Bias and variance tradeoffs.
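The two formulas above can be checked directly. A minimal sketch (assuming g = exp, so the true derivative is known exactly; not from the slides): the O(h) versus O(h²) error orders show up in the printed errors.

```python
import math

def forward_diff(g, x, h):
    # One-sided formula L^h_{1,1}: error O(h)
    return (g(x + h) - g(x)) / h

def central_diff(g, x, h):
    # Symmetric formula L^h_{1,2}: error O(h^2)
    return (g(x + h) - g(x - h)) / (2 * h)

# g = exp, so g'(1) = e exactly
for h in (1e-1, 1e-2, 1e-3):
    e1 = abs(forward_diff(math.exp, 1.0, h) - math.e)
    e2 = abs(central_diff(math.exp, 1.0, h) - math.e)
    print(f"h={h:.0e}  forward error={e1:.2e}  central error={e2:.2e}")
```

Shrinking h by a factor of 10 cuts the forward error by about 10 and the central error by about 100, until rounding error takes over.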
3 / 63
Introduction
Motivation
Conclusion: careful analysis of the properties of the function and application of higher-order numerical derivative techniques reduce the error by orders of magnitude.
The case where g is estimated is more complicated:
\hat g(x+h) and \hat g(x-h) are correlated.
Consistency of derivatives depends on the choice of approximating formula and step size.
Results depend on tradeoffs between (a) quality of the numerical approximation, (b) choice of step size, (c) smoothness of the population objective, (d) empirical process properties of \hat g.
4 / 63
Introduction
Literature background
Anderssen and Bloomfield (1974): differentiation of functions measured with noise.
Newey and McFadden (1994): sufficient conditions for convergence of the numerical Hessian.
Newey (1994), Newey and McFadden (1994): numerical Hessian in semiparametric models.
L'Ecuyer and Perron (1994): convergence rate and asymptotics for basic finite-difference formulas for smooth functions.
Powell (1984): censored LAD.
Buchinsky and Hahn (1998): alternative censored QR.
Simulation estimation of nonsmooth models (Pakes and Pollard (1989)).
Maximum score and smoothing (Manski (1975), Horowitz (1991)).
5 / 63
Introduction
What we find
Find convergence rates and statistical properties of numerical derivatives of possibly non-smooth objective functions.
While in some cases the choice of the step of numerical differentiation is based on a bias-variance tradeoff, in smooth classes of models there is no such tradeoff.
First-order conditions using numerical derivatives can still be used even if the sample objective function is not smooth (e.g. maximum score), provided that the step size is selected correctly.
Smoothness of the objective function influences the precise convergence rate of the derivative estimator and its limit distribution.
Requirements on the rate of convergence of the numerical derivative restrict the order of the numerical differentiation operator.
6 / 63
Introduction
Outline of the talk
Consistency of numerical partial derivatives in parametric and semiparametric models.
Sufficient conditions for consistency
Rates of convergence and distribution theory
M-estimators based on numerical first-order conditions.
Consistency of estimators
Convergence rate and optimal choice of the step of numerical differentiation
Monte-Carlo evidence.
7 / 63
Introduction
Numerical derivative: definition
Consider a function f(·) with p continuous derivatives.
We consider a class of linear operators L defined by a step size ε > 0, order m ∈ N, and weights a_k ∈ [-1, 1] and nodes t_k ∈ [-1, 1] for k = 1, …, K such that
  L f(x_0) = \frac{1}{ε^m} \sum_{k=1}^{K} a_k f(x_0 + ε t_k).
We call L^ε_{m,q} the m-th numerical derivative operator of order q ≤ p - m + 1 if
  L^ε_{m,q} g(x_0) = g^{(m)}(x_0) + o(ε^q).
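The operator L translates directly into code. A minimal sketch (the name `numerical_derivative` and the sin demo are illustrative, not from the slides): pass the weights a_k and nodes t_k and apply the definition literally.

```python
import math

def numerical_derivative(f, x0, eps, weights, nodes, m=1):
    # L f(x0) = eps^(-m) * sum_k a_k * f(x0 + eps * t_k)
    return sum(a * f(x0 + eps * t) for a, t in zip(weights, nodes)) / eps ** m

# Central first derivative (m = 1, order q = 2): a = (-1/2, 1/2), t = (-1, 1)
d = numerical_derivative(math.sin, 0.0, 1e-4, (-0.5, 0.5), (-1.0, 1.0))
print(d)  # close to cos(0) = 1
```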
8 / 63
Introduction
Numerical derivative: structure
To obtain the weights, use a Taylor expansion for small ε:
  f(x_0 + ε t_k) = \sum_{i=0}^{p} \frac{f^{(i)}(x_0)}{i!} ε^i t_k^i + o(ε^p).
Solve the system of equations
  \frac{1}{i!} \sum_{k=1}^{K} a_k t_k^i = δ_{i,m},
with boundary conditions t_1 = -1 and t_K = 1, where δ_{i,m} is the Kronecker symbol.
Example:
  \tilde D_3(θ) = L^{ε_n}_{1,3} M_n(θ) = \frac{M_n(θ - 2ε_n) - 8 M_n(θ - ε_n) + 8 M_n(θ + ε_n) - M_n(θ + 2ε_n)}{12 ε_n}.
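The Kronecker-delta system can be verified for the four-point example (here written with integer offsets c_k ∈ {-2, -1, 1, 2} and exact rational weights rather than the slide's normalized t_k ∈ [-1, 1]); since the only nonzero right-hand side is at i = m = 1, where 1! = 1, checking Σ_k a_k c_k^i suffices.

```python
from fractions import Fraction as F

# Four-point first-derivative formula: sum_k a_k f(x + c_k h) / h
nodes = (-2, -1, 1, 2)
weights = (F(1, 12), F(-8, 12), F(8, 12), F(-1, 12))

# sum_k a_k c_k^i must equal delta_{i,1} for i = 0..4
moments = [sum(a * c ** i for a, c in zip(weights, nodes)) for i in range(5)]
print(moments)  # [0, 1, 0, 0, 0]: exact for polynomials up to degree 4
```

Exactness through degree 4 is what delivers the O(ε⁴) error of this formula.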
9 / 63
Consistency of numerical derivatives
Statement of the problem
Parameter vectors (θ, h(·)):
θ: Euclidean components.
h(·): infinite-dimensional components.
Moment condition
  m(z; θ_0, h_0(·)) = E[ρ(x; θ_0, h_0(·)) | z] = 0,
where ρ(·) can be non-smooth, e.g. Ai and Chen (2003), Chen and Pouzo (2009).
An analytic expression might not exist for the derivative w.r.t. θ, the directional derivative w.r.t. h(·), or both.
Remark: even if ρ(·) is linear, there might be no analytic expression for the derivative, e.g.
  ρ(x; θ_0, h_0(·)) = h_0(z) - α h_0(x) - f(x, z; θ).
10 / 63
Consistency of numerical derivatives
Statement of the problem
The conditional moment might not be pointwise differentiable.
The conditional moment is "sufficiently smooth" in mean square:
  E\left[ \left\| m(z; θ, h(·)) - Δ_{1θ}(θ - θ_0) - Δ_{1h}[δ] - \cdots - \sum_{p_1+p_2=p} Δ_{p,θ^{p_1},h^{p_2}} [δ]^{p_1} (θ - θ_0)^{p_2} \right\|^2 \right] = o\left( \|δ\|_2^{2ν} + \|θ - θ_0\|^{2ν} \right).
We can still use a numerical derivative operator, for example,
  \hat D_{w_j}(z) = \frac{\hat m(z; \hat θ + e_j ε_n, \hat h(·)) - \hat m(z; \hat θ - e_j ε_n, \hat h(·))}{2 ε_n} - \frac{\hat m(z; \hat θ, \hat h(·) + τ_n w_j(·)) - \hat m(z; \hat θ, \hat h(·) - τ_n w_j(·))}{2 τ_n}.
11 / 63
Consistency of numerical derivatives
Statement of the problem
Need to compute numerically, for each w_j,
  D_{w_j}(z) = \frac{∂ m(z; θ, h(·))}{∂ θ_j} - \frac{∂ m(z; θ, h(·))}{∂ h}[w_j].
The asymptotic variance depends on w_j^*, which solves
  \min_{w_j ∈ W} D_{w_j}(z)' Σ_0(z)^{-1} D_{w_j}(z),
where Σ_0(z) = E[ρ(x; θ_0, h_0(·)) ρ(x; θ_0, h_0(·))' | z].
Use \hat h(·), \hat θ, and the numerically evaluated w^* to estimate the asymptotic variance.
12 / 63
Consistency of numerical derivatives
Numerical derivative: problem
This structure extends to the case where f is a functional and X is a functional space, by taking directional derivatives.
In particular, consider the first derivative of order q:
  \hat D_{w_j^*}(z) = L^{ε_n, θ_j}_{1,q} \hat m(z; \hat θ, \hat h(·)) - L^{τ_n, w_j}_{1,q} \hat m(z; \hat θ, \hat h(·)).
Question 1: how to choose τ_n and ε_n as n → ∞ for a derivative of given order q?
Sufficient condition on the step sizes of the partial and directional derivatives such that
  \hat D_{w_j^*}(z) \xrightarrow{p} D_{w_j^*}(z).
13 / 63
Consistency of numerical derivatives
Numerical derivative: solution
The conditional moment might not be pointwise differentiable but has continuous L_2 derivatives.
High-level assumption for convergence of the conditional expectation estimator:
  \sup_{(θ, h(·)) ∈ U_γ} \frac{ n^{1/k} \| \hat m(z; θ, h(·)) - m(z; θ, h(·)) - \hat m(z; θ_0, h_0(·)) \| }{ 1 + n^{1/k} \| \hat m(z; θ, h(·)) \| + n^{1/k} \| m(z; θ, h(·)) \| } = o_p(1),
for any sequence of shrinking neighborhoods U_γ as γ → 0.
More precise conditions can be given for a particular estimator of m(·).
Nonparametric rate for \hat h(·), for some k_1:
  n^{1/k_1} \| \hat h(·) - h_0(·) \| = O_p(1), \quad n^{1/2} \| \hat θ - θ_0 \| = O_p(1).
14 / 63
Consistency of numerical derivatives
Numerical derivative: solution
Theorem (1)
Under the provided assumptions, if
  ε_n n^{1/\max\{k, k_1\}} → ∞, ε_n → 0,
and
  τ_n n^{1/\max\{k, k_1\}} → ∞, τ_n → 0,
then
  \sup_{z ∈ Z} | \hat D_w(z) - D_w(z) | \xrightarrow{p} 0.
15 / 63
Consistency of numerical derivatives
Numerical derivative: discussion
This result closely follows existing results in Newey and McFadden (1994) and Powell (1984).
The provided conditions imply a rate slower than 1/√n for ε_n in the parametric case.
In the semiparametric case, a stronger condition is needed to control for the slow nonparametric rate.
Such sufficient conditions are too strong.
Sharper results can be derived: even for nonsmooth models, we only need nε → ∞ in the parametric case.
16 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
Can derive sharper results if we know more detail about the moment function.
Moment vector
  \hat g(θ) = \frac{1}{n} \sum_{i=1}^{n} g(Z_i, θ),
and the estimator \hat θ equates it to zero.
Need to estimate G = ∂g(θ_0)/∂θ using L^{ε_n}_{1,p} \hat g(\hat θ).
Decompose L^{ε_n}_{1,p} \hat g(\hat θ) - G = G_1 + G_2 + G_3:
  G_1 = L^{ε_n}_{1,p} \hat g(\hat θ) - L^{ε_n}_{1,p} g(\hat θ),
  G_2 = L^{ε_n}_{1,p} g(\hat θ) - G(\hat θ),
  G_3 = G(\hat θ) - G.
G_3 = O_p(\hat θ - θ_0), and it has no relation to ε_n.
17 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
Assumption (1)
A (2p+1)-th order mean value expansion applies to the limiting function g(θ) uniformly in a neighborhood of θ_0. For all ε sufficiently small and r = 2p + 1,
  \sup_{θ ∈ N(θ_0)} \left| g(θ + ε) - \sum_{l=0}^{r} \frac{ε^l}{l!} g^{(l)}(θ) \right| = O(ε^{r+1}).
The bias term G_2(\hat θ) can be controlled if the bias reduction is uniformly small in a neighborhood of θ_0.
An immediate consequence of this assumption is that
  G_2(\hat θ) = O(ε^{2p}).
18 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
The weakest possible condition to control G_1(\hat θ) that covers all the models that we are aware of seems to come from a convergence rate result in Pollard (1984).
Assumption (2)
Define F = \{g(·, θ) : θ ∈ Θ\}. The graphs of the functions in F have polynomial discrimination.
Most functions in econometric applications fall into this category. By Lemmas 25 and 36 of Pollard (1984), Assumption 2 implies that there exist universal constants A > 0 and V > 0 such that for any F_n ⊂ F with envelope function F_n,
  \sup_Q N_1(ε\, Q F_n, Q, F_n) ≤ A ε^{-V}, \quad \sup_Q N_2\left( ε (Q F_n^2)^{1/2}, Q, F_n \right) ≤ A ε^{-V}.
19 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
Lemma (2)
For a neighborhood N(θ_0) around θ_0, suppose \|F\| = \sup_{θ ∈ N(θ_0)} |g(Z_i, θ)| < ∞, and for all ε small enough,
  \sup_{θ ∈ N(θ_0)} E\left[ (g(Z_i, θ + ε) - g(Z_i, θ - ε))^2 \right] = O(ε). (1)
Then under Assumption 2, if n ε_n / \log n → ∞,
  \sup_{d(θ, θ_0) = o(1)} \| L^{ε_n}_{1,p} \hat g(θ) - L^{ε_n}_{1,p} g(θ) \| = o_p(1).
Consequently, Assumption 2 implies that G_1(\hat θ) = o_p(1) if d(\hat θ, θ_0) = o_p(1).
20 / 63
Consistency of numerical derivatives
Theorem (2)
Under Assumptions 1 and 2 and the conditions of Lemma 2, L^{ε_n}_{1,p} \hat g(\hat θ) \xrightarrow{p} G(θ_0) if ε_n → 0, n ε_n / \log n → ∞, and d(\hat θ, θ_0) = o_p(1).
This is the weakest sufficient condition that we are able to come up with without making stronger assumptions. The case of an indicator function in g(Z_i, θ) typically corresponds to γ = 1/2. However, this condition can be improved if we are willing to impose the following stronger assumption, which holds for smoother functions such as those that are Hölder-continuous.
21 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
Assumption (3)
The moment condition is mean square differentiable. Define G(θ) = \frac{1}{\sqrt n} \sum_{i=1}^{n} (g(Z_i, θ) - g(θ)). There exists some ε > 0 such that, for all δ sufficiently small,
  E^* \sup_{d(θ_1,θ_2)<δ,\; d(θ_1,θ_0)<ε,\; d(θ_2,θ_0)<ε} |G(θ_1) - G(θ_2)| \lesssim φ_n(δ),
for functions φ_n(·) such that δ ↦ φ_n(δ)/δ^γ is decreasing for some γ > 0 and γ ≤ 1.
22 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
This assumption allows us to put an envelope directly on G_1.
This assumption is more stringent than Theorem 3.2.5 in van der Vaart and Wellner, and may fail in cases where Theorem 3.2.5 holds, for example with indicator functions.
Define the class of functions M^ε_δ = \{ g(Z_i, θ_1) - g(Z_i, θ_2) : d(θ_1, θ_2) ≤ δ,\; d(θ_1, θ_0) < ε,\; d(θ_2, θ_0) < ε \}.
Assumption 3, which requires bounding E^*_P \|G_n\|_{M^ε_δ}, can be verified by invoking the maximal inequalities in Theorems 2.14.1 and 2.14.2 of van der Vaart and Wellner.
23 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
These tail bounds provide that, for M^ε_δ an envelope function of the class of functions M^ε_δ,
  E^*_P \|G_n\|_{M^ε_δ} \lesssim J(1, M^ε_δ) \left( P^* (M^ε_δ)^2 \right)^{1/2},
  E^*_P \|G_n\|_{M^ε_δ} \lesssim J_{[\,]}(1, M^ε_δ, L_2(P)) \left( P^* (M^ε_δ)^2 \right)^{1/2},
where J(1, M^ε_δ) and J_{[\,]}(1, M^ε_δ, L_2(P)) are the uniform and bracketing entropy integrals.
The following result shows that for smooth functions g(Z_i, θ) that are Lipschitz in θ, the only condition needed for consistency is ε_n → 0.
Theorem
Under Assumptions 1 and 3, L^{ε_n}_{1,p} \hat g(\hat θ) \xrightarrow{p} G(θ_0) if ε_n → 0, n ε^{2-2γ} → ∞, and d(\hat θ, θ_0) = o_p(1).
24 / 63
Consistency of numerical derivatives
Numerical derivative: rate of convergence
Theorem
Assume that
  E^* \sup_{d(θ_1,θ_2)<δ,\; d(θ_1,θ_0)<ε,\; d(θ_2,θ_0)<ε} |G(θ_1) - G(θ_2)| \lesssim φ_n(δ ∧ ε),
where δ ↦ φ(δ)/δ^γ is decreasing for some 0 < γ < 1, and that \hat θ - θ_0 = O_p(n^{-η}) for η > 0. Provided that η ≥ \frac{1}{1-γ+2ν}, the best rate of convergence of L^ε_{1,p} \hat g(\hat θ) to G is achieved when ε_n = O\left( n^{-1/(1-γ+2ν)} \right), in which case
  \| L^ε_{1,p} \hat g(\hat θ) - G \|^2 = O_p\left( n^{-\frac{2ν}{1-γ+2ν}} \right).
25 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
If γ < 1 and the numerical differentiation operator guarantees residual order ν ≥ 1, we can have the parametric rate of convergence η = 1/2.
In smooth models (γ = 1) there is no bias-variance tradeoff ⇒ a smaller step size ε leads to a smaller bias.
The order of the root MSE in the smooth case is bounded below by a variance term of O(1/√n) for sufficiently small ε_n.
26 / 63
Consistency of numerical derivatives
Example
Simple example: m(z_i; θ) = 1(z_i ≤ θ) - τ.
Numerical derivative:
  \frac{1}{n} \sum_{i=1}^{n} \frac{1(z_i ≤ \hat θ + ε) - 1(z_i ≤ \hat θ - ε)}{2ε} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{ε}\, U\left( \frac{z_i - θ_0 - (\hat θ - θ_0)}{ε} \right).
The consistency conditions in Powell (1984) and Newey and McFadden (1994), both of which require √n ε → ∞, are too strong.
This is not necessary: \hat f(x) is uniformly consistent for f(x).
Only need n ε / \log n → ∞.
The second part of the noise, due to \hat θ - θ_0 entering through (\hat θ - θ_0)/ε, will vanish:
  \hat θ - θ_0 \xrightarrow{p} 0 ⇒ \hat f(\hat θ - θ_0) \xrightarrow{p} f(0) = f_z(θ_0).
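This example can be simulated. A minimal sketch (standard normal z_i, so f_z(θ_0) = φ(0) at θ_0 = 0 is known; the step ε = n^{-1/2} is chosen so that √n ε stays bounded, violating the √n ε → ∞ condition, while nε/log n → ∞ still holds):

```python
import math
import random

random.seed(0)
n = 200_000
z = [random.gauss(0.0, 1.0) for _ in range(n)]

def numeric_density(sample, theta, eps):
    # (1/n) sum_i [1(z_i <= theta + eps) - 1(z_i <= theta - eps)] / (2 eps)
    count = sum(theta - eps < zi <= theta + eps for zi in sample)
    return count / (len(sample) * 2 * eps)

eps = n ** -0.5                      # sqrt(n)*eps = 1, but n*eps/log(n) -> infinity
est = numeric_density(z, 0.0, eps)
true = 1 / math.sqrt(2 * math.pi)    # phi(0) = 0.3989...
print(est, true)
```

Despite failing the stronger √n ε → ∞ requirement, the estimate still converges to φ(0).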
27 / 63
Consistency of numerical derivatives
Numerical derivative: the sieve infinite dimensional case
Infinite-dimensional θ.
Compute the directional derivative G_h = \left. \frac{d\, m(θ_0, η_0 + τ h, x)}{dτ} \right|_{τ=0} numerically:
  L^{ε_n, h}_{1,p} \hat m(θ, η, x) = \frac{1}{ε_n} \sum_k a_k \hat m(θ, η + t_k h ε_n, x).
Two methods to estimate the conditional expectation: sieve (series) and kernel.
Series:
  \hat m(θ, η, z) = p^N(z)' \left( \frac{1}{n} \sum_{i=1}^{n} p^N(z_i) p^N(z_i)' \right)^{-1} \frac{1}{n} \sum_{i=1}^{n} p^N(z_i) ρ(θ, η; y_i),
and kernel:
  \hat m(θ, η, z) = \left( \frac{1}{n b_n^{d_z}} \sum_{i=1}^{n} K\left( \frac{z_i - z}{b_n} \right) \right)^{-1} \frac{1}{n b_n^{d_z}} \sum_{i=1}^{n} K\left( \frac{z_i - z}{b_n} \right) ρ(θ, η; y_i).
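The series version of \hat m can be sketched in a few lines. Everything here is illustrative (a univariate z, a power-series basis with N = 4 terms, and a made-up ρ with E[ρ | z] = z²); the code just forms the normal equations from the displayed formula:

```python
import random

random.seed(4)
n = 1000
z = [random.uniform(-1, 1) for _ in range(n)]
# Hypothetical residual function rho with E[rho | z] = z^2
rho = [zi ** 2 + random.gauss(0, 0.1) for zi in z]

N = 4                                   # number of sieve terms
def pN(v):                              # power-series basis p^N(z)
    return [v ** j for j in range(N)]

# Normal equations for m_hat(z) = p(z)' (sum p p' / n)^(-1) (sum p rho / n)
A = [[sum(pN(zi)[j] * pN(zi)[k] for zi in z) / n for k in range(N)]
     for j in range(N)]
b = [sum(pN(zi)[j] * ri for zi, ri in zip(z, rho)) / n for j in range(N)]

def solve(A, b):
    # Tiny Gauss-Jordan elimination with partial pivoting
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(N):
        p = max(range(i, N), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(N):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [a - f * c for a, c in zip(M[r], M[i])]
    return [M[i][N] / M[i][i] for i in range(N)]

beta = solve(A, b)
def m_hat(v):
    return sum(bj * pj for bj, pj in zip(beta, pN(v)))

print(m_hat(0.5))  # close to E[rho | z = 0.5] = 0.25
```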
28 / 63
Consistency of numerical derivatives
Assumption
For the basis functions p^N(z) the following holds:
(i) The smallest eigenvalue of E[p^N(Z_i) p^N(Z_i)'] is bounded away from zero uniformly in N.
(ii) For some C > 0, \sup_{z ∈ Z} \|p^N(z)\| ≤ C < ∞.
(iii) The population conditional moment belongs to the completion of the sieve space and
  \sup_{(θ,η) ∈ Θ×H} \sup_{z ∈ Z} \left\| m(θ, η, z) - \mathrm{proj}\left( m(θ, η, z) \,|\, p^N(z) \right) \right\| = O(N^{-α}).
29 / 63
Consistency of numerical derivatives
Assumption
(i) Uniformly bounded moment functions: \sup_{θ,η} \|ρ(θ, η, ·)\| ≤ C. The density of the covariates Z is bounded away from zero on its support.
(ii) Suppose that 0 ∈ H_n and, for ε_n → 0 and some C > 0,
  \sup_{z ∈ Z,\; η, w ∈ H_n,\; |η|, |w| < C,\; θ ∈ N(θ_0)} \mathrm{Var}\left( ρ(θ, η + ε_n w; Y_i) - ρ(θ, η - ε_n w; Y_i) \,|\, z \right) = O(ε_n).
(iii) For each n, the class of functions F_n = \{ ρ(θ, η + ε_n w; ·) - ρ(θ, η - ε_n w; ·) : θ ∈ Θ,\; η, w ∈ H_n \} is Euclidean with coefficients that depend on the number of sieve terms. In other words, there exist constants A and 0 ≤ r_0 < \frac{1}{2} such that the covering number satisfies
  \log N(δ, F_n, L_1) ≤ A\, n^{2 r_0} \log\left( \frac{1}{δ} \right).
30 / 63
Consistency of numerical derivatives
Lemma
Suppose that ρ(π_n η, η) = O_p(n^{-φ}). Under Assumptions 4 and 5,
  \sup_{d(θ,θ_0)=o(1),\; d(η,η_0)=o(1),\; η ∈ H_n} \left| L^{ε_n,w}_{1,p} \hat m(θ, η, z) - L^{ε_n,w}_{1,p} m(θ, η, z) \right| = o_p(1)
uniformly in z and w, provided that ε_n → 0, ε_n \min\{N^α, n^φ\} → ∞, and \frac{n ε_n}{N^2 n^{2 r_0} \log n} → ∞.
31 / 63
Consistency of numerical derivatives
Assumption
K(·) is an m-th order kernel function that is an element of the class of functions F defined in Assumption 2. It integrates to 1, it is bounded, and its square has a finite integral.
Lemma
Under Assumptions 5 and 6,
  \sup_{d(θ,θ_0)=o(1),\; d(η,η_0)=o(1),\; η ∈ H_n} \left| L^{ε_n,w}_{1,p} \hat m(θ, η, z) - L^{ε_n,w}_{1,p} m(θ, η, z) \right| = o_p(1)
uniformly in w and z where f(z) is strictly positive for the kernel estimator, provided that ε_n → 0, b_n → 0, ε_n \min\{b_n^{-N}, n^φ\} → ∞, and \frac{n ε_n b_n^{d_z}}{n^{2 r_0} \log n} → ∞.
32 / 63
Consistency of numerical derivatives
Theorem
Under Assumptions 1, 5, and either 4 or 6,
  L^{ε_n,w}_{1,p} \hat m(\hat θ, \hat η, z) \xrightarrow{p} \frac{∂ m(θ, η, z)}{∂η}[w],
uniformly in z and w, if N → ∞, ε_n \min\{N^α, n^φ\} → ∞, and \frac{n ε_n}{N^2 n^{2 r_0} \log n} → ∞ for the series estimator, and b_n → 0, ε_n \min\{b_n^{-N}, n^φ\} → ∞, and \frac{n ε_n b_n^{d_z}}{n^{2 r_0} \log n} → ∞ for the kernel-based estimator, provided that d(\hat θ, θ_0) = o_p(1) and d(\hat η, η_0) = o_p(1).
33 / 63
Consistency of numerical derivatives
Holder-continuous moment functions
Assumption (5(i′))
For any sufficiently small ε > 0,
  \sup_{(θ,η) ∈ Θ×H,\; w ∈ H,\; |w| < C} \|ρ(θ, η + wε, z) - ρ(θ, η - wε, z)\| ≤ C(z)\, ε^γ,
where 0 < γ ≤ 1 and E[C(Z)^2] < ∞.
34 / 63
Consistency of numerical derivatives
Lemma
Suppose that ρ(π_n η, η) = O(n^{-φ}). Under either Assumptions 4 and 5(i′),(ii),(iii),(iv), or Assumptions 6 and 5(i′),(ii),(iii),(iv),
  \sup_{d(θ,θ_0)=o_p(1),\; d(η,η_0)=o_p(1),\; η ∈ H_n} \left| L^{ε_n,w}_{1,p} \hat m(θ, η, z) - L^{ε_n,w}_{1,p} m(θ_0, η_0, z) \right| = o_p(1)
uniformly in z and w, provided that ε_n → 0, ε_n \min\{N^α, n^φ\} → ∞, and \frac{\sqrt n\, ε_n^{1-γ}}{N n^{r_0}} → ∞ for the series estimator, and b_n → 0, ε_n \min\{b_n^{-q}, n^φ\} → ∞, and \frac{n^{1-2 r_0} ε_n^{2-2γ} b_n^{d_z}}{\log n} → ∞ for the kernel estimator.
35 / 63
Gradient-based estimation with numerical derivatives
Definitions
Consider a form of Z-estimator.
Estimate the parameter θ_0 in a metric space (Θ, d).
The first-order condition is hard to compute analytically.
The sample objective function is
  \hat Q_n(θ) = \frac{1}{n} \sum_{i=1}^{n} g(z_i, θ).
The estimator \hat θ_n ∈ Θ solves the empirical first-order condition:
  \| \hat D_n(\hat θ_n) \| = \| L^{ε_n}_{1,p} \hat Q_n(\hat θ_n) \| = o_p\left( \frac{1}{\sqrt n} \right).
36 / 63
Gradient-based estimation with numerical derivatives
Assumptions
An identification condition:
Assumption
The map D : Θ → R^k defined by D(θ) = \frac{∂}{∂θ} E[g(z_i, θ)] is identified at θ_0 ∈ Θ. In other words, \lim_{n→∞} \|D(θ_n)\| = 0 implies \lim_{n→∞} \|θ_n - θ_0\| = 0 for any sequence θ_n ∈ Θ.
A "continuity in probability" condition:
Assumption
The parameter space Θ has a compact cover. For each n, there exists a countable subset T_n ⊂ Θ such that
  P^*\left( \sup_{θ ∈ Θ} \inf_{θ' ∈ T_n} \| g(Z_i, θ) - g(Z_i, θ') \|^2 > 0 \right) = 0.
37 / 63
Gradient-based estimation with numerical derivatives
Consistency
Functions with absolutely bounded finite differences:
Theorem
Under Assumptions 7, 8, 1, and 2, as long as ε_n → 0 and \frac{n ε_n}{\log n} → ∞,
  \sup_{θ ∈ Θ} \| L^{ε_n}_{1,p} \hat Q(θ) - G(θ) \| = o_p(1).
Consequently, \hat θ \xrightarrow{p} θ_0 if \| L^{ε_n}_{1,p} \hat Q(\hat θ) \| = o_p(1).
The Hölder-continuous case is also similar to variance estimation.
38 / 63
Gradient-based estimation with numerical derivatives
Rate of convergence
Lemma 1 also provides the rates of convergence: for \hat θ \xrightarrow{p} θ_0,
  \sup_{θ ∈ N(θ_0)} \sqrt{\frac{n ε_n}{\log n}}\, \| L^{ε_n}_{1,p} \hat Q(θ) - L^{ε_n}_{1,p} Q(θ) \| = O_p(1).
Initial parameter rate of convergence:
Lemma
Suppose \hat θ \xrightarrow{p} θ_0 and L^ε_{1,p} \hat Q(\hat θ) = o_p\left( \frac{1}{\sqrt{n ε_n}} \right). Under the assumptions of Theorem 10, if n ε_n / \log n → ∞ and n ε^{1+4p} / \log n = O(1), then
  \sqrt{\frac{n ε_n}{\log n}}\, d(\hat θ, θ_0) = O_{P^*}(1).
39 / 63
Gradient-based estimation with numerical derivatives
Rate of convergence
Lemma
Under the conditions of Theorem 10, with n ε_n^{1+4p} = o(1) and \frac{n ε_n^3}{\log n} → ∞, we have
  \sup_{d(θ, θ_0) = O\left( \sqrt{\log n / (n ε_n)} \right)} \left( L^{ε_n}_{1,p} \hat Q(θ) - L^{ε_n}_{1,p} \hat Q(θ_0) - L^{ε_n}_{1,p} Q(θ) + L^{ε_n}_{1,p} Q(θ_0) \right) = o_p\left( \frac{1}{\sqrt{n ε_n}} \right).
Theorem
Suppose \hat θ \xrightarrow{p} θ_0 and L^ε_{1,p} \hat Q(\hat θ) = o_p\left( \frac{1}{\sqrt{n ε_n}} \right). Under the assumptions of Theorem 10, if n ε_n^3 / \log n → ∞ and \sqrt n\, ε^{1+p} = O(1), then
  \sqrt{n ε_n}\, d(\hat θ, θ_0) = O_{P^*}(1).
40 / 63
Gradient-based estimation with numerical derivatives
Distribution
Assumption
A CLT holds: as n → ∞ and ε_n → 0,
  \frac{G_n(θ_0 + ε_n) - G_n(θ_0 - ε_n)}{\sqrt{ε_n}} \xrightarrow{d} N(0, Ω).
Theorem
Assume that the conditions of Theorem 10 hold, but with n ε_n^{1+4p} = o(1). In addition, suppose that the Hessian matrix H(θ) of g(θ) is continuous, nonsingular and finite at θ_0. Then, if Assumption 9 holds and n ε_n^3 → ∞,
  \sqrt{n ε_n}\, (\hat θ - θ_0) \xrightarrow{d} N\left( 0, H(θ_0)^{-1} Ω H(θ_0)^{-1} \right).
41 / 63
Gradient-based estimation with numerical derivatives
Functions with polynomial envelopes for finite differences
Theorem
Suppose \hat θ \xrightarrow{p} θ_0 and L^ε_{1,p} \hat Q(\hat θ) = o_p\left( \frac{1}{\sqrt{n ε_n^{1-γ}}} \right). Under Assumptions 3 and 8, if n ε_n^{2-2γ} → ∞ and \sqrt n\, ε_n^{1-γ+2p} = O(1), and if the Hessian matrix H(θ) of g(θ) is continuous, nonsingular and finite at θ_0, then
  \sqrt{n ε_n^{1-γ}}\, d(\hat θ, θ_0) = O_{P^*}(1).
Theorem
Assume that the conditions of Theorem 15 hold, but with \sqrt n\, ε_n^{1+2p-γ} = o(1). If \lim_{ε→0} ε^{2-2γ}\, \mathrm{Var}\left( L^ε_{1,p} g(Z_i, θ_0) \right) = Ω and \sqrt n\, ε_n^{2-γ} → ∞, then
  \sqrt{n ε_n^{1-γ}}\, (\hat θ - θ_0) \xrightarrow{d} N\left( 0, H(θ_0)^{-1} Ω H(θ_0)^{-1} \right).
42 / 63
Gradient-based estimation with numerical derivatives
Stronger assumptions for smooth models
Proposition
Suppose the conditions of Theorem 16 hold except \sqrt n\, ε_n^{2-γ} → ∞. Suppose further that g(z_i, θ) is mean square differentiable in a neighborhood of θ_0: there are measurable functions D(·,·) : Z × Θ → R^p such that
  E\left[ g(Z, θ_1) - g(Z, θ_2) - (θ_2 - θ_1)' D(Z, θ_1) \right]^2 = o\left( \|θ_1 - θ_2\|^2 \right),
and E \|D(Z, θ_1)\|^2 < ∞ for all θ_1, θ_2 ∈ N_{θ_0}. Define q_ε(z_i, θ) = L^ε_{1,p} g(z_i; θ) - D(z_i, θ), and assume that
  \sup_{d(θ, θ_0) = o(1),\; ε = o(1)} \left[ G q_ε(z_i, θ) - G q_ε(z_i, θ_0) \right] = o_p(1),
and that D(z_i, θ) is Donsker on \{ d(θ, θ_0) ≤ δ \}. Then the conclusion of Theorem 16 holds.
Example: quantile regression.
43 / 63
Gradient-based estimation with numerical derivatives
Example
Our theory does not require smoothness of the moment function, and applies to non-smooth objective functions.
Example: maximum score.
Simple case: one regressor with normalized coefficient. Objective function
  Q_n(θ) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \frac{1}{2} \right) 1\{ x_i + θ > 0 \}.
Numerical first-order condition: set the numerical gradient equal to zero,
  L^{ε_n}_{1,2} Q_n(θ) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \frac{1}{2} \right) \frac{1}{ε_n}\, U\left( \frac{x_i + θ}{ε_n} \right) = 0,
where U(·) is the uniform kernel.
44 / 63
Gradient-based estimation with numerical derivatives
Example
Comparison with smoothed maximum score.
Use the integrated uniform kernel
  K(z) = \frac{1}{2} \left( 1\{z > -1\} + z\, 1\{z ∈ [-1, 1]\} + 1\{z > 1\} \right).
Smoothed objective function
  Q^s_n(θ) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \frac{1}{2} \right) K\left( \frac{x_i + θ}{h_n} \right),
with bandwidth h_n.
First-order condition:
  \frac{∂}{∂θ} Q^s_n(θ) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \frac{1}{2} \right) \frac{1}{h_n}\, U\left( \frac{x_i + θ}{h_n} \right) = 0.
This is identical to equating the numerical gradient to zero!
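The identity can be confirmed on simulated data. A minimal sketch (the DGP, with P(y = 1 | x) = 0.8 when x + θ0 > 0 and 0.2 otherwise, is invented for illustration); the uniform-kernel indicator is written half-open so that the two expressions match exactly:

```python
import random

random.seed(1)
n = 500
theta0 = 0.5
x = [random.uniform(-2, 2) for _ in range(n)]
y = [1 if random.random() < (0.8 if xi + theta0 > 0 else 0.2) else 0 for xi in x]

def Q(theta):
    # Maximum score objective: (1/n) sum (y_i - 1/2) 1{x_i + theta > 0}
    return sum((yi - 0.5) * (xi + theta > 0) for xi, yi in zip(x, y)) / n

def num_grad(theta, eps):
    # Central numerical derivative of the max-score objective
    return (Q(theta + eps) - Q(theta - eps)) / (2 * eps)

def smoothed_foc(theta, h):
    # Smoothed max score FOC: uniform kernel U(u) = 0.5 on (-1, 1]
    return sum((yi - 0.5) * (-1.0 < (xi + theta) / h <= 1.0) * 0.5 / h
               for xi, yi in zip(x, y)) / n

print(num_grad(0.1, 0.2), smoothed_foc(0.1, 0.2))  # agree up to rounding
```

With step size equal to the bandwidth, the two quantities are the same sum, term by term.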
45 / 63
Gradient-based estimation with numerical derivatives
Example
Both the numerical gradient-based procedure and smoothed maximum score have the same convergence rate \sqrt{n ε_n}.
The step size parameter is equivalent to the bandwidth.
If one uses any gradient-search method with maximum-score-type objective functions, the asymptotic distribution of Kim and Pollard (1990) is not applicable.
46 / 63
Monte-Carlo Simulations
Setup
Analyze minimizers of
  Q_n(θ) = \frac{1}{n} \sum_{i=1}^{n} |x_i - θ|^α.
In population,
  Q(θ) = E[ |x_i - θ|^α ].
x_i ∼ N(0, 1).
The estimator solves
  M_n(\hat θ) = \frac{1}{n} \sum_{i=1}^{n} \left( 1\{ x_i ≤ \hat θ \} - \frac{1}{2} \right) |x_i - \hat θ|^{α-1} = 0.
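A minimal sketch of this estimator (α = 1.5, θ0 = 0; bisection is legitimate because each summand, hence M_n, is nondecreasing in θ):

```python
import random

random.seed(2)
alpha = 1.5
x = [random.gauss(0.0, 1.0) for _ in range(2000)]

def M(theta):
    # Sample first-order condition M_n(theta)
    return sum(((xi <= theta) - 0.5) * abs(xi - theta) ** (alpha - 1)
               for xi in x) / len(x)

def solve_root(lo=-3.0, hi=3.0, tol=1e-8):
    # M_n is nondecreasing in theta, so bisection finds its zero
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if M(mid) < 0 else (lo, mid)
    return (lo + hi) / 2

theta_hat = solve_root()
print(theta_hat)  # close to theta_0 = 0
```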
47 / 63
Monte-Carlo Simulations
Problem
The solution is well-defined if α ≥ 1; α = 1 gives LAD.
Look at the numerical derivative of M_n(θ) (the Hessian of the sample objective function).
Set θ_0 = 0.
The population Hessian can be computed analytically:
  D(0) = 1\{α = 1\}\, φ(0) + 1\{α > 1\}\, \frac{α - 1}{4 \sqrt π}\, 2^{α/2}\, Γ\left( \frac{α - 1}{2} \right).
Compare L^ε_{1,p} M_n(\hat θ) with D(0).
48 / 63
Monte-Carlo Simulations
Methodology
First-order formula:
  \tilde D_1(θ) = L^{ε_n}_{1,1} M_n(θ) = \frac{M_n(θ + ε_n) - M_n(θ)}{ε_n}.
Second-order formula:
  \tilde D_2(θ) = L^{ε_n}_{1,2} M_n(θ) = \frac{M_n(θ + ε_n) - M_n(θ - ε_n)}{2 ε_n}.
Third-order formula:
  \tilde D_3(θ) = L^{ε_n}_{1,3} M_n(θ) = \frac{M_n(θ - 2ε_n) - 8 M_n(θ - ε_n) + 8 M_n(θ + ε_n) - M_n(θ + 2ε_n)}{12 ε_n}.
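The three formulas can be applied to M_n in the smooth case α = 2.5 at θ0 = 0 and compared against the analytic D(0); the step ε = 2 n^{-1/4} is one of the rates used below:

```python
import math
import random

random.seed(3)
alpha = 2.5
x = [random.gauss(0.0, 1.0) for _ in range(4000)]
n = len(x)

def M(t):
    return sum(((xi <= t) - 0.5) * abs(xi - t) ** (alpha - 1) for xi in x) / n

def D1(t, e):   # first-order (one-sided) formula
    return (M(t + e) - M(t)) / e

def D2(t, e):   # second-order (central) formula
    return (M(t + e) - M(t - e)) / (2 * e)

def D3(t, e):   # third-order (four-point) formula
    return (M(t - 2 * e) - 8 * M(t - e) + 8 * M(t + e) - M(t + 2 * e)) / (12 * e)

# Analytic population Hessian at theta_0 = 0 for alpha > 1
D0 = (alpha - 1) / (4 * math.sqrt(math.pi)) * 2 ** (alpha / 2) \
     * math.gamma((alpha - 1) / 2)

e = 2 * n ** -0.25
print(D0, D1(0.0, e), D2(0.0, e), D3(0.0, e))
```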
49 / 63
Monte-Carlo Simulations
Methodology
Choose α = 2.5, 1.5 and 1.
Numerical and analytic derivatives coincide when α = 2.
Use different rates for step size of numerical differentiation.
Use different order formulas.
50 / 63
Monte-Carlo Simulations
Methodology
We generate 1000 Monte-Carlo samples with the number of observations ranging from 500 to 4000.
For each sample s we find the estimate \hat θ_s by solving
  M_{n_s}(\hat θ_s) = 0.
We choose a sample-adaptive step of numerical differentiation ε = C (n_s)^{-q}, with C = 2 and q ranging from 0.125 to 1.
51 / 63
Monte-Carlo Simulations
Methodology
Then we evaluate the numerical Hessians \tilde D_1(\hat θ_s), \tilde D_2(\hat θ_s), and \tilde D_3(\hat θ_s).
The values of the numerical Hessians are stored, and we compute the root mean-squared error of the evaluated Hessians across Monte-Carlo samples:
  \mathrm{RMSE}(\tilde D_i) = \sqrt{ \frac{1}{S} \sum_{s=1}^{S} \left( \tilde D_i(\hat θ_s) - D(0) \right)^2 }.
The results are reported by showing the dependence of RMSE(\tilde D_i) on the sample size n_s.
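The Monte-Carlo loop, scaled down for illustration (S = 100 replications, n_s = 500, the second-order formula only, C = 2 and q = 0.25); \hat θ_s is found by bisection since M_n is monotone:

```python
import math
import random

random.seed(5)
alpha, S, ns = 2.5, 100, 500
C, q = 2.0, 0.25
D0 = (alpha - 1) / (4 * math.sqrt(math.pi)) * 2 ** (alpha / 2) \
     * math.gamma((alpha - 1) / 2)
eps = C * ns ** -q

sq_errs = []
for _ in range(S):
    x = [random.gauss(0.0, 1.0) for _ in range(ns)]
    M = lambda t: sum(((xi <= t) - 0.5) * abs(xi - t) ** (alpha - 1)
                      for xi in x) / ns
    lo, hi = -3.0, 3.0
    for _ in range(30):                 # bisection: M is nondecreasing
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if M(mid) < 0 else (lo, mid)
    theta_s = (lo + hi) / 2
    D2_s = (M(theta_s + eps) - M(theta_s - eps)) / (2 * eps)
    sq_errs.append((D2_s - D0) ** 2)

rmse = math.sqrt(sum(sq_errs) / S)
print(rmse)
```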
52 / 63
Monte-Carlo Simulations
Results
α = 2.5: smooth objective function. Decreasing the rate of the step size and increasing the order of the numerical derivative both decrease the mean-squared error of the derivative evaluation.
α = 1: least smooth. The optimal step size rate is close to q = 0.2. Increasing the order of numerical differentiation yields only a small increase in the precision of the derivative evaluation.
α = 1.5: intermediate case. While an increase in the order of the numerical derivative decreases the mean-squared error, the mean-squared error first decreases as q increases and then starts to increase again.
53 / 63
Conclusion
Results
Sufficient conditions for consistency of numerical derivatives.
The problem is related to, but different from, the bias-variance tradeoff.
In cases with smooth objective functions, the step size can decrease at an arbitrarily fast rate.
Gradient-based numerical optimization techniques can be applied even with a nonsmooth sample objective, provided that the step size is selected properly.
The rate of convergence of functions containing numerical derivatives restricts the order of the numerical differentiation operator.
Monte-Carlo analysis illustrates the theoretical step size conditions.
54 / 63
Simulations
[Figure: RMSE of the 1st-order derivative formula vs. sample size (500-4000), alpha = 2.5; eight curves, one per step-size setting]
55 / 63
Simulations
[Figure: RMSE of the 1st-order derivative formula vs. sample size (500-4000), alpha = 1.5; eight curves, one per step-size setting]
56 / 63
Simulations
[Figure: RMSE of the 1st-order derivative formula vs. sample size (500-4000), alpha = 1; eight curves, one per step-size setting]
57 / 63
Simulations
[Figure: RMSE of the 2nd-order derivative formula vs. sample size (500-4000), alpha = 2.5; eight curves, one per step-size setting]
58 / 63
Simulations
[Figure: RMSE of the 2nd-order derivative formula vs. sample size (500-4000), alpha = 1.5; eight curves, one per step-size setting]
59 / 63
Simulations
[Figure: RMSE of the 2nd-order derivative formula vs. sample size (500-4000), alpha = 1; eight curves, one per step-size setting]
60 / 63
Simulations
[Figure: RMSE of the 3rd-order derivative formula vs. sample size (500-4000), alpha = 2.5; eight curves, one per step-size setting]
61 / 63
Simulations
[Figure: RMSE of the 3rd-order derivative formula vs. sample size (500-4000), alpha = 1.5; eight curves, one per step-size setting]
62 / 63
Simulations
[Figure: RMSE of the 3rd-order derivative formula vs. sample size (500-4000), alpha = 1; eight curves, one per step-size setting]
63 / 63