TRANSCRIPT
Statistical Properties of Numerical Derivatives
Han Hong, Aprajit Mahajan, and Denis Nekipelov
Stanford University and UC Berkeley
November 2010
1 / 63
Introduction
Motivation
Many models have objective functions that are computationally intensive to evaluate.
Sample objective functions might not have well-behaved analytic derivatives.
Common-sense solution: use numerical derivatives (Judd (1998)).
Statistical problem:
Extra nuisance parameter (the step size of numerical differentiation).
Not clear whether results for estimators with nuisance parameters (e.g. a bandwidth) translate directly to the case of numerical derivatives.
Can we obtain good estimators by solving numerical first-order conditions?
Can we provide relatively weak conditions on the step size if we are only interested in the value of the derivative?
2 / 63
Introduction
Motivation
Example (see Judd (1998)): need to compute the gradient of g, approximated with step size h.
First-order (one-sided) approximation:
  L^h_{1,1} g(x) = \frac{g(x+h) - g(x)}{h} = g'(x) + O(h)
or second-order (central) approximation:
  L^h_{1,2} g(x) = \frac{g(x+h) - g(x-h)}{2h} = g'(x) + O(h^2)
More generally, higher-order differentiation formulas.
Bias and variance tradeoffs.
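The two formulas above can be checked directly. A minimal sketch (assuming g = exp, so the true derivative is known exactly; not from the slides): the O(h) versus O(h²) error orders show up in the printed errors.

```python
import math

def forward_diff(g, x, h):
    # One-sided formula L^h_{1,1}: error O(h)
    return (g(x + h) - g(x)) / h

def central_diff(g, x, h):
    # Symmetric formula L^h_{1,2}: error O(h^2)
    return (g(x + h) - g(x - h)) / (2 * h)

# g = exp, so g'(1) = e exactly
for h in (1e-1, 1e-2, 1e-3):
    e1 = abs(forward_diff(math.exp, 1.0, h) - math.e)
    e2 = abs(central_diff(math.exp, 1.0, h) - math.e)
    print(f"h={h:.0e}  forward error={e1:.2e}  central error={e2:.2e}")
```

Shrinking h by a factor of 10 cuts the forward error by about 10 and the central error by about 100, until rounding error takes over.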
3 / 63
Introduction
Motivation
Conclusion: careful analysis of the properties of the function and application of higher-order numerical derivative techniques reduce the error by orders of magnitude.
The case where g is estimated is more complicated:
\hat g(x+h) and \hat g(x-h) are correlated.
Consistency of derivatives depends on the choice of approximating formula and step size.
Results depend on tradeoffs between (a) quality of the numerical approximation, (b) choice of step size, (c) smoothness of the population objective, (d) empirical process properties of \hat g.
4 / 63
Introduction
Literature background
Anderssen and Bloomfield (1974): differentiation of functions measured with noise.
Newey and McFadden (1994): sufficient conditions for convergence of the numerical Hessian.
Newey (1994), Newey and McFadden (1994): numerical Hessian in semiparametric models.
L'Ecuyer and Perron (1994): convergence rate and asymptotics for basic finite-difference formulas for smooth functions.
Powell (1984): censored LAD.
Buchinsky and Hahn (1998): alternative censored QR.
Simulation estimation of nonsmooth models (Pakes and Pollard (1989)).
Maximum score and smoothing (Manski (1975), Horowitz (1991)).
5 / 63
Introduction
What we find
Find convergence rates and statistical properties of numerical derivatives of possibly non-smooth objective functions.
While in some cases the choice of the step of numerical differentiation is based on a bias-variance tradeoff, in smooth classes of models there is no such tradeoff.
First-order conditions using numerical derivatives can still be used even if the sample objective function is not smooth (e.g. maximum score), provided that the step size is selected correctly.
Smoothness of the objective function influences the precise convergence rate of the derivative estimator and its limit distribution.
Requirements on the rate of convergence of the numerical derivative restrict the order of the numerical differentiation operator.
6 / 63
Introduction
Outline of the talk
Consistency of numerical partial derivatives in parametric and semiparametric models.
Sufficient conditions for consistency
Rates of convergence and distribution theory
M-estimators based on numerical first-order conditions.
Consistency of estimators
Convergence rate and optimal choice of the step of numerical differentiation
Monte-Carlo evidence.
7 / 63
Introduction
Numerical derivative: definition
Consider a function f(·) with p continuous derivatives.
We consider a class of linear operators L defined by a step size ε > 0, order m ∈ N, and weights a_k ∈ [-1, 1] and nodes t_k ∈ [-1, 1] for k = 1, …, K such that
  L f(x_0) = \frac{1}{ε^m} \sum_{k=1}^{K} a_k f(x_0 + ε t_k).
We call L^ε_{m,q} the m-th numerical derivative operator of order q ≤ p - m + 1 if
  L^ε_{m,q} g(x_0) = g^{(m)}(x_0) + o(ε^q).
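The operator L translates directly into code. A minimal sketch (the name `numerical_derivative` and the sin demo are illustrative, not from the slides): pass the weights a_k and nodes t_k and apply the definition literally.

```python
import math

def numerical_derivative(f, x0, eps, weights, nodes, m=1):
    # L f(x0) = eps^(-m) * sum_k a_k * f(x0 + eps * t_k)
    return sum(a * f(x0 + eps * t) for a, t in zip(weights, nodes)) / eps ** m

# Central first derivative (m = 1, order q = 2): a = (-1/2, 1/2), t = (-1, 1)
d = numerical_derivative(math.sin, 0.0, 1e-4, (-0.5, 0.5), (-1.0, 1.0))
print(d)  # close to cos(0) = 1
```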
8 / 63
Introduction
Numerical derivative: structure
To obtain the weights, use a Taylor expansion for small ε:
  f(x_0 + ε t_k) = \sum_{i=0}^{p} \frac{f^{(i)}(x_0)}{i!} ε^i t_k^i + o(ε^p).
Solve the system of equations
  \frac{1}{i!} \sum_{k=1}^{K} a_k t_k^i = δ_{i,m},
with boundary conditions t_1 = -1 and t_K = 1, where δ_{i,m} is the Kronecker symbol.
Example:
  \tilde D_3(θ) = L^{ε_n}_{1,3} M_n(θ) = \frac{M_n(θ - 2ε_n) - 8 M_n(θ - ε_n) + 8 M_n(θ + ε_n) - M_n(θ + 2ε_n)}{12 ε_n}.
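The Kronecker-delta system can be verified for the four-point example (here written with integer offsets c_k ∈ {-2, -1, 1, 2} and exact rational weights rather than the slide's normalized t_k ∈ [-1, 1]); since the only nonzero right-hand side is at i = m = 1, where 1! = 1, checking Σ_k a_k c_k^i suffices.

```python
from fractions import Fraction as F

# Four-point first-derivative formula: sum_k a_k f(x + c_k h) / h
nodes = (-2, -1, 1, 2)
weights = (F(1, 12), F(-8, 12), F(8, 12), F(-1, 12))

# sum_k a_k c_k^i must equal delta_{i,1} for i = 0..4
moments = [sum(a * c ** i for a, c in zip(weights, nodes)) for i in range(5)]
print(moments)  # [0, 1, 0, 0, 0]: exact for polynomials up to degree 4
```

Exactness through degree 4 is what delivers the O(ε⁴) error of this formula.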
9 / 63
Consistency of numerical derivatives
Statement of the problem
Parameter vectors (θ, h(·)):
θ: Euclidean components.
h(·): infinite-dimensional components.
Moment condition
  m(z; θ_0, h_0(·)) = E[ρ(x; θ_0, h_0(·)) | z] = 0,
where ρ(·) can be non-smooth, e.g. Ai and Chen (2003), Chen and Pouzo (2009).
An analytic expression might not exist for the derivative w.r.t. θ, the directional derivative w.r.t. h(·), or both.
Remark: even if ρ(·) is linear, there might be no analytic expression for the derivative, e.g.
  ρ(x; θ_0, h_0(·)) = h_0(z) - α h_0(x) - f(x, z; θ).
10 / 63
Consistency of numerical derivatives
Statement of the problem
The conditional moment might not be pointwise differentiable.
The conditional moment is "sufficiently smooth" in mean square:
  E\left[ \left\| m(z; θ, h(·)) - Δ_{1θ}(θ - θ_0) - Δ_{1h}[δ] - \cdots - \sum_{p_1+p_2=p} Δ_{p,θ^{p_1},h^{p_2}} [δ]^{p_1} (θ - θ_0)^{p_2} \right\|^2 \right] = o\left( \|δ\|_2^{2ν} + \|θ - θ_0\|^{2ν} \right).
We can still use a numerical derivative operator, for example,
  \hat D_{w_j}(z) = \frac{\hat m(z; \hat θ + e_j ε_n, \hat h(·)) - \hat m(z; \hat θ - e_j ε_n, \hat h(·))}{2 ε_n} - \frac{\hat m(z; \hat θ, \hat h(·) + τ_n w_j(·)) - \hat m(z; \hat θ, \hat h(·) - τ_n w_j(·))}{2 τ_n}.
11 / 63
Consistency of numerical derivatives
Statement of the problem
Need to compute numerically, for each w_j,
  D_{w_j}(z) = \frac{∂ m(z; θ, h(·))}{∂ θ_j} - \frac{∂ m(z; θ, h(·))}{∂ h}[w_j].
The asymptotic variance depends on w_j^*, which solves
  \min_{w_j ∈ W} D_{w_j}(z)' Σ_0(z)^{-1} D_{w_j}(z),
where Σ_0(z) = E[ρ(x; θ_0, h_0(·)) ρ(x; θ_0, h_0(·))' | z].
Use \hat h(·), \hat θ, and the numerically evaluated w^* to estimate the asymptotic variance.
12 / 63
Consistency of numerical derivatives
Numerical derivative: problem
This structure extends to the case where f is a functional and X is a functional space, by taking directional derivatives.
In particular, consider the first derivative of order q:
  \hat D_{w_j^*}(z) = L^{ε_n, θ_j}_{1,q} \hat m(z; \hat θ, \hat h(·)) - L^{τ_n, w_j}_{1,q} \hat m(z; \hat θ, \hat h(·)).
Question 1: how to choose τ_n and ε_n as n → ∞ for a derivative of given order q?
Sufficient condition on the step sizes of the partial and directional derivatives such that
  \hat D_{w_j^*}(z) \xrightarrow{p} D_{w_j^*}(z).
13 / 63
Consistency of numerical derivatives
Numerical derivative: solution
The conditional moment might not be pointwise differentiable but has continuous L_2 derivatives.
High-level assumption for convergence of the conditional expectation estimator:
  \sup_{(θ, h(·)) ∈ U_γ} \frac{ n^{1/k} \| \hat m(z; θ, h(·)) - m(z; θ, h(·)) - \hat m(z; θ_0, h_0(·)) \| }{ 1 + n^{1/k} \| \hat m(z; θ, h(·)) \| + n^{1/k} \| m(z; θ, h(·)) \| } = o_p(1),
for any sequence of shrinking neighborhoods U_γ as γ → 0.
More precise conditions can be given for a particular estimator of m(·).
Nonparametric rate for \hat h(·), for some k_1:
  n^{1/k_1} \| \hat h(·) - h_0(·) \| = O_p(1), \quad n^{1/2} \| \hat θ - θ_0 \| = O_p(1).
14 / 63
Consistency of numerical derivatives
Numerical derivative: solution
Theorem (1)
Under the provided assumptions, if
  ε_n n^{1/\max\{k, k_1\}} → ∞, ε_n → 0,
and
  τ_n n^{1/\max\{k, k_1\}} → ∞, τ_n → 0,
then
  \sup_{z ∈ Z} | \hat D_w(z) - D_w(z) | \xrightarrow{p} 0.
15 / 63
Consistency of numerical derivatives
Numerical derivative: discussion
This result closely follows existing results in Newey and McFadden (1994) and Powell (1984).
The provided conditions imply a rate slower than 1/√n for ε_n in the parametric case.
In the semiparametric case, a stronger condition is needed to control for the slow nonparametric rate.
Such sufficient conditions are too strong.
Sharper results can be derived: even for nonsmooth models, we only need nε → ∞ in the parametric case.
16 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
Can derive sharper results if we know more detail about the moment function.
Moment vector
  \hat g(θ) = \frac{1}{n} \sum_{i=1}^{n} g(Z_i, θ),
and the estimator \hat θ equates it to zero.
Need to estimate G = ∂g(θ_0)/∂θ using L^{ε_n}_{1,p} \hat g(\hat θ).
Decompose L^{ε_n}_{1,p} \hat g(\hat θ) - G = G_1 + G_2 + G_3:
  G_1 = L^{ε_n}_{1,p} \hat g(\hat θ) - L^{ε_n}_{1,p} g(\hat θ),
  G_2 = L^{ε_n}_{1,p} g(\hat θ) - G(\hat θ),
  G_3 = G(\hat θ) - G.
G_3 = O_p(\hat θ - θ_0), and it has no relation to ε_n.
17 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
Assumption (1)
A (2p+1)-th order mean value expansion applies to the limiting function g(θ) uniformly in a neighborhood of θ_0. For all ε sufficiently small and r = 2p + 1,
  \sup_{θ ∈ N(θ_0)} \left| g(θ + ε) - \sum_{l=0}^{r} \frac{ε^l}{l!} g^{(l)}(θ) \right| = O(ε^{r+1}).
The bias term G_2(\hat θ) can be controlled if the bias reduction is uniformly small in a neighborhood of θ_0.
An immediate consequence of this assumption is that
  G_2(\hat θ) = O(ε^{2p}).
18 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
The weakest possible condition to control G_1(\hat θ) that covers all the models that we are aware of seems to come from a convergence rate result in Pollard (1984).
Assumption (2)
Define F = \{g(·, θ) : θ ∈ Θ\}. The graphs of the functions in F have polynomial discrimination.
Most functions in econometric applications fall into this category. By Lemmas 25 and 36 of Pollard (1984), Assumption 2 implies that there exist universal constants A > 0 and V > 0 such that for any F_n ⊂ F with envelope function F_n,
  \sup_Q N_1(ε\, Q F_n, Q, F_n) ≤ A ε^{-V}, \quad \sup_Q N_2\left( ε (Q F_n^2)^{1/2}, Q, F_n \right) ≤ A ε^{-V}.
19 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
Lemma (2)
For a neighborhood N(θ_0) around θ_0, suppose \|F\| = \sup_{θ ∈ N(θ_0)} |g(Z_i, θ)| < ∞, and for all ε small enough,
  \sup_{θ ∈ N(θ_0)} E\left[ (g(Z_i, θ + ε) - g(Z_i, θ - ε))^2 \right] = O(ε). (1)
Then under Assumption 2, if n ε_n / \log n → ∞,
  \sup_{d(θ, θ_0) = o(1)} \| L^{ε_n}_{1,p} \hat g(θ) - L^{ε_n}_{1,p} g(θ) \| = o_p(1).
Consequently, Assumption 2 implies that G_1(\hat θ) = o_p(1) if d(\hat θ, θ_0) = o_p(1).
20 / 63
Consistency of numerical derivatives
Theorem (2)
Under Assumptions 1 and 2 and the conditions of Lemma 2, L^{ε_n}_{1,p} \hat g(\hat θ) \xrightarrow{p} G(θ_0) if ε_n → 0, n ε_n / \log n → ∞, and d(\hat θ, θ_0) = o_p(1).
This is the weakest sufficient condition that we are able to come up with without making stronger assumptions. The case of an indicator function in g(Z_i, θ) typically corresponds to γ = 1/2. However, this condition can be improved if we are willing to impose the following stronger assumption, which holds for smoother functions such as those that are Hölder-continuous.
21 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
Assumption (3)
The moment condition is mean square differentiable. Define G(θ) = \frac{1}{\sqrt n} \sum_{i=1}^{n} (g(Z_i, θ) - g(θ)). There exists some ε > 0 such that, for all δ sufficiently small,
  E^* \sup_{d(θ_1,θ_2)<δ,\; d(θ_1,θ_0)<ε,\; d(θ_2,θ_0)<ε} |G(θ_1) - G(θ_2)| \lesssim φ_n(δ),
for functions φ_n(·) such that δ ↦ φ_n(δ)/δ^γ is decreasing for some γ > 0 and γ ≤ 1.
22 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
This assumption allows us to put an envelope directly on G_1.
This assumption is more stringent than Theorem 3.2.5 in van der Vaart and Wellner, and may fail in cases where Theorem 3.2.5 holds, for example with indicator functions.
Define the class of functions M^ε_δ = \{ g(Z_i, θ_1) - g(Z_i, θ_2) : d(θ_1, θ_2) ≤ δ,\; d(θ_1, θ_0) < ε,\; d(θ_2, θ_0) < ε \}.
Assumption 3, which requires bounding E^*_P \|G_n\|_{M^ε_δ}, can be verified by invoking the maximal inequalities in Theorems 2.14.1 and 2.14.2 of van der Vaart and Wellner.
23 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
These tail bounds provide that, for M^ε_δ an envelope function of the class of functions M^ε_δ,
  E^*_P \|G_n\|_{M^ε_δ} \lesssim J(1, M^ε_δ) \left( P^* (M^ε_δ)^2 \right)^{1/2},
  E^*_P \|G_n\|_{M^ε_δ} \lesssim J_{[\,]}(1, M^ε_δ, L_2(P)) \left( P^* (M^ε_δ)^2 \right)^{1/2},
where J(1, M^ε_δ) and J_{[\,]}(1, M^ε_δ, L_2(P)) are the uniform and bracketing entropy integrals.
The following result shows that for smooth functions g(Z_i, θ) that are Lipschitz in θ, the only condition needed for consistency is ε_n → 0.
Theorem
Under Assumptions 1 and 3, L^{ε_n}_{1,p} \hat g(\hat θ) \xrightarrow{p} G(θ_0) if ε_n → 0, n ε^{2-2γ} → ∞, and d(\hat θ, θ_0) = o_p(1).
24 / 63
Consistency of numerical derivatives
Numerical derivative: rate of convergence
Theorem
Assume that
  E^* \sup_{d(θ_1,θ_2)<δ,\; d(θ_1,θ_0)<ε,\; d(θ_2,θ_0)<ε} |G(θ_1) - G(θ_2)| \lesssim φ_n(δ ∧ ε),
where δ ↦ φ(δ)/δ^γ is decreasing for some 0 < γ < 1, and that \hat θ - θ_0 = O_p(n^{-η}) for η > 0. Provided that η ≥ \frac{1}{1-γ+2ν}, the best rate of convergence of L^ε_{1,p} \hat g(\hat θ) to G is achieved when ε_n = O\left( n^{-1/(1-γ+2ν)} \right), in which case
  \| L^ε_{1,p} \hat g(\hat θ) - G \|^2 = O_p\left( n^{-\frac{2ν}{1-γ+2ν}} \right).
25 / 63
Consistency of numerical derivatives
Numerical derivative: parametric case
If γ < 1 and the numerical differentiation operator guarantees residual order ν ≥ 1, we can have the parametric rate of convergence η = 1/2.
In smooth models (γ = 1) there is no bias-variance tradeoff ⇒ a smaller step size ε leads to a smaller bias.
The order of the root MSE in the smooth case is bounded below by a variance term of O(1/√n) for sufficiently small ε_n.
26 / 63
Consistency of numerical derivatives
Example
Simple example: m(z_i; θ) = 1(z_i ≤ θ) - τ.
Numerical derivative:
  \frac{1}{n} \sum_{i=1}^{n} \frac{1(z_i ≤ \hat θ + ε) - 1(z_i ≤ \hat θ - ε)}{2ε} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{ε}\, U\left( \frac{z_i - θ_0 - (\hat θ - θ_0)}{ε} \right).
The consistency conditions in Powell (1984) and Newey and McFadden (1994), both of which require √n ε → ∞, are too strong.
This is not necessary: \hat f(x) is uniformly consistent for f(x).
Only need n ε / \log n → ∞.
The second part of the noise, due to \hat θ - θ_0 entering through (\hat θ - θ_0)/ε, will vanish:
  \hat θ - θ_0 \xrightarrow{p} 0 ⇒ \hat f(\hat θ - θ_0) \xrightarrow{p} f(0) = f_z(θ_0).
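This example can be simulated. A minimal sketch (standard normal z_i, so f_z(θ_0) = φ(0) at θ_0 = 0 is known; the step ε = n^{-1/2} is chosen so that √n ε stays bounded, violating the √n ε → ∞ condition, while nε/log n → ∞ still holds):

```python
import math
import random

random.seed(0)
n = 200_000
z = [random.gauss(0.0, 1.0) for _ in range(n)]

def numeric_density(sample, theta, eps):
    # (1/n) sum_i [1(z_i <= theta + eps) - 1(z_i <= theta - eps)] / (2 eps)
    count = sum(theta - eps < zi <= theta + eps for zi in sample)
    return count / (len(sample) * 2 * eps)

eps = n ** -0.5                      # sqrt(n)*eps = 1, but n*eps/log(n) -> infinity
est = numeric_density(z, 0.0, eps)
true = 1 / math.sqrt(2 * math.pi)    # phi(0) = 0.3989...
print(est, true)
```

Despite failing the stronger √n ε → ∞ requirement, the estimate still converges to φ(0).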
27 / 63
Consistency of numerical derivatives
Numerical derivative: the sieve infinite dimensional case
Infinite-dimensional θ.
Compute the directional derivative G_h = \left. \frac{d\, m(θ_0, η_0 + τ h, x)}{dτ} \right|_{τ=0} numerically:
  L^{ε_n, h}_{1,p} \hat m(θ, η, x) = \frac{1}{ε_n} \sum_k a_k \hat m(θ, η + t_k h ε_n, x).
Two methods to estimate the conditional expectation: sieve (series) and kernel.
Series:
  \hat m(θ, η, z) = p^N(z)' \left( \frac{1}{n} \sum_{i=1}^{n} p^N(z_i) p^N(z_i)' \right)^{-1} \frac{1}{n} \sum_{i=1}^{n} p^N(z_i) ρ(θ, η; y_i),
and kernel:
  \hat m(θ, η, z) = \left( \frac{1}{n b_n^{d_z}} \sum_{i=1}^{n} K\left( \frac{z_i - z}{b_n} \right) \right)^{-1} \frac{1}{n b_n^{d_z}} \sum_{i=1}^{n} K\left( \frac{z_i - z}{b_n} \right) ρ(θ, η; y_i).
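The series version of \hat m can be sketched in a few lines. Everything here is illustrative (a univariate z, a power-series basis with N = 4 terms, and a made-up ρ with E[ρ | z] = z²); the code just forms the normal equations from the displayed formula:

```python
import random

random.seed(4)
n = 1000
z = [random.uniform(-1, 1) for _ in range(n)]
# Hypothetical residual function rho with E[rho | z] = z^2
rho = [zi ** 2 + random.gauss(0, 0.1) for zi in z]

N = 4                                   # number of sieve terms
def pN(v):                              # power-series basis p^N(z)
    return [v ** j for j in range(N)]

# Normal equations for m_hat(z) = p(z)' (sum p p' / n)^(-1) (sum p rho / n)
A = [[sum(pN(zi)[j] * pN(zi)[k] for zi in z) / n for k in range(N)]
     for j in range(N)]
b = [sum(pN(zi)[j] * ri for zi, ri in zip(z, rho)) / n for j in range(N)]

def solve(A, b):
    # Tiny Gauss-Jordan elimination with partial pivoting
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(N):
        p = max(range(i, N), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(N):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [a - f * c for a, c in zip(M[r], M[i])]
    return [M[i][N] / M[i][i] for i in range(N)]

beta = solve(A, b)
def m_hat(v):
    return sum(bj * pj for bj, pj in zip(beta, pN(v)))

print(m_hat(0.5))  # close to E[rho | z = 0.5] = 0.25
```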
28 / 63
Consistency of numerical derivatives
Assumption
For the basis functions p^N(z) the following holds:
(i) The smallest eigenvalue of E[p^N(Z_i) p^N(Z_i)'] is bounded away from zero uniformly in N.
(ii) For some C > 0, \sup_{z ∈ Z} \|p^N(z)\| ≤ C < ∞.
(iii) The population conditional moment belongs to the completion of the sieve space and
  \sup_{(θ,η) ∈ Θ×H} \sup_{z ∈ Z} \left\| m(θ, η, z) - \mathrm{proj}\left( m(θ, η, z) \,|\, p^N(z) \right) \right\| = O(N^{-α}).
29 / 63
Consistency of numerical derivatives
Assumption
(i) Uniformly bounded moment functions: \sup_{θ,η} \|ρ(θ, η, ·)\| ≤ C. The density of the covariates Z is bounded away from zero on its support.
(ii) Suppose that 0 ∈ H_n and, for ε_n → 0 and some C > 0,
  \sup_{z ∈ Z,\; η, w ∈ H_n,\; |η|, |w| < C,\; θ ∈ N(θ_0)} \mathrm{Var}\left( ρ(θ, η + ε_n w; Y_i) - ρ(θ, η - ε_n w; Y_i) \,|\, z \right) = O(ε_n).
(iii) For each n, the class of functions F_n = \{ ρ(θ, η + ε_n w; ·) - ρ(θ, η - ε_n w; ·) : θ ∈ Θ,\; η, w ∈ H_n \} is Euclidean with coefficients that depend on the number of sieve terms. In other words, there exist constants A and 0 ≤ r_0 < \frac{1}{2} such that the covering number satisfies
  \log N(δ, F_n, L_1) ≤ A\, n^{2 r_0} \log\left( \frac{1}{δ} \right).
30 / 63
Consistency of numerical derivatives
Lemma
Suppose that ρ(π_n η, η) = O_p(n^{-φ}). Under Assumptions 4 and 5,
  \sup_{d(θ,θ_0)=o(1),\; d(η,η_0)=o(1),\; η ∈ H_n} \left| L^{ε_n,w}_{1,p} \hat m(θ, η, z) - L^{ε_n,w}_{1,p} m(θ, η, z) \right| = o_p(1)
uniformly in z and w, provided that ε_n → 0, ε_n \min\{N^α, n^φ\} → ∞, and \frac{n ε_n}{N^2 n^{2 r_0} \log n} → ∞.
31 / 63
Consistency of numerical derivatives
Assumption
K(·) is an m-th order kernel function that is an element of the class of functions F defined in Assumption 2. It integrates to 1, it is bounded, and its square has a finite integral.
Lemma
Under Assumptions 5 and 6,
  \sup_{d(θ,θ_0)=o(1),\; d(η,η_0)=o(1),\; η ∈ H_n} \left| L^{ε_n,w}_{1,p} \hat m(θ, η, z) - L^{ε_n,w}_{1,p} m(θ, η, z) \right| = o_p(1)
uniformly in w and z where f(z) is strictly positive for the kernel estimator, provided that ε_n → 0, b_n → 0, ε_n \min\{b_n^{-N}, n^φ\} → ∞, and \frac{n ε_n b_n^{d_z}}{n^{2 r_0} \log n} → ∞.
32 / 63
Consistency of numerical derivatives
Theorem
Under Assumptions 1, 5, and either 4 or 6,
  L^{ε_n,w}_{1,p} \hat m(\hat θ, \hat η, z) \xrightarrow{p} \frac{∂ m(θ, η, z)}{∂η}[w],
uniformly in z and w, if N → ∞, ε_n \min\{N^α, n^φ\} → ∞, and \frac{n ε_n}{N^2 n^{2 r_0} \log n} → ∞ for the series estimator, and b_n → 0, ε_n \min\{b_n^{-N}, n^φ\} → ∞, and \frac{n ε_n b_n^{d_z}}{n^{2 r_0} \log n} → ∞ for the kernel-based estimator, provided that d(\hat θ, θ_0) = o_p(1) and d(\hat η, η_0) = o_p(1).
33 / 63
Consistency of numerical derivatives
Holder-continuous moment functions
Assumption (5(i′))
For any sufficiently small ε > 0,
  \sup_{(θ,η) ∈ Θ×H,\; w ∈ H,\; |w| < C} \|ρ(θ, η + wε, z) - ρ(θ, η - wε, z)\| ≤ C(z)\, ε^γ,
where 0 < γ ≤ 1 and E[C(Z)^2] < ∞.
34 / 63
Consistency of numerical derivatives
Lemma
Suppose that ρ(π_n η, η) = O(n^{-φ}). Under either Assumptions 4 and 5(i′),(ii),(iii),(iv), or Assumptions 6 and 5(i′),(ii),(iii),(iv),
  \sup_{d(θ,θ_0)=o_p(1),\; d(η,η_0)=o_p(1),\; η ∈ H_n} \left| L^{ε_n,w}_{1,p} \hat m(θ, η, z) - L^{ε_n,w}_{1,p} m(θ_0, η_0, z) \right| = o_p(1)
uniformly in z and w, provided that ε_n → 0, ε_n \min\{N^α, n^φ\} → ∞, and \frac{\sqrt n\, ε_n^{1-γ}}{N n^{r_0}} → ∞ for the series estimator, and b_n → 0, ε_n \min\{b_n^{-q}, n^φ\} → ∞, and \frac{n^{1-2 r_0} ε_n^{2-2γ} b_n^{d_z}}{\log n} → ∞ for the kernel estimator.
35 / 63
Gradient-based estimation with numerical derivatives
Definitions
Consider a form of Z-estimator.
Estimate the parameter θ_0 in a metric space (Θ, d).
The first-order condition is hard to compute analytically.
The sample objective function is
  \hat Q_n(θ) = \frac{1}{n} \sum_{i=1}^{n} g(z_i, θ).
The estimator \hat θ_n ∈ Θ solves the empirical first-order condition:
  \| \hat D_n(\hat θ_n) \| = \| L^{ε_n}_{1,p} \hat Q_n(\hat θ_n) \| = o_p\left( \frac{1}{\sqrt n} \right).
36 / 63
Gradient-based estimation with numerical derivatives
Assumptions
An identification condition:
Assumption
The map D : Θ → R^k defined by D(θ) = \frac{∂}{∂θ} E[g(z_i, θ)] is identified at θ_0 ∈ Θ. In other words, \lim_{n→∞} \|D(θ_n)\| = 0 implies \lim_{n→∞} \|θ_n - θ_0\| = 0 for any sequence θ_n ∈ Θ.
A "continuity in probability" condition:
Assumption
The parameter space Θ has a compact cover. For each n, there exists a countable subset T_n ⊂ Θ such that
  P^*\left( \sup_{θ ∈ Θ} \inf_{θ' ∈ T_n} \| g(Z_i, θ) - g(Z_i, θ') \|^2 > 0 \right) = 0.
37 / 63
Gradient-based estimation with numerical derivatives
Consistency
Functions with absolutely bounded finite differences:
Theorem
Under Assumptions 7, 8, 1, and 2, as long as ε_n → 0 and \frac{n ε_n}{\log n} → ∞,
  \sup_{θ ∈ Θ} \| L^{ε_n}_{1,p} \hat Q(θ) - G(θ) \| = o_p(1).
Consequently, \hat θ \xrightarrow{p} θ_0 if \| L^{ε_n}_{1,p} \hat Q(\hat θ) \| = o_p(1).
The Hölder-continuous case is also similar to variance estimation.
38 / 63
Gradient-based estimation with numerical derivatives
Rate of convergence
Lemma 1 also provides the rates of convergence: for \hat θ \xrightarrow{p} θ_0,
  \sup_{θ ∈ N(θ_0)} \sqrt{\frac{n ε_n}{\log n}}\, \| L^{ε_n}_{1,p} \hat Q(θ) - L^{ε_n}_{1,p} Q(θ) \| = O_p(1).
Initial parameter rate of convergence:
Lemma
Suppose \hat θ \xrightarrow{p} θ_0 and L^ε_{1,p} \hat Q(\hat θ) = o_p\left( \frac{1}{\sqrt{n ε_n}} \right). Under the assumptions of Theorem 10, if n ε_n / \log n → ∞ and n ε^{1+4p} / \log n = O(1), then
  \sqrt{\frac{n ε_n}{\log n}}\, d(\hat θ, θ_0) = O_{P^*}(1).
39 / 63
Gradient-based estimation with numerical derivatives
Rate of convergence
Lemma
Under the conditions of Theorem 10, with n ε_n^{1+4p} = o(1) and \frac{n ε_n^3}{\log n} → ∞, we have
  \sup_{d(θ, θ_0) = O\left( \sqrt{\log n / (n ε_n)} \right)} \left( L^{ε_n}_{1,p} \hat Q(θ) - L^{ε_n}_{1,p} \hat Q(θ_0) - L^{ε_n}_{1,p} Q(θ) + L^{ε_n}_{1,p} Q(θ_0) \right) = o_p\left( \frac{1}{\sqrt{n ε_n}} \right).
Theorem
Suppose \hat θ \xrightarrow{p} θ_0 and L^ε_{1,p} \hat Q(\hat θ) = o_p\left( \frac{1}{\sqrt{n ε_n}} \right). Under the assumptions of Theorem 10, if n ε_n^3 / \log n → ∞ and \sqrt n\, ε^{1+p} = O(1), then
  \sqrt{n ε_n}\, d(\hat θ, θ_0) = O_{P^*}(1).
40 / 63
Gradient-based estimation with numerical derivatives
Distribution
Assumption
A CLT holds: as n → ∞ and ε_n → 0,
  \frac{G_n(θ_0 + ε_n) - G_n(θ_0 - ε_n)}{\sqrt{ε_n}} \xrightarrow{d} N(0, Ω).
Theorem
Assume that the conditions of Theorem 10 hold, but with n ε_n^{1+4p} = o(1). In addition, suppose that the Hessian matrix H(θ) of g(θ) is continuous, nonsingular and finite at θ_0. Then, if Assumption 9 holds and n ε_n^3 → ∞,
  \sqrt{n ε_n}\, (\hat θ - θ_0) \xrightarrow{d} N\left( 0, H(θ_0)^{-1} Ω H(θ_0)^{-1} \right).
41 / 63
Gradient-based estimation with numerical derivatives
Functions with polynomial envelopes for finite differences
Theorem
Suppose \hat θ \xrightarrow{p} θ_0 and L^ε_{1,p} \hat Q(\hat θ) = o_p\left( \frac{1}{\sqrt{n ε_n^{1-γ}}} \right). Under Assumptions 3 and 8, if n ε_n^{2-2γ} → ∞ and \sqrt n\, ε_n^{1-γ+2p} = O(1), and if the Hessian matrix H(θ) of g(θ) is continuous, nonsingular and finite at θ_0, then
  \sqrt{n ε_n^{1-γ}}\, d(\hat θ, θ_0) = O_{P^*}(1).
Theorem
Assume that the conditions of Theorem 15 hold, but with \sqrt n\, ε_n^{1+2p-γ} = o(1). If \lim_{ε→0} ε^{2-2γ}\, \mathrm{Var}\left( L^ε_{1,p} g(Z_i, θ_0) \right) = Ω and \sqrt n\, ε_n^{2-γ} → ∞, then
  \sqrt{n ε_n^{1-γ}}\, (\hat θ - θ_0) \xrightarrow{d} N\left( 0, H(θ_0)^{-1} Ω H(θ_0)^{-1} \right).
42 / 63
Gradient-based estimation with numerical derivatives
Stronger assumptions for smooth models
Proposition
Suppose the conditions of Theorem 16 hold except \sqrt n\, ε_n^{2-γ} → ∞. Suppose further that g(z_i, θ) is mean square differentiable in a neighborhood of θ_0: there are measurable functions D(·,·) : Z × Θ → R^p such that
  E\left[ g(Z, θ_1) - g(Z, θ_2) - (θ_2 - θ_1)' D(Z, θ_1) \right]^2 = o\left( \|θ_1 - θ_2\|^2 \right),
and E \|D(Z, θ_1)\|^2 < ∞ for all θ_1, θ_2 ∈ N_{θ_0}. Define q_ε(z_i, θ) = L^ε_{1,p} g(z_i; θ) - D(z_i, θ), and assume that
  \sup_{d(θ, θ_0) = o(1),\; ε = o(1)} \left[ G q_ε(z_i, θ) - G q_ε(z_i, θ_0) \right] = o_p(1),
and that D(z_i, θ) is Donsker on \{ d(θ, θ_0) ≤ δ \}. Then the conclusion of Theorem 16 holds.
Example: quantile regression.
43 / 63
Gradient-based estimation with numerical derivatives
Example
Our theory does not require smoothness of the moment function, and applies to non-smooth objective functions.
Example: maximum score.
Simple case: one regressor with normalized coefficient. Objective function
  Q_n(θ) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \frac{1}{2} \right) 1\{ x_i + θ > 0 \}.
Numerical first-order condition: set the numerical gradient equal to zero,
  L^{ε_n}_{1,2} Q_n(θ) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \frac{1}{2} \right) \frac{1}{ε_n}\, U\left( \frac{x_i + θ}{ε_n} \right) = 0,
where U(·) is the uniform kernel.
44 / 63
Gradient-based estimation with numerical derivatives
Example
Comparison with smoothed maximum score.
Use the integrated uniform kernel
  K(z) = \frac{1}{2} \left( 1\{z > -1\} + z\, 1\{z ∈ [-1, 1]\} + 1\{z > 1\} \right).
Smoothed objective function
  Q^s_n(θ) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \frac{1}{2} \right) K\left( \frac{x_i + θ}{h_n} \right),
with bandwidth h_n.
First-order condition:
  \frac{∂}{∂θ} Q^s_n(θ) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \frac{1}{2} \right) \frac{1}{h_n}\, U\left( \frac{x_i + θ}{h_n} \right) = 0.
This is identical to equating the numerical gradient to zero!
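The identity can be confirmed on simulated data. A minimal sketch (the DGP, with P(y = 1 | x) = 0.8 when x + θ0 > 0 and 0.2 otherwise, is invented for illustration); the uniform-kernel indicator is written half-open so that the two expressions match exactly:

```python
import random

random.seed(1)
n = 500
theta0 = 0.5
x = [random.uniform(-2, 2) for _ in range(n)]
y = [1 if random.random() < (0.8 if xi + theta0 > 0 else 0.2) else 0 for xi in x]

def Q(theta):
    # Maximum score objective: (1/n) sum (y_i - 1/2) 1{x_i + theta > 0}
    return sum((yi - 0.5) * (xi + theta > 0) for xi, yi in zip(x, y)) / n

def num_grad(theta, eps):
    # Central numerical derivative of the max-score objective
    return (Q(theta + eps) - Q(theta - eps)) / (2 * eps)

def smoothed_foc(theta, h):
    # Smoothed max score FOC: uniform kernel U(u) = 0.5 on (-1, 1]
    return sum((yi - 0.5) * (-1.0 < (xi + theta) / h <= 1.0) * 0.5 / h
               for xi, yi in zip(x, y)) / n

print(num_grad(0.1, 0.2), smoothed_foc(0.1, 0.2))  # agree up to rounding
```

With step size equal to the bandwidth, the two quantities are the same sum, term by term.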
45 / 63
Gradient-based estimation with numerical derivatives
Example
Both the numerical gradient-based procedure and smoothed maximum score have the same convergence rate \sqrt{n ε_n}.
The step size parameter is equivalent to the bandwidth.
If one uses any gradient-search method with maximum-score-type objective functions, the asymptotic distribution of Kim and Pollard (1990) is not applicable.
46 / 63
Monte-Carlo Simulations
Setup
Analyze minimizers of
  Q_n(θ) = \frac{1}{n} \sum_{i=1}^{n} |x_i - θ|^α.
In population,
  Q(θ) = E[ |x_i - θ|^α ].
x_i ∼ N(0, 1).
The estimator solves
  M_n(\hat θ) = \frac{1}{n} \sum_{i=1}^{n} \left( 1\{ x_i ≤ \hat θ \} - \frac{1}{2} \right) |x_i - \hat θ|^{α-1} = 0.
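A minimal sketch of this estimator (α = 1.5, θ0 = 0; bisection is legitimate because each summand, hence M_n, is nondecreasing in θ):

```python
import random

random.seed(2)
alpha = 1.5
x = [random.gauss(0.0, 1.0) for _ in range(2000)]

def M(theta):
    # Sample first-order condition M_n(theta)
    return sum(((xi <= theta) - 0.5) * abs(xi - theta) ** (alpha - 1)
               for xi in x) / len(x)

def solve_root(lo=-3.0, hi=3.0, tol=1e-8):
    # M_n is nondecreasing in theta, so bisection finds its zero
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if M(mid) < 0 else (lo, mid)
    return (lo + hi) / 2

theta_hat = solve_root()
print(theta_hat)  # close to theta_0 = 0
```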
47 / 63
Monte-Carlo Simulations
Problem
The solution is well-defined if α ≥ 1; α = 1 gives LAD.
Look at the numerical derivative of M_n(θ) (the Hessian of the sample objective function).
Set θ_0 = 0.
The population Hessian can be computed analytically:
  D(0) = 1\{α = 1\}\, φ(0) + 1\{α > 1\}\, \frac{α - 1}{4 \sqrt π}\, 2^{α/2}\, Γ\left( \frac{α - 1}{2} \right).
Compare L^ε_{1,p} M_n(\hat θ) with D(0).
48 / 63
Monte-Carlo Simulations
Methodology
First-order formula:
  \tilde D_1(θ) = L^{ε_n}_{1,1} M_n(θ) = \frac{M_n(θ + ε_n) - M_n(θ)}{ε_n}.
Second-order formula:
  \tilde D_2(θ) = L^{ε_n}_{1,2} M_n(θ) = \frac{M_n(θ + ε_n) - M_n(θ - ε_n)}{2 ε_n}.
Third-order formula:
  \tilde D_3(θ) = L^{ε_n}_{1,3} M_n(θ) = \frac{M_n(θ - 2ε_n) - 8 M_n(θ - ε_n) + 8 M_n(θ + ε_n) - M_n(θ + 2ε_n)}{12 ε_n}.
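The three formulas can be applied to M_n in the smooth case α = 2.5 at θ0 = 0 and compared against the analytic D(0); the step ε = 2 n^{-1/4} is one of the rates used below:

```python
import math
import random

random.seed(3)
alpha = 2.5
x = [random.gauss(0.0, 1.0) for _ in range(4000)]
n = len(x)

def M(t):
    return sum(((xi <= t) - 0.5) * abs(xi - t) ** (alpha - 1) for xi in x) / n

def D1(t, e):   # first-order (one-sided) formula
    return (M(t + e) - M(t)) / e

def D2(t, e):   # second-order (central) formula
    return (M(t + e) - M(t - e)) / (2 * e)

def D3(t, e):   # third-order (four-point) formula
    return (M(t - 2 * e) - 8 * M(t - e) + 8 * M(t + e) - M(t + 2 * e)) / (12 * e)

# Analytic population Hessian at theta_0 = 0 for alpha > 1
D0 = (alpha - 1) / (4 * math.sqrt(math.pi)) * 2 ** (alpha / 2) \
     * math.gamma((alpha - 1) / 2)

e = 2 * n ** -0.25
print(D0, D1(0.0, e), D2(0.0, e), D3(0.0, e))
```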
49 / 63
Monte-Carlo Simulations
Methodology
Choose α = 2.5, 1.5 and 1.
Numerical and analytic derivatives coincide when α = 2.
Use different rates for step size of numerical differentiation.
Use different order formulas.
50 / 63
Monte-Carlo Simulations
Methodology
We generate 1000 Monte-Carlo samples with the number of observations ranging from 500 to 4000.
For each sample s we find the estimate \hat θ_s by solving
  M_{n_s}(\hat θ_s) = 0.
We choose a sample-adaptive step of numerical differentiation ε = C (n_s)^{-q}, with C = 2 and q ranging from 0.125 to 1.
51 / 63
Monte-Carlo Simulations
Methodology
Then we evaluate the numerical Hessians \tilde D_1(\hat θ_s), \tilde D_2(\hat θ_s), and \tilde D_3(\hat θ_s).
The values of the numerical Hessians are stored, and we compute the root mean-squared error of the evaluated Hessians across Monte-Carlo samples:
  \mathrm{RMSE}(\tilde D_i) = \sqrt{ \frac{1}{S} \sum_{s=1}^{S} \left( \tilde D_i(\hat θ_s) - D(0) \right)^2 }.
The results are reported by showing the dependence of RMSE(\tilde D_i) on the sample size n_s.
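The Monte-Carlo loop, scaled down for illustration (S = 100 replications, n_s = 500, the second-order formula only, C = 2 and q = 0.25); \hat θ_s is found by bisection since M_n is monotone:

```python
import math
import random

random.seed(5)
alpha, S, ns = 2.5, 100, 500
C, q = 2.0, 0.25
D0 = (alpha - 1) / (4 * math.sqrt(math.pi)) * 2 ** (alpha / 2) \
     * math.gamma((alpha - 1) / 2)
eps = C * ns ** -q

sq_errs = []
for _ in range(S):
    x = [random.gauss(0.0, 1.0) for _ in range(ns)]
    M = lambda t: sum(((xi <= t) - 0.5) * abs(xi - t) ** (alpha - 1)
                      for xi in x) / ns
    lo, hi = -3.0, 3.0
    for _ in range(30):                 # bisection: M is nondecreasing
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if M(mid) < 0 else (lo, mid)
    theta_s = (lo + hi) / 2
    D2_s = (M(theta_s + eps) - M(theta_s - eps)) / (2 * eps)
    sq_errs.append((D2_s - D0) ** 2)

rmse = math.sqrt(sum(sq_errs) / S)
print(rmse)
```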
52 / 63
Monte-Carlo Simulations
Results
α = 2.5: smooth objective function. Decreasing the rate of the step size and increasing the order of the numerical derivative both decrease the mean-squared error of the derivative evaluation.
α = 1: least smooth. The optimal step size rate is close to q = 0.2. Increasing the order of numerical differentiation yields only a small increase in the precision of the derivative evaluation.
α = 1.5: intermediate case. While an increase in the order of the numerical derivative decreases the mean-squared error, the mean-squared error first decreases as q increases and then starts to increase again.
53 / 63
Conclusion
Results
Sufficient conditions for consistency of numerical derivatives.
The problem is related to, but different from, the bias-variance tradeoff.
In cases with smooth objective functions, the step size can decrease at an arbitrarily fast rate.
Gradient-based numerical optimization techniques can be applied even with a nonsmooth sample objective, provided that the step size is selected properly.
The rate of convergence of functions containing numerical derivatives restricts the order of the numerical differentiation operator.
Monte-Carlo analysis illustrates the theoretical step size conditions.
54 / 63
Simulations
[Figure: RMSE of the 1st-order derivative formula vs. sample size (500-4000), alpha = 2.5; eight curves, one per step-size setting]
55 / 63
Simulations
[Figure: RMSE of the 1st-order derivative formula vs. sample size (500-4000), alpha = 1.5; eight curves, one per step-size setting]
56 / 63
Simulations
[Figure: RMSE of the 1st-order derivative formula vs. sample size (500-4000), alpha = 1; eight curves, one per step-size setting]
57 / 63
Simulations
[Figure: RMSE of the 2nd-order derivative formula vs. sample size (500-4000), alpha = 2.5; eight curves, one per step-size setting]
58 / 63
Simulations
[Figure: RMSE of the 2nd-order derivative formula vs. sample size (500-4000), alpha = 1.5; eight curves, one per step-size setting]
59 / 63
Simulations
[Figure: RMSE of the 2nd-order derivative formula vs. sample size (500-4000), alpha = 1; eight curves, one per step-size setting]
60 / 63
Simulations
[Figure: RMSE of the 3rd-order derivative formula vs. sample size (500-4000), alpha = 2.5; eight curves, one per step-size setting]
61 / 63
Simulations
[Figure: RMSE of the 3rd-order derivative formula vs. sample size (500-4000), alpha = 1.5; eight curves, one per step-size setting]
62 / 63
Simulations
[Figure: RMSE of the 3rd-order derivative formula vs. sample size (500-4000), alpha = 1; eight curves, one per step-size setting]
63 / 63