
  • Fast rates of convergence for learning problems

    Prof. John Duchi

  • Outline

    I. Mean estimation and uniform laws

    II. Convexity

    III. Growth conditions and fast rates

  • Best rates from uniform laws

    What do our uniform laws give us?

    $$\hat h = \operatorname*{argmin}_{h \in \mathcal{H}} \Big\{ \hat L_n(h) = \frac{1}{n} \sum_{i=1}^n \ell(h; X_i) \Big\}$$

    and

    $$\sup_{h \in \mathcal{H}} \big| \hat L_n(h) - L(h) \big| \lesssim \frac{1}{\sqrt{n}}$$

    with high probability

    Best this can be?

    $$L(\hat h) - L(h^\star) = \hat L_n(\hat h) - \hat L_n(h^\star) + L(\hat h) - \hat L_n(\hat h) + \hat L_n(h^\star) - L(h^\star)$$
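Spelling out the rate this decomposition yields (a worked step using only the displays above): the first term is nonpositive because $\hat h$ minimizes $\hat L_n$, and each remaining term is bounded by the uniform deviation.

```latex
L(\hat h) - L(h^\star)
  \;\le\; 0 \;+\; 2\sup_{h \in \mathcal{H}} \big|\hat L_n(h) - L(h)\big|
  \;\lesssim\; \frac{1}{\sqrt{n}}
  \qquad \text{with high probability,}
```

so the uniform-law approach by itself cannot beat the $1/\sqrt{n}$ rate.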

  • The best rate from this approach

    Central limit theorems: Consider the third error term involving $h^\star$:

    $$\sqrt{n}\Big( L(h^\star) - \hat L_n(h^\star) \Big) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \big[ L(h^\star) - \ell(h^\star; X_i) \big] \stackrel{d}{\to} \mathsf{N}\big(0, \mathrm{Var}(\ell(h^\star; X))\big)$$

    Is this right?
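A quick simulation of this CLT (an illustrative sketch; the Gaussian data and the squared loss from the next slide are my choices, with $\theta^\star = 0$ so $L(\theta^\star) = \tfrac{1}{2}$):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 2_000, 5_000
X = rng.normal(0.0, 1.0, size=(trials, n))   # theta* = E[X] = 0

loss = 0.5 * X**2                            # ell(theta*; X_i) = 0.5*(theta* - X_i)^2
L_star = 0.5                                 # L(theta*) = 0.5*Var(X)
Z = np.sqrt(n) * (L_star - loss.mean(axis=1))

# The empirical std of Z should match sqrt(Var(ell(theta*; X)))
print("std of sqrt(n)*(L - L_n_hat):", Z.std())
print("sqrt(Var(ell)):              ", loss.std())
```

Both printed values come out near $\sqrt{1/2} \approx 0.707$, matching the asserted limiting variance.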

  • Estimating a mean

    Goal: We want to estimate $\theta^\star = \mathbb{E}[X]$, using the loss

    $$\ell(\theta; x) = \frac{1}{2}(\theta - x)^2$$

    with risk

    $$L(\theta) = \frac{1}{2}\mathbb{E}\big[(\theta - X)^2\big] = \frac{1}{2}\mathbb{E}\big[(\theta - \mathbb{E}[X] + \mathbb{E}[X] - X)^2\big] = \frac{1}{2}(\theta - \mathbb{E}[X])^2 + \frac{1}{2}\mathrm{Var}(X).$$

    Gap in risks: Subtracting, we have

    $$L(\theta) - L(\theta^\star) = \frac{1}{2}(\theta - \theta^\star)^2 + \frac{1}{2}\mathrm{Var}(X) - \frac{1}{2}\mathrm{Var}(X) = \frac{1}{2}(\theta - \theta^\star)^2$$
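A numerical sanity check of this slide (a sketch; the Gaussian distribution and sample sizes are illustrative): the excess risk $\tfrac{1}{2}(\hat\theta_n - \theta^\star)^2$ of the sample mean shrinks at the fast rate $1/n$, not $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0  # true mean (illustrative choice)

for n in [100, 1_000, 10_000]:
    trials = 2_000
    X = rng.normal(loc=theta_star, scale=1.0, size=(trials, n))
    theta_hat = X.mean(axis=1)          # ERM for the squared loss is the sample mean
    excess = 0.5 * (theta_hat - theta_star) ** 2
    print(f"n={n:6d}  mean excess risk={excess.mean():.2e}  (1/(2n)={0.5/n:.2e})")
```

Since $\mathbb{E}[(\hat\theta_n - \theta^\star)^2] = \mathrm{Var}(X)/n$, the printed excess risk tracks $1/(2n)$.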

  • Estimating a sub-Gaussian mean

    Let $X_i$ be independent $\sigma^2$-sub-Gaussian, so that

    $$\hat\theta_n = \frac{1}{n}\sum_{i=1}^n X_i = \operatorname*{argmin}_\theta \hat L_n(\theta)$$

    and for $t \ge 0$ we have

    $$\mathbb{P}\big(|\hat\theta_n - \theta^\star| \ge t\big) \le 2\exp\Big({-\frac{nt^2}{2\sigma^2}}\Big)$$

    Lemma. With probability at least $1 - \delta$, we have

    $$L(\hat\theta_n) - L(\theta^\star) = \frac{1}{2}(\hat\theta_n - \theta^\star)^2 \le \frac{C\sigma^2 \log\frac{1}{\delta}}{n}.$$
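The lemma follows by inverting the tail bound; a one-line derivation filling in the step the slide leaves implicit (the constant $C$ absorbs the factor 2 inside the logarithm):

```latex
\text{Set } t = \sigma\sqrt{\tfrac{2\log(2/\delta)}{n}}:\qquad
\mathbb{P}\Big(|\hat\theta_n - \theta^\star| \ge \sigma\sqrt{\tfrac{2\log(2/\delta)}{n}}\Big) \le \delta,
\quad\text{so w.p. } \ge 1 - \delta,\quad
\tfrac{1}{2}(\hat\theta_n - \theta^\star)^2 \le \frac{\sigma^2\log\tfrac{2}{\delta}}{n}.
```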

  • Uniform law for means?

    $$\sup_{\theta \in \mathbb{R}} \Big\{ \hat L_n(\theta) - L(\theta) \Big\} = +\infty$$
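Why the supremum is infinite (a short computation from the definitions above): the quadratic terms cancel, leaving a function that is affine in $\theta$.

```latex
\hat L_n(\theta) - L(\theta)
  = \frac{1}{2n}\sum_{i=1}^n (\theta - X_i)^2 - \frac{1}{2}\mathbb{E}\big[(\theta - X)^2\big]
  = \big(\mathbb{E}[X] - \bar X_n\big)\,\theta
    + \frac{1}{2}\Big(\frac{1}{n}\sum_{i=1}^n X_i^2 - \mathbb{E}[X^2]\Big),
```

which is unbounded in $\theta$ whenever $\bar X_n \neq \mathbb{E}[X]$, i.e. with probability one for continuous distributions.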

  • Convexity: heuristic and graphical explanation


  • Convexity: definitions

    Definition. A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if

    $$f(\lambda u + (1 - \lambda) v) \le \lambda f(u) + (1 - \lambda) f(v)$$

    for all $u, v \in \mathbb{R}^d$ and $\lambda \in [0, 1]$

  • Basic properties

    A few properties of convex functions

    - If $f : \mathbb{R} \to \mathbb{R}$ is twice differentiable, then $f$ is convex if and only if $f''(t) \ge 0$

    - If $f : \mathbb{R}^m \to \mathbb{R}$ is convex and $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, then $g(x) := f(Ax + b)$ is convex

    - If $f_1, f_2$ are convex, then $f_1 + f_2$ is convex

  • Examples

    Example (logistic loss): $\varphi(t) = \log(1 + e^{-t})$ and $\ell(\theta; x, y) = \log(1 + e^{-y x^T \theta}) = \varphi(y x^T \theta)$

    Example: any norm $\|\cdot\|$.

    Example ($\ell_1$-regularized linear regression):

    $$\frac{1}{2n}\|X\theta - y\|_2^2 + \lambda \|\theta\|_1.$$
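A small numerical check (illustrative, not from the slides) that the logistic loss satisfies the convexity definition along random segments; this is exactly the composition-plus-definition reasoning of the previous slide:

```python
import numpy as np

def logistic_loss(theta, x, y):
    """Logistic loss ell(theta; x, y) = log(1 + exp(-y * x^T theta))."""
    return np.log1p(np.exp(-y * (x @ theta)))

rng = np.random.default_rng(1)
d = 5
x, y = rng.normal(size=d), 1.0

# Check f(lam*u + (1-lam)*v) <= lam*f(u) + (1-lam)*f(v) on random segments
for _ in range(1000):
    u, v = rng.normal(size=d), rng.normal(size=d)
    lam = rng.uniform()
    lhs = logistic_loss(lam * u + (1 - lam) * v, x, y)
    rhs = lam * logistic_loss(u, x, y) + (1 - lam) * logistic_loss(v, x, y)
    assert lhs <= rhs + 1e-12, "convexity inequality violated"
print("convexity inequality held on all sampled segments")
```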

  • Convex functions have no local minima

    Theorem. Let $B = \{\theta : \|\theta\| \le 1\}$, $S = \{\theta : \|\theta\| = 1\}$, and suppose $f$ is convex and satisfies

    $$f(\theta) \ge f(\theta^\star) \quad \text{for } \theta \in \theta^\star + \epsilon S.$$

    For $\theta \notin \theta^\star + \epsilon B$, define

    $$\theta_\epsilon := \frac{\epsilon}{\|\theta - \theta^\star\|}\,\theta + \Big(1 - \frac{\epsilon}{\|\theta - \theta^\star\|}\Big)\,\theta^\star.$$

    Then

    $$f(\theta) - f(\theta^\star) \ge \frac{\|\theta - \theta^\star\|}{\epsilon}\,\big[f(\theta_\epsilon) - f(\theta^\star)\big]$$

  • Proof of theorem

    Note that $\theta_\epsilon \in \theta^\star + \epsilon S$, so for $t = \frac{\epsilon}{\|\theta - \theta^\star\|} \le 1$,

    $$\theta_\epsilon = \frac{\epsilon}{\|\theta - \theta^\star\|}\,\theta + \Big(1 - \frac{\epsilon}{\|\theta - \theta^\star\|}\Big)\,\theta^\star = t\theta + (1 - t)\theta^\star,$$

    and

    $$f(\theta_\epsilon) \le t f(\theta) + (1 - t) f(\theta^\star).$$

    Rearranging gives $f(\theta) - f(\theta^\star) \ge \frac{1}{t}\big[f(\theta_\epsilon) - f(\theta^\star)\big]$, which is the claim.

  • Convex loss functions

    Suppose that we use a convex loss, i.e. $\ell(\theta; X)$ is convex in $\theta$. Then

    $$\hat L_n(\theta) > \hat L_n(\theta^\star) \quad \text{for all } \theta \in \theta^\star + \epsilon S$$

    implies that

    $$\hat\theta_n = \operatorname*{argmin}_\theta \hat L_n(\theta) \quad \text{satisfies} \quad \|\hat\theta_n - \theta^\star\| \le \epsilon,$$

    by the preceding theorem applied to $f = \hat L_n$: any point outside the ball $\theta^\star + \epsilon B$ has empirical loss at least that of some point on the sphere, hence exceeding $\hat L_n(\theta^\star)$.

  • A picture of how we achieve fast rates


  • Growth and smoothness

    Let us fix some radius $r > 0$, and assume

    $$\text{[Growth]} \quad L(\theta) \ge L(\theta^\star) + \frac{\lambda}{2}\|\theta - \theta^\star\|^2 \quad \text{for } \|\theta - \theta^\star\| \le r$$

    and that

    $$\text{[Smoothness]} \quad \ell(\cdot\,; x) \text{ is } M\text{-Lipschitz on } \{\theta : \|\theta - \theta^\star\| \le r\}$$

    Example (linear regression with bounded $x$); a sketch follows below.
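One way to fill in the example, under assumptions the slide leaves implicit ($\theta^\star$ is the risk minimizer, $\|x\|_2 \le b$, and $|y|$ bounded):

```latex
% \ell(\theta; x, y) = \tfrac{1}{2}(x^T\theta - y)^2.
% Growth: since \mathbb{E}[x(x^T\theta^\star - y)] = 0 at the minimizer,
L(\theta) - L(\theta^\star)
  = \tfrac{1}{2}(\theta - \theta^\star)^T \mathbb{E}[xx^T]\,(\theta - \theta^\star)
  \ge \tfrac{\lambda_{\min}(\mathbb{E}[xx^T])}{2}\,\|\theta - \theta^\star\|_2^2,
% so [Growth] holds with \lambda = \lambda_{\min}(\mathbb{E}[xx^T]).
% Smoothness: \|\nabla_\theta \ell(\theta; x, y)\|_2 = \|x\|_2\,|x^T\theta - y|
%   \le b\,\big(b(\|\theta^\star\|_2 + r) + |y|\big) =: M
%   on \{\theta : \|\theta - \theta^\star\| \le r\}, so [Smoothness] holds.
```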

  • Fast rates under growth conditions

    Theorem. Let the conditions on growth and smoothness hold, define

    $$\Theta_\epsilon := \{\theta \in \Theta \mid \|\theta - \theta^\star\| \le \epsilon\}$$

    and the localized Rademacher complexity

    $$\mathcal{R}_n(\Theta_\epsilon) := \mathbb{E}\Bigg[ \sup_{\theta \in \Theta_\epsilon} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \big[\ell(\theta; X_i) - \ell(\theta^\star; X_i)\big] \Bigg],$$

    where the $\varepsilon_i \in \{\pm 1\}$ are i.i.d. random signs. Fix $t \ge 0$ and choose any $\epsilon \le r$ such that

    $$\frac{\lambda \epsilon^2}{2} \ge 2\mathcal{R}_n(\Theta_\epsilon) + \sqrt{2}\,M t \epsilon.$$

    Then

    $$\mathbb{P}\big( \|\hat\theta - \theta^\star\| \ge \epsilon \big) \le 2e^{-nt^2}$$
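The localized complexity can be estimated by Monte Carlo; a minimal sketch for a concrete instance (squared loss for one-dimensional mean estimation, an illustrative choice not on the slide), averaging the supremum over fresh sign draws:

```python
import numpy as np

rng = np.random.default_rng(2)

def localized_rademacher(X, theta_star, eps, num_sign_draws=500, grid_size=201):
    """Monte Carlo estimate of R_n(Theta_eps) for ell(theta; x) = 0.5*(theta - x)^2,
    with Theta_eps = [theta* - eps, theta* + eps] discretized on a grid."""
    n = len(X)
    thetas = np.linspace(theta_star - eps, theta_star + eps, grid_size)
    # loss differences ell(theta; X_i) - ell(theta*; X_i), shape (grid, n)
    diffs = 0.5 * (thetas[:, None] - X) ** 2 - 0.5 * (theta_star - X) ** 2
    sups = []
    for _ in range(num_sign_draws):
        signs = rng.choice([-1.0, 1.0], size=n)   # Rademacher variables
        sups.append((diffs @ signs).max() / n)    # sup over the theta-grid
    return float(np.mean(sups))

n, theta_star = 1000, 0.0
X = rng.normal(theta_star, 1.0, size=n)
for eps in [0.1, 0.2, 0.4]:
    Rn = localized_rademacher(X, theta_star, eps)
    print(f"eps={eps:.1f}  R_n ~= {Rn:.4f}  (compare eps/sqrt(n) = {eps/np.sqrt(n):.4f})")
```

The estimates scale roughly linearly in $\epsilon$ and like $1/\sqrt{n}$ in $n$, which is the behavior the next slides exploit.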

  • Bounding the local Rademacher complexity

    Under the conditions of the theorem, for $\Theta \subset \mathbb{R}^d$ we have $\mathcal{R}_n(\Theta_\epsilon) \lesssim M\epsilon\sqrt{d/n}$, so the theorem yields

    $$\|\hat\theta - \theta^\star\| \le \frac{CM}{\lambda}\bigg(\sqrt{\frac{d}{n}} + t\bigg) \quad \text{w.p.} \ge 1 - 2e^{-nt^2}.$$

  • Bounding the local Rademacher complexity

    Under the conditions of the theorem, for $\Theta \subset \mathbb{R}^d$ we have

    $$L(\hat\theta) - L(\theta^\star) \le O(\text{stuff}) \cdot \frac{\log\frac{1}{\delta}}{n} \quad \text{w.p.} \ge 1 - \delta.$$
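Unpacking the $O(\text{stuff})$ factor (a derivation sketch consistent with the theorem and the previous corollary; constants are generic): take $t^2 = \log(2/\delta)/n$, set $\epsilon = \frac{CM}{\lambda}(\sqrt{d/n} + t)$, and on the event $\|\hat\theta - \theta^\star\| \le \epsilon$ bound the excess risk over $\Theta_\epsilon$:

```latex
L(\hat\theta) - L(\theta^\star)
  \le \sup_{\theta \in \Theta_\epsilon}
      \big| \hat L_n(\theta) - L(\theta) - (\hat L_n(\theta^\star) - L(\theta^\star)) \big|
  \le 2\mathcal{R}_n(\Theta_\epsilon) + \sqrt{2}\,M t \epsilon
  \lesssim \frac{M^2}{\lambda}\cdot\frac{d + \log\frac{1}{\delta}}{n},
```

so "stuff" is of order $M^2\big(d + \log\frac{1}{\delta}\big)/\lambda$, with the dimension dependence folded into the constant the slide hides.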

  • Multiclass classification

    Suppose we have the multiclass logistic loss for $\theta = [\theta_1 \cdots \theta_k]$, $\theta_l \in \mathbb{R}^d$, $y \in \{1, \ldots, k\}$, $\|x\|_2 \le M$:

    $$\ell(\theta; x, y) = \log\Bigg( \sum_{l=1}^k \exp\big( x^T(\theta_l - \theta_y) \big) \Bigg).$$

    Then

    $$\mathcal{R}_n(\Theta_\epsilon) \lesssim \frac{M\sqrt{dk}}{\sqrt{n}}\,\epsilon$$
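A numerically stable implementation of this loss (a sketch; scipy's `logsumexp` handles the exponentials, and the variable names are mine):

```python
import numpy as np
from scipy.special import logsumexp

def multiclass_logistic_loss(theta, x, y):
    """ell(theta; x, y) = log(sum_l exp(x^T (theta_l - theta_y))) for theta of shape (d, k).

    Equals the usual cross-entropy -log softmax_y(x^T theta), since subtracting
    x^T theta_y inside every exponent is the same as dividing by exp(x^T theta_y).
    """
    scores = x @ theta                   # shape (k,): x^T theta_l for each class l
    return logsumexp(scores - scores[y])

rng = np.random.default_rng(3)
d, k = 4, 3
theta = rng.normal(size=(d, k))
x = rng.normal(size=d)
x /= np.linalg.norm(x)                   # enforce the bound ||x||_2 <= M with M = 1
print(multiclass_logistic_loss(theta, x, y=1))
```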

  • Proof of theorem

    Part 1: Consider the event that $\hat L_n(\theta) \le \hat L_n(\theta^\star)$ for some $\theta \in \Theta_\epsilon$ with $\|\theta - \theta^\star\| = \epsilon$. By the growth condition, $L(\theta) - L(\theta^\star) \ge \frac{\lambda}{2}\epsilon^2$ for such $\theta$, so this event implies

    $$\Big( \hat L_n(\theta) - L(\theta) \Big) - \Big( \hat L_n(\theta^\star) - L(\theta^\star) \Big) \le -\frac{\lambda}{2}\epsilon^2$$

  • Proof of theorem

    Part 2: Consider the localized excess risk for $\theta \in \Theta_\epsilon$,

    $$\frac{1}{n}\sum_{i=1}^n \big[ (\ell(\theta; X_i) - \ell(\theta^\star; X_i)) - (L(\theta) - L(\theta^\star)) \big],$$

    and get (for all $t \ge 0$)

    $$\mathbb{P}\bigg( \sup_{\theta \in \Theta_\epsilon} \Big| \hat L_n(\theta) - L(\theta) - \big(\hat L_n(\theta^\star) - L(\theta^\star)\big) \Big| \ge 2\mathcal{R}_n(\Theta_\epsilon) + t \bigg) \le 2\exp\Big({-\frac{nt^2}{2M^2\epsilon^2}}\Big)$$

  • Proof of theorem

    Part 3: Implications:

    $$\|\hat\theta - \theta^\star\| \ge \epsilon$$

    $$\Rightarrow\; \hat L_n(\theta) - \hat L_n(\theta^\star) \le 0 \;\text{ for some } \theta \in \Theta_\epsilon$$

    $$\Rightarrow\; \sup_{\theta \in \Theta_\epsilon} \Big| \hat L_n(\theta) - L(\theta) - \big(\hat L_n(\theta^\star) - L(\theta^\star)\big) \Big| \ge \frac{\lambda\epsilon^2}{2}$$

    $$\Rightarrow\; \sup_{\theta \in \Theta_\epsilon} \Big| \hat L_n(\theta) - L(\theta) - \big(\hat L_n(\theta^\star) - L(\theta^\star)\big) \Big| \ge 2\mathcal{R}_n(\Theta_\epsilon) + \sqrt{2}\,M t \epsilon,$$

    and Part 2 (with deviation $\sqrt{2}\,M t \epsilon$) bounds the probability of this last event by $2e^{-nt^2}$, proving the theorem.

  • Reading and bibliography

    1. P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005

    2. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996 (Ch. 3.2–3.4)

    3. S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005 (§5.3)

    4. A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009 (Ch. 5.3)