
Generalized linear models and logistic regression

Comments (Oct. 15, 2019)

• Assignment 1 almost marked
• Any questions about the problems on this assignment?

• Assignment 2 due soon
• Any questions?

2

Summary so far

• From chapters 1 and 2, obtained the tools needed to talk about the uncertainty/noise underlying machine learning
  • capture uncertainty about data/observations using probabilities
  • formalize the estimation problem for distributions
• Identify variables x_1, …, x_d
  • e.g. observed features, observed targets
• Pick the desired distribution
  • e.g. p(x_1, …, x_d) or p(x_1 | x_2, …, x_d) (conditional distribution)
  • e.g. p(x_i) is Poisson or p(y | x_1, …, x_d) is Gaussian
• Perform parameter estimation for the chosen distribution
  • e.g., estimate lambda for the Poisson
  • e.g., estimate mu and sigma for the Gaussian

3

Summary so far (2)

• For prediction problems, which make up much of machine learning, we first discuss
  • the types of data we get (i.e., features and types of targets)

• goal to minimize expected cost of incorrect predictions

• Concluded optimal prediction functions use p(y | x) or E[Y | x]

• From there, our goal becomes to estimate p(y|x) or E[Y | x]

• Starting from this general problem specification, it is useful to use our parameter estimation techniques to solve this problem
  • e.g., specify Y = Xw + noise and estimate mu = xw

4

Summary so far (3)

• For the linear regression setting, we modeled p(y|x) as a Gaussian with mu = <x,w> and a constant sigma

• Performed maximum likelihood to get weights w

• Possible question: why all this machinery to get to linear regression?
  • one answer: it makes our assumptions about uncertainty more clear

• another answer: it will make it easier to generalize p(y | x) to other distributions (which we will do with GLMs)

5

Estimation approaches for Linear regression

• Recall we estimated w for p(y | x) as a Gaussian

• We discussed the closed form solution

• and using batch or stochastic gradient descent

• Exercise: Now imagine you have 10 new data points. How do we get a new w, that incorporates these data points?

6

$$w = (X^\top X)^{-1} X^\top y$$

$$w_{t+1} = w_t - \eta\, X^\top (X w_t - y)$$

$$w_{t+1} = w_t - \eta_t\, x_t^\top (x_t w_t - y_t)$$
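Below is a minimal numpy sketch (not from the slides) of the three estimators above: the closed form, batch gradient descent, and stochastic gradient descent. The synthetic data, step sizes, and iteration counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Closed form: w = (X^T X)^{-1} X^T y (solve is preferred over an explicit inverse).
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: w <- w - eta * X^T (X w - y).
w_batch = np.zeros(d)
eta = 0.001
for _ in range(2000):
    w_batch -= eta * X.T @ (X @ w_batch - y)

# Stochastic gradient descent: one randomly chosen sample (x_t, y_t) per update.
w_sgd = np.zeros(d)
for t in range(2000):
    i = rng.integers(n)
    w_sgd -= 0.01 / (1 + t / 100) * X[i] * (X[i] @ w_sgd - y[i])
```

One way to think about the exercise: the stochastic update can simply continue from the current w on the 10 new points, whereas the closed form would have to be recomputed (or updated incrementally).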

Exercise: MAP for Poisson

• Recall we estimated lambda for a Poisson p(x)
  • Had a dataset of scalars {x1, …, xn}

• For MLE, found the closed form solution lambda = average of xi

• Can we use gradient descent for this optimization? And if so, should we?

7

Exercise: Predicting the number of accidents

• In Assignment 1, learned p(y) as Poisson, where Y is the number of accidents in a factory

• How would the question from Assignment 1 change if we also wanted to condition on features?
  • For example, we want to model the number of accidents in the factory, given x1 = size of the factory and x2 = number of employees

• What is p(y | x)? What are the parameters?

8

Poisson regression

9

$$p(y \mid x) = \text{Poisson}(y \mid \lambda = \exp(x^\top w))$$

1. $E[Y \mid x] = \exp(w^\top x)$

2. $p(y \mid x) = \text{Poisson}(\lambda)$
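As a small illustration (not part of the slides), here is a numpy sketch of the Poisson regression prediction and its negative log-likelihood; the weight vector and feature values below are made-up placeholders.

```python
import numpy as np

def poisson_nll(w, X, y):
    """Negative log-likelihood of Poisson regression with lambda_i = exp(x_i^T w),
    dropping the constant log(y_i!) term."""
    eta = X @ w                      # natural parameter theta_i = x_i^T w
    return np.sum(np.exp(eta) - y * eta)

# Predicted expected count for a factory described by (bias, size, #employees).
w = np.array([0.1, 0.02, 0.3])       # hypothetical weights
x = np.array([1.0, 5.0, 2.0])        # hypothetical features
expected_accidents = np.exp(x @ w)   # E[Y | x] = exp(x^T w)
```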

Exponential Family Distributions

10

$$p(y \mid \theta) = \exp(\theta y - a(\theta) + b(y))$$

The transfer $f$ corresponds to the derivative of the log-normalizer function $a$.

$$\theta = x^\top w$$

We will always linearly predict the natural parameter.

Useful property: $\dfrac{da(\theta)}{d\theta} = E[Y]$

Examples

• Gaussian distribution: $a(\theta) = \tfrac{1}{2}\theta^2$, $\quad f(\theta) = \theta$

• Poisson distribution: $a(\theta) = \exp(\theta)$, $\quad f(\theta) = \exp(\theta)$

• Bernoulli distribution: $a(\theta) = \ln(1 + \exp(\theta))$, $\quad f(\theta) = \dfrac{1}{1 + \exp(-\theta)}$ (the sigmoid)

where $p(y \mid \theta) = \exp(\theta y - a(\theta) + b(y))$ and $\theta = x^\top w$.

11
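As a worked check on the Bernoulli row (a derivation not spelled out on the slide), writing $p(y \mid \alpha) = \alpha^y (1-\alpha)^{1-y}$ in exponential family form gives

$$p(y \mid \alpha) = \exp\!\Big( y \ln\tfrac{\alpha}{1-\alpha} + \ln(1-\alpha) \Big),
\qquad \theta = \ln\tfrac{\alpha}{1-\alpha}, \quad a(\theta) = \ln(1 + e^{\theta}), \quad b(y) = 0,$$

so that the transfer is $f(\theta) = a'(\theta) = \dfrac{e^{\theta}}{1 + e^{\theta}} = \dfrac{1}{1 + e^{-\theta}}$, the sigmoid.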

Exercise: How do we extract the form for the Poisson distribution?

12

where ◊ œ R is the parameter to the distribution, a : R æ R is a log-normalizer functionand b : R æ R is a function of only x that will typically be ignored in our optimizationbecause it is not a function of ◊. Many of the often encountered (families of) distributionsare members of the exponential family; e.g. exponential, Gaussian, Gamma, Poisson, or thebinomial distributions. Therefore, it is useful to generically study the exponential family tobetter understand commonalities and di�erences between individual member functions.

Example 17: The Poisson distribution can be expressed as

$$p(x \mid \lambda) = \exp(x \log \lambda - \lambda - \log x!),$$

where $\lambda \in \mathbb{R}_+$ and $\mathcal{X} = \mathbb{N}_0$. Thus, $\theta = \log \lambda$, $a(\theta) = e^{\theta}$, and $b(x) = -\log x!$. $\square$

Now let us get some further insight into the properties of the exponential family parameters and why this class is convenient for estimation. The function $a(\theta)$ is typically called the log-partitioning function or simply a log-normalizer. It is called this because

$$a(\theta) = \log \int_{\mathcal{X}} \exp(\theta x + b(x))\, dx$$

and so plays the role of ensuring that we have a valid density: $\int_{\mathcal{X}} p(x)\, dx = 1$. Importantly, for many common GLMs, the derivative of $a$ corresponds to the inverse of the link function. For example, for Poisson regression, the link function is $g(\theta) = \log(\theta)$, and the derivative of $a$ is $e^{\theta}$, which is the inverse of $g$. Therefore, as we discuss below, the log-normalizer for an exponential family informs what link $g$ should be used (or correspondingly the transfer $f = g^{-1}$).

The properties of this log-normalizer are also key for estimation of generalized linear models. It can be derived that

$$\frac{\partial a(\theta)}{\partial \theta} = E[X], \qquad \frac{\partial^2 a(\theta)}{\partial \theta^2} = V[X].$$
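For instance, a quick check (not in the excerpt) for the Poisson member, where $a(\theta) = e^{\theta}$ and $\theta = \log \lambda$:

$$\frac{\partial a(\theta)}{\partial \theta} = e^{\theta} = \lambda = E[X], \qquad \frac{\partial^2 a(\theta)}{\partial \theta^2} = e^{\theta} = \lambda = V[X],$$

matching the known mean and variance of the Poisson distribution.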

7.3 Formalizing generalized linear models

We shall now formalize the generalized linear models. The two key components of GLMs can be expressed as

1. $E[y \mid x] = f(\omega^\top x)$, or $g(E[y \mid x]) = \omega^\top x$ where $g = f^{-1}$

2. $p(y \mid x)$ is an exponential family distribution

The function $f$ is called the transfer function and $g$ is called the link function. For Poisson regression, $f$ is the exponential function, and as we shall see for logistic regression, $f$ is the sigmoid function. The transfer function adjusts the range of $\omega^\top x$ to the domain of $Y$; because of this relationship, link functions are usually not selected independently of the distribution for $Y$. The generalization from the Gaussian distribution used in ordinary least-squares regression to the exponential family allows us to model a much wider range of target functions. GLMs include three widely used models: linear regression, Poisson regression and logistic regression, which we will talk about in the next chapter about classification.


$$p(y \mid \theta) = \exp(\theta y - a(\theta) + b(y))$$

• What is the transfer $f$?

$$f(\theta) = \frac{da(\theta)}{d\theta} = \exp(\theta)$$

Exercise: How do we extract the form for the exponential distribution?

• Recall exponential family distribution

• How do we write the exponential distribution this way?

• What is the transfer f?

13

$$p(y \mid \lambda) = \lambda \exp(-\lambda y), \qquad \lambda > 0$$

$$p(y \mid \theta) = \exp(\theta y - a(\theta) + b(y)) = \exp(\theta y)\,\exp(-a(\theta))\,\exp(b(y))$$

with $\theta = -\lambda$ (and $\theta = x^\top w$ as the linearly predicted natural parameter), $a(\theta) = -\ln(-\theta)$, and $b(y) = 0$, i.e.,

$$f(\theta) = \frac{d}{d\theta} a(\theta) = -\frac{1}{\theta}$$
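A quick sanity check (not on the slide): with $\theta = -\lambda$, the transfer recovers the mean of the exponential distribution,

$$f(\theta) = \frac{d}{d\theta}\big(-\ln(-\theta)\big) = -\frac{1}{\theta} = \frac{1}{\lambda} = E[Y].$$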

Logistic regression

14

the full version of the Newton-Raphson algorithm with the Hessian matrix. Instead, Gauss-Newton and other types of solutions are considered and are generally called iteratively reweighted least-squares (IRLS) algorithms in the statistical literature.

5.4 Logistic regression

At the end, we mention that GLMs extend to classification. One of the most popular uses of GLMs is a combination of a Bernoulli distribution with a logit link function. This framework is frequently encountered and is called logistic regression. We summarize the logistic regression model as follows

1. $\text{logit}(E[y \mid x]) = \omega^\top x$

2. $p(y \mid x) = \text{Bernoulli}(\alpha)$

where $\text{logit}(x) = \ln \frac{x}{1-x}$, $y \in \{0, 1\}$, and $\alpha \in (0, 1)$ is the parameter (mean) of the Bernoulli distribution. It follows that

$$E[y \mid x] = \frac{1}{1 + e^{-\omega^\top x}}$$

and

$$p(y \mid x) = \left(\frac{1}{1 + e^{-\omega^\top x}}\right)^{y} \left(1 - \frac{1}{1 + e^{-\omega^\top x}}\right)^{1-y}.$$

Given a data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^k$ and $y_i \in \{0, 1\}$, the parameters of the model $w$ can be found by maximizing the likelihood function.


[Figure 6.2: the sigmoid function $f(t) = \frac{1}{1 + e^{-t}}$ plotted on the interval $[-5, 5]$.]

as negative ($y = 0$). Thus, the predictor maps a $(d+1)$-dimensional vector $x = (x_0 = 1, x_1, \ldots, x_d)$ into a zero or one. Note that $P(Y = 1 \mid x, w^*) \geq 0.5$ only when $w^\top x \geq 0$. The expression $w^\top x = 0$ represents the equation of a hyperplane that separates positive and negative examples. Thus, the logistic regression model is a linear classifier.

6.1.2 Maximum conditional likelihood estimation

To frame the learning problem as parameter estimation, we will assume that the data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ is an i.i.d. sample from a fixed but unknown probability distribution $p(x, y)$. Even more specifically, we will assume that the data generating process randomly draws a data point $x$, a realization of the random vector $(X_0 = 1, X_1, \ldots, X_d)$, according to $p(x)$ and then sets its class label $Y$ according to the Bernoulli distribution

$$p(y \mid x) = \begin{cases} \left(\dfrac{1}{1 + e^{-\omega^\top x}}\right)^{y} & \text{for } y = 1 \\[6pt] \left(1 - \dfrac{1}{1 + e^{-\omega^\top x}}\right)^{1-y} & \text{for } y = 0 \end{cases} \qquad (6.2)$$

$$= \sigma(x^\top w)^{y}\,(1 - \sigma(x^\top w))^{1-y}$$

where $\omega = (\omega_0, \omega_1, \ldots, \omega_d)$ is a set of unknown coefficients we want to recover (or learn) from the observed data $\mathcal{D}$. Based on the principles of parameter estimation, we can estimate $\omega$ by maximizing the conditional likelihood.

$$\alpha = p(y = 1 \mid x), \qquad g = \text{logit (the link)}$$

$$f(x^\top w) = g^{-1}(x^\top w) = \text{sigmoid}(x^\top w) = E[y \mid x]$$

linear classifiers, our flexibility is restricted because our method must learn the posterior probabilities $p(y \mid x)$ and at the same time have a linear decision surface in $\mathbb{R}^d$. This, however, can be achieved if $p(y \mid x)$ is modeled as a monotonic function of $w^\top x$; e.g., $\tanh(w^\top x)$ or $(1 + e^{-w^\top x})^{-1}$. Of course, a model trained to learn posterior probabilities $p(y \mid x)$ can be seen as a "soft" predictor or a scoring function $s : \mathcal{X} \to [0, 1]$. Then, the conversion from $s$ to $f$ is a straightforward application of the maximum a posteriori principle: the predicted output is positive if $s(x) \geq 0.5$ and negative if $s(x) < 0.5$. More generally, the scoring function can be any mapping $s : \mathcal{X} \to \mathbb{R}$, with thresholding applied based on any particular value $\tau$.

8.1 Logistic regression

Let us consider binary classification in $\mathbb{R}^d$, where $\mathcal{X} = \mathbb{R}^{d+1}$ and $\mathcal{Y} = \{0, 1\}$. Logistic regression is a generalized linear model where the distribution over $Y$ given $x$ is a Bernoulli distribution, and the transfer function is the sigmoid function, also called the logistic function,

$$\sigma(t) = \left(1 + e^{-t}\right)^{-1}$$

plotted in Figure 8.2. In the same terminology as for GLMs, the transfer function is the sigmoid and the link function, the inverse of the transfer function, is the logit function $\text{logit}(x) = \ln \frac{x}{1-x}$, with

1. $E[y \mid x] = \sigma(\omega^\top x)$

2. $p(y \mid x) = \text{Bernoulli}(\alpha)$ with $\alpha = E[y \mid x]$.

The Bernoulli distribution, with $\alpha$ a function of $x$, is

$$p(y \mid x) = \begin{cases} \left(\dfrac{1}{1 + e^{-\omega^\top x}}\right)^{y} & \text{for } y = 1 \\[6pt] \left(1 - \dfrac{1}{1 + e^{-\omega^\top x}}\right)^{1-y} & \text{for } y = 0 \end{cases} \qquad (8.1)$$

$$= \sigma(x^\top w)^{y}\,(1 - \sigma(x^\top w))^{1-y}$$

where $\omega = (\omega_0, \omega_1, \ldots, \omega_d)$ is a set of unknown coefficients we want to recover (or learn). Notice that our prediction is $\sigma(\omega^\top x)$, and that it satisfies

$$p(y = 1 \mid x) = \sigma(\omega^\top x).$$

Therefore, as with many binary classification approaches, our goal is to predict the probability that the class is 1; given this probability, we can infer $p(y = 0 \mid x) = 1 - p(y = 1 \mid x)$.

8.1.1 Predicting class labels

The function learned by logistic regression returns a probability, rather than an explicit prediction of 0 or 1. Therefore, we have to take this probability estimate and convert it to a suitable prediction of the class. For a previously unseen data point $x$ and a set of learned coefficients $w$, we simply calculate the posterior probability as

$$P(Y = 1 \mid x, w) = \frac{1}{1 + e^{-w^\top x}}.$$

What is c(w) for GLMs?

• Still formulating an optimization problem to predict targets y given features x

• The variable we learn is the weight vector w

• What is c(w)?

15

$$\text{MLE}: \quad c(w) \propto -\ln p(\mathcal{D} \mid w) \propto -\sum_{i=1}^{n} \ln p(y_i \mid x_i, w)$$

$$\arg\min_w c(w) = \arg\max_w p(\mathcal{D} \mid w)$$

Cross-entropy loss for Logistic Regression

16

$$c_i(w) = -\big(y_i \ln \sigma(w^\top x_i) + (1 - y_i)\ln(1 - \sigma(w^\top x_i))\big)$$
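A small numpy sketch (not from the slides) of this loss summed over a dataset; the sigmoid is defined inline and the $10^{-12}$ guard against $\log 0$ is an added implementation detail.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    """Sum of c_i(w) = -[y_i ln sigma(w^T x_i) + (1 - y_i) ln(1 - sigma(w^T x_i))]."""
    p = sigmoid(X @ w)
    eps = 1e-12                      # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```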

Extra exercises

• Go through the derivation of c(w) for logistic regression

• Derive Maximum Likelihood objective in Section 8.1.2

17

Benefits of GLMs

• Gave a generic update rule, where you only need to know the transfer for your chosen distribution (see the sketch after this list)
  • e.g., linear regression with transfer f = identity

• e.g., Poisson regression with transfer f = exp

• e.g., logistic regression with transfer f = sigmoid

• We know the objective is convex in w!

18
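A hedged sketch of the generic update the first bullet refers to: for an exponential family with natural parameter $\theta_i = x_i^\top w$ (the canonical link), the gradient of the negative log likelihood is $X^\top(f(Xw) - y)$, so only the transfer $f$ changes between the three models. The step size and iteration count below are arbitrary.

```python
import numpy as np

def glm_gradient_descent(X, y, transfer, eta=0.01, steps=1000):
    """Minimize the GLM negative log likelihood with theta_i = x_i^T w.
    For the canonical link, the gradient is X^T (f(X w) - y), where f is the transfer."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= eta * X.T @ (transfer(X @ w) - y)
    return w

# Only the transfer changes between the models:
transfer_linear   = lambda z: z                          # linear regression
transfer_poisson  = np.exp                               # Poisson regression
transfer_logistic = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic regression
```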

Convexity

• Convexity of the negative log likelihood of (many) exponential families
  • The negative log likelihood of many exponential families is convex, which is an important advantage of the maximum likelihood approach
• Why is convexity important?
  • e.g., (sigmoid(xw) - y)^2 is nonconvex, but who cares?

19

Cross-entropy loss versus Euclidean loss for classification

• Why not just use the squared error objective shown below, instead of the cross-entropy?

• The notes explain that this is a non-convex objective
  • from personal experience, it seems to do more poorly in practice

• If no obvious reason to prefer one or the other, we may as well pick the objective that is convex (no local minima)

20

$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^{n} \left(\sigma(x_i^\top w) - y_i\right)^2$$

versus the cross-entropy loss

$$c_i(w) = -\big(y_i \ln \sigma(w^\top x_i) + (1 - y_i)\ln(1 - \sigma(w^\top x_i))\big)$$

How can we check convexity?

• Can check the definition of convexity
• Can check the second derivative for scalar parameters (e.g., $\lambda$) and the Hessian for multidimensional parameters (e.g., $w$); a numerical check follows this slide
  • e.g., for linear regression (least-squares), the Hessian is $H = X^\top X$ and so positive semi-definite
  • e.g., for Poisson regression, the Hessian of the negative log-likelihood is $H = X^\top C X$ (with $C$ a diagonal matrix of positive entries $\exp(x_i^\top w)$) and so positive semi-definite

21
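A numerical illustration of these claims (not part of the notes): build the two Hessians on random data and confirm that their eigenvalues are non-negative. The data and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
w = rng.normal(size=4)

# Linear regression (least squares): H = X^T X.
H_lin = X.T @ X

# Poisson regression NLL: H = X^T C X with C = diag(exp(x_i^T w)) > 0.
C = np.diag(np.exp(X @ w))
H_pois = X.T @ C @ X

print(np.linalg.eigvalsh(H_lin).min() >= -1e-8)   # True: positive semi-definite
print(np.linalg.eigvalsh(H_pois).min() >= -1e-8)  # True: positive semi-definite
```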

[Figure: a convex function on an interval. A function (in black) is convex if and only if the region above its graph (in green) is a convex set.]

Convex function (from Wikipedia, the free encyclopedia)

In mathematics, a real-valued function f(x) defined on an interval is called convex (or convex downward or concave upward) if the line segment between any two points on the graph of the function lies above or on the graph, in a Euclidean space (or more generally a vector space) of at least two dimensions. Equivalently, a function is convex if its epigraph (the set of points on or above the graph of the function) is a convex set. Well-known examples of convex functions are the quadratic function $x^2$ and the exponential function $e^x$ for any real number x.

Convex functions play an important role in many areas of mathematics. They are especially important in the study of optimization problems, where they are distinguished by a number of convenient properties. For instance, a (strictly) convex function on an open set has no more than one minimum. Even in infinite-dimensional spaces, under suitable additional hypotheses, convex functions continue to satisfy such properties and, as a result, they are the most well-understood functionals in the calculus of variations. In probability theory, a convex function applied to the expected value of a random variable is always less than or equal to the expected value of the convex function of the random variable. This result, known as Jensen's inequality, underlies many important inequalities (including, for instance, the arithmetic–geometric mean inequality and Hölder's inequality).

Exponential growth is a special case of convexity. Exponential growth narrowly means "increasing at a rate proportional to the current value", while convex growth generally means "increasing at an increasing rate (but not necessarily proportionally to current value)".


Definition

Let X be a convex set in a real vector space and let f : X → R be a function.

f is called convex if, for all $x_1, x_2 \in X$ and all $t \in [0, 1]$, $f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2)$.

f is called strictly convex if the inequality is strict whenever $x_1 \neq x_2$ and $t \in (0, 1)$.


Prediction with logistic regression

• So far, we have used the prediction f(xw)
  • e.g., xw for linear regression, exp(xw) for Poisson regression

• For binary classification, want to output 0 or 1, rather than the probability value p(y = 1 | x) = sigmoid(xw)

• The sigmoid maps only a narrow range of xw to values near 0.5; values of xw even somewhat below 0 are mapped close to 0 (and values somewhat above 0 are mapped close to 1)

• Decision threshold:
  • sigmoid(xw) < 0.5 is class 0
  • sigmoid(xw) > 0.5 is class 1
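A minimal sketch of this decision rule (not from the slides); the threshold argument is only for flexibility, with the 0.5 default matching the slide.

```python
import numpy as np

def predict_label(x, w, threshold=0.5):
    """Return class 1 if sigmoid(x^T w) >= threshold, else class 0."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return int(p >= threshold)
```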

22

[Figure 6.2: the sigmoid function $f(t) = \frac{1}{1 + e^{-t}}$ on the interval $[-5, 5]$.]


Logistic regression is a linear classifier

• A hyperplane separates the two classes
  • P(y=1 | x, w) > 0.5 only when $w^\top x > 0$
  • P(y=0 | x, w) > 0.5 only when P(y=1 | x, w) < 0.5, which happens when $w^\top x < 0$

23

[Figure 6.1: A data set in $\mathbb{R}^2$ consisting of nine positive and nine negative examples, with the linear decision surface $w_0 + w_1 x_1 + w_2 x_2 = 0$ shown as a gray line. The decision surface does not perfectly separate positives from negatives.]

to $f$ is a straightforward application of the maximum a posteriori principle: the predicted output is positive if $g(x) \geq 0.5$ and negative if $g(x) < 0.5$.

6.1 Logistic regression

Let us consider binary classification in $\mathbb{R}^d$, where $\mathcal{X} = \mathbb{R}^{d+1}$ and $\mathcal{Y} = \{0, 1\}$. The basic idea for many classification approaches is to hypothesize a closed-form representation for the posterior probability that the class label is positive and learn parameters $w$ from data. In logistic regression, this relationship can be expressed as

$$P(Y = 1 \mid x, w) = \frac{1}{1 + e^{-w^\top x}}, \qquad (6.1)$$

which is a monotonic function of $w^\top x$. The function $\sigma(t) = \left(1 + e^{-t}\right)^{-1}$ is called the sigmoid function or the logistic function and is plotted in Figure 6.2.

6.1.1 Predicting class labels

For a previously unseen data point $x$ and a set of coefficients $w^*$ found from Eq. (6.4) or Eq. (6.11), we simply calculate the posterior probability as

$$P(Y = 1 \mid x, w^*) = \frac{1}{1 + e^{-w^{*\top} x}}.$$

If $P(Y = 1 \mid x, w^*) \geq 0.5$ we conclude that data point $x$ should be labeled as positive ($y = 1$). Otherwise, if $P(Y = 1 \mid x, w^*) < 0.5$, we label the data point as negative ($y = 0$).


e.g., $w = [2.75,\ -1/3,\ -1]$

$x = [1, 4, 3]$: $\ x^\top w = -1.8$ (negative, so class 0)

$x = [1, 2, 1]$: $\ x^\top w = 0.27$ (positive, so class 1)

Logistic regression versus Linear regression

• Why might one be better than the other? They both use a linear approach

• Linear regression could still learn <x, w> to predict E[Y | x]

• Demo: logistic regression performs better under outliers, when the outlier is still on the correct side of the line

• Conclusion:
  • logistic regression better reflects the goals of predicting p(y=1 | x) and finding a separating hyperplane
  • linear regression assumes E[Y | x] is a linear function of x!

24

Adding regularizers to GLMs

• How do we add regularization to logistic regression?

• We had an optimization for logistic regression to get w: minimize negative log-likelihood, i.e. minimize cross-entropy

• Now want to balance negative log-likelihood and regularizer (i.e., the prior for MAP)

• Simply add regularizer to the objective function

25

Adding a regularizer to logistic regression

• Original objective function for logistic regression

26

…everything more suitable for further steps, we will slightly rearrange Eq. (6.6). Notice first that

$$\left(1 - \frac{1}{1 + e^{-w^\top x_i}}\right) = \frac{e^{-w^\top x_i}}{1 + e^{-w^\top x_i}}$$

giving

$$ll(w) = \sum_{i=1}^{n} \Big( -y_i \cdot \log\big(1 + e^{-w^\top x_i}\big) + (1 - y_i)\cdot \log\big(e^{-w^\top x_i}\big) - (1 - y_i)\cdot \log\big(1 + e^{-w^\top x_i}\big) \Big)$$

$$= \sum_{i=1}^{n} \left( (y_i - 1)\, w^\top x_i + \log\left(\frac{1}{1 + e^{-w^\top x_i}}\right) \right). \qquad (6.7)$$

It does not take much effort to realize that there is no closed-form solution to $\nabla ll(w) = 0$ (we did have this luxury in linear regression, but not here). Thus, we have to proceed with iterative optimization methods. We will calculate $\nabla ll(w)$ and $H_{ll}(w)$ to enable either method (first-order with only the gradient, or second-order with the gradient and Hessian), both written as a function of inputs $X$, class labels $y$, and the current parameter vector. We can calculate the first and second partial derivatives of $ll(w)$ as follows

$$\frac{\partial ll(w)}{\partial w_j} = \sum_{i=1}^{n} \left( (y_i - 1)\cdot x_{ij} - \frac{1}{1 + e^{-w^\top x_i}} \cdot e^{-w^\top x_i} \cdot (-x_{ij}) \right)$$

$$= \sum_{i=1}^{n} x_{ij} \cdot \left( y_i - 1 + \frac{e^{-w^\top x_i}}{1 + e^{-w^\top x_i}} \right)$$

$$= \sum_{i=1}^{n} x_{ij} \cdot \left( y_i - \frac{1}{1 + e^{-w^\top x_i}} \right)$$

$$= \mathbf{f}_j^\top (y - p),$$

where $\mathbf{f}_j$ is the $j$-th column (feature) of the data matrix $X$, $y$ is an $n$-dimensional column vector of class labels and $p$ is an $n$-dimensional column vector of (estimated) posterior probabilities $p_i = P(Y_i = 1 \mid x_i, w)$, for $i = 1, \ldots, n$. Considering partial derivatives for every component of $w$, we have

$$\nabla ll(w) = X^\top (y - p). \qquad (6.8)$$

The second partial derivative of the log-likelihood function can be found as

$$\frac{\partial^2 ll(w)}{\partial w_j \partial w_k} = \sum_{i=1}^{n} x_{ij} \cdot \frac{e^{-w^\top x_i}}{\big(1 + e^{-w^\top x_i}\big)^2} \cdot (-x_{ik})$$

$$= -\sum_{i=1}^{n} x_{ij} \cdot \frac{1}{1 + e^{-w^\top x_i}} \cdot \left(1 - \frac{1}{1 + e^{-w^\top x_i}}\right) \cdot x_{ik}$$

$$= -\mathbf{f}_j^\top \mathbf{P}(\mathbf{I} - \mathbf{P})\, \mathbf{f}_k.$$
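A numpy sketch (not from the text) of the gradient (6.8) and Hessian just derived, plus one Newton step built from them (the second-order method the excerpt refers to); initialization and any step damping are left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_ll(w, X, y):
    """Gradient of the log likelihood, Eq. (6.8): X^T (y - p)."""
    p = sigmoid(X @ w)
    return X.T @ (y - p)

def hess_ll(w, X):
    """Hessian of the log likelihood: -X^T P (I - P) X with P = diag(p_i)."""
    p = sigmoid(X @ w)
    return -X.T @ (X * (p * (1 - p))[:, None])

def newton_step(w, X, y):
    # Maximizing ll(w): w_new = w - H^{-1} grad (H is negative definite here).
    return w - np.linalg.solve(hess_ll(w, X), grad_ll(w, X, y))
```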

$$\arg\max_w \; ll(w) \;=\; \arg\min_w \; -ll(w)$$


• Adding a regularizer:


$$\arg\min_w \; -ll(w) + \lambda \|w\|_2^2$$
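A sketch of this regularized objective and its gradient (the data, $\lambda$, and the small $\log$ guard are placeholders/implementation details); differentiating $\lambda\|w\|_2^2$ contributes $2\lambda w$ to the gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_objective(w, X, y, lam):
    """Negative log likelihood plus lambda * ||w||_2^2."""
    p = sigmoid(X @ w)
    nll = -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return nll + lam * (w @ w)

def reg_gradient(w, X, y, lam):
    p = sigmoid(X @ w)
    return X.T @ (p - y) + 2 * lam * w   # gradient of -ll(w) plus penalty term
```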

Other regularizers

• Have discussed l2 and l1 regularizers
• Other examples:
  • elastic net regularization is a combination of l1 and l2 (i.e., l1 + l2): ensures a unique solution

• capped regularizers: do not prevent large weights

Does this regularizer still protect against overfitting?

27

* Figure from "Robust Dictionary Learning with Capped l1-Norm", Jiang et al., IJCAI 2015

Practical considerations: outliers

• What happens if one sample is bad?

• Regularization helps a little bit

• Can also change losses

• Robust losses
  • use l1 instead of l2

• even better: use capped l1

• What are the disadvantages to these losses?

28

Exercise: intercept unit

• In linear regression, we added an intercept unit (bias unit) to the features
  • i.e., added a feature that is always 1 to the feature vector
• Does it make sense to do this for GLMs?
  • e.g., sigmoid(<x,w> + w_0)

29

Adding a column of ones to GLMs

• This provides the same outcome as for linear regression

• g(E[y | x]) = $x^\top w$ → a bias unit in x with coefficient $w_0$ shifts the function left or right
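A tiny sketch of the usual trick, the same as in linear regression: prepend a constant-one column so that $w_0$ acts as the intercept inside the transfer.

```python
import numpy as np

def add_bias_column(X):
    """Prepend a column of ones so that x^T w = w_0 + w_1 x_1 + ... + w_d x_d."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# sigmoid(<x, w> + w_0) then becomes sigmoid(x_aug @ w) with x_aug = [1, x].
```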

30

* Figure from http://stackoverflow.com/questions/2480650/role-of-bias-in-neural-networks
