
Page 1: Linear Regression & Gradient Descent

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.

Page 2: Regression

Given:
– Data $X = \{x^{(1)}, \dots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \{y^{(1)}, \dots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020, with linear and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013).]

Page 3: Linear Regression

• Hypothesis: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$ (assume $x_0 = 1$)

• Fit model by minimizing sum of squared errors

[Figure: data points with candidate regression fits]

Figures are courtesy of Greg Shakhnarovich

Page 4: Least Squares Linear Regression

• Cost function:

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$

• Fit by solving $\min_\theta J(\theta)$
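This cost translates directly into code. Below is a minimal NumPy sketch of $J(\theta)$ (the names cost, X, y, and theta are assumptions of this example, not from the slides):

import numpy as np

def cost(theta, X, y):
    # J(theta) = 1/(2n) * sum_i (h_theta(x_i) - y_i)^2, where X is the
    # (n, d+1) data matrix with a leading column of ones (x_0 = 1).
    n = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / (2 * n)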

Page 5: Intuition Behind Cost Function

Slide by Andrew Ng

Pages 6–9: Intuition Behind Cost Function

(For fixed $\theta$, $h_\theta(x)$ is a function of $x$; $J(\theta)$ is a function of the parameters. The same caption and pair of plots repeat across these pages for different parameter choices.)

Slides by Andrew Ng

Page 10: Basic Search Procedure

• Choose initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$

[Figure: surface plot of $J(\theta_0, \theta_1)$]

Figure by Andrew Ng

Pages 11–12: Basic Search Procedure

(The same bullets and figure repeated; page 12 adds the note below.)

Since the least squares objective function is convex, we don't need to worry about local minima in linear regression.

Page 13: Gradient Descent

• Initialize $\theta$
• Repeat until convergence (simultaneous update, for $j = 0 \dots d$):

$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

where $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$

[Figure: one-dimensional $J(\theta)$ curve illustrating the descent steps]
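As a sketch of what "simultaneous update" means in code (grad_J is an assumed helper returning the full gradient vector; it is not defined on the slides):

import numpy as np

def gd_step(theta, grad_J, alpha=0.05):
    # theta_j <- theta_j - alpha * dJ/dtheta_j for every j. Because the
    # whole gradient is computed before any coordinate changes, all
    # components are updated from the same old theta: a simultaneous update.
    return theta - alpha * np.asarray(grad_J(theta))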

Page 14: Gradient Descent

• Initialize $\theta$
• Repeat until convergence (simultaneous update, for $j = 0 \dots d$):

$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

For linear regression:

$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$

$= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2$

$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)$

$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}$

$= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$
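One way to gain confidence in this derivation is a finite-difference check; here is a hedged sketch on a tiny random problem (all data below are made up purely for the check):

import numpy as np

def cost(theta, X, y):
    r = X @ theta - y
    return r @ r / (2 * len(y))

def grad_J(theta, X, y):
    # Analytic gradient from the derivation: (1/n) sum_i (h - y) * x_i
    return X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # x_0 = 1 column
y = rng.normal(size=5)
theta = rng.normal(size=3)
eps = 1e-5
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(numeric, grad_J(theta, X, y), atol=1e-6)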

Page 15: Gradient Descent for Linear Regression

• Initialize $\theta$
• Repeat until convergence (simultaneous update, for $j = 0 \dots d$):

$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$

• To achieve the simultaneous update:
  – At the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$
  – Use this stored value in the update step loop

• Assume convergence when $\left\| \theta_{\text{new}} - \theta_{\text{old}} \right\|_2 < \epsilon$

$L_2$ norm: $\|v\|_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \dots + v_{|v|}^2}$
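Putting the pieces together, a minimal NumPy sketch of batch gradient descent for linear regression with this convergence test (alpha, eps, and max_iters are illustrative defaults, not values fixed by the slides):

import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=100_000):
    # X is (n, d+1) with a leading column of ones; stop when the L2 norm
    # of the parameter change drops below eps.
    n, d1 = X.shape
    theta = np.zeros(d1)
    for _ in range(max_iters):
        predictions = X @ theta                 # h_theta(x_i), stored once per iteration
        gradient = X.T @ (predictions - y) / n  # full gradient -> simultaneous update
        theta_new = theta - alpha * gradient
        if np.linalg.norm(theta_new - theta) < eps:
            return theta_new
        theta = theta_new
    return theta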

Page 16: Gradient Descent

(For fixed $\theta$, $h_\theta(x)$ is a function of $x$; $J(\theta)$ is a function of the parameters.)

$h(x) = -900 - 0.1x$

Slide by Andrew Ng

Pages 17–24: Gradient Descent

(The same caption and pair of plots, repeated as gradient descent iterates toward the minimum.)

Slides by Andrew Ng

Page 25: Choosing α

α too small: slow convergence

α too large: increasing value for $J(\theta)$
• May overshoot the minimum
• May fail to converge
• May even diverge

To see if gradient descent is working, print out $J(\theta)$ each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α
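A hedged sketch of this diagnostic (a verbose variant of the gradient descent sketch above; the names are again illustrative):

import numpy as np

def gradient_descent_verbose(X, y, alpha, iters=50):
    # Print J(theta) every iteration; if the printed values grow instead
    # of shrinking, alpha is too large and should be reduced.
    n, d1 = X.shape
    theta = np.zeros(d1)
    for t in range(iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / n
        r = X @ theta - y
        print(f"iter {t}: J(theta) = {r @ r / (2 * n):.6f}")
    return theta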

Page 26: Extending Linear Regression to More Complex Models

• The inputs X for linear regression can be:
  – Original quantitative inputs
  – Transformations of quantitative inputs
    • e.g., log, exp, square root, square, etc.
  – Polynomial transformations
    • example: $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3$
  – Basis expansions
  – Dummy coding of categorical inputs
  – Interactions between variables
    • example: $x_3 = x_1 \times x_2$

This allows use of linear regression techniques to fit non-linear datasets; a sketch of such a feature expansion follows.
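For instance, a minimal sketch of building such an expanded design matrix by hand (the particular columns are illustrative choices):

import numpy as np

def expand_features(x1, x2):
    # Stack a bias, the original inputs, a squared term, and an
    # interaction term; linear regression on these columns can then
    # fit a non-linear surface in (x1, x2).
    return np.column_stack([
        np.ones_like(x1),  # bias column, x_0 = 1
        x1,                # original quantitative inputs
        x2,
        x1 ** 2,           # polynomial transformation
        x1 * x2,           # interaction: x_3 = x_1 * x_2
    ])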

Page 27: Linear Basis Function Models

• Generally, $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$, where $\phi_j(x)$ is a basis function

• Typically $\phi_0(x) = 1$, so that $\theta_0$ acts as a bias
• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$

Based on slide by Christopher Bishop (PRML)

Page 28: Linear Basis Function Models

• Polynomial basis functions: $\phi_j(x) = x^j$
  – These are global; a small change in $x$ affects all basis functions

• Gaussian basis functions: $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2s^2} \right)$
  – These are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).

Based on slide by Christopher Bishop (PRML)
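A minimal sketch of regression with Gaussian basis functions (the centers mu and width s below are arbitrary illustrative choices; the fit uses a direct least squares solve rather than gradient descent):

import numpy as np

def gaussian_design_matrix(x, mu, s):
    # Phi[i, j] = exp(-(x_i - mu_j)^2 / (2 s^2)), plus a bias column.
    phi = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=50)
Phi = gaussian_design_matrix(x, mu=np.linspace(0, 1, 9), s=0.1)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)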

Page 29: Linear Basis Function Models

• Sigmoidal basis functions: $\phi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
  – These are also local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).

Based on slide by Christopher Bishop (PRML)

Page 30: Example of Fitting a Polynomial Curve with a Linear Model

$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$
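Since this model is still linear in $\theta$, a standard least squares routine fits it directly; here is a sketch using NumPy's polynomial fitting (the data and degree are made-up illustrations):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + 0.05 * rng.normal(size=30)
# Least squares over the monomial basis 1, x, ..., x^p (here p = 3);
# returns the coefficients theta_0, ..., theta_p.
theta = np.polynomial.polynomial.polyfit(x, y, deg=3)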

Page 31: Quality of Fit

Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ... but fails to generalize to new examples

[Figure: three Price vs. Size fits: underfitting (high bias), correct fit, overfitting (high variance)]

Based on example by Andrew Ng

Page 32: Regularization

• A method for automatically controlling the complexity of the learned hypothesis

• Idea: penalize large values of $\theta_j$
  – Can incorporate into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label

• Can also address overfitting by eliminating features (either manually or via model selection)

Page 33: Regularization

• Linear regression objective function:

$J(\theta) = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2}_{\text{model fit to data}} + \underbrace{\frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2}_{\text{regularization}}$

– $\lambda$ is the regularization parameter ($\lambda \geq 0$)
– No regularization on $\theta_0$!
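A hedged NumPy sketch of this regularized cost (note that theta[0] is excluded from the penalty, matching the slide; lam stands in for $\lambda$):

import numpy as np

def cost_ridge(theta, X, y, lam):
    # J(theta) = 1/(2n) * sum_i (h - y)^2 + lam/2 * sum_{j>=1} theta_j^2
    n = len(y)
    r = X @ theta - y
    return r @ r / (2 * n) + lam / 2 * (theta[1:] @ theta[1:])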

Page 34: Understanding Regularization

• Note that $\sum_{j=1}^{d} \theta_j^2 = \|\theta_{1:d}\|_2^2$
  – This is the squared magnitude of the feature coefficient vector!

• We can also think of this as $\sum_{j=1}^{d} (\theta_j - 0)^2 = \|\theta_{1:d} - \vec{0}\|_2^2$

• $L_2$ regularization pulls the coefficients toward 0

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

Page 35: Understanding Regularization

• What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)?
  – The penalty dominates, driving $\theta_1, \dots, \theta_d$ toward 0, so the fit collapses to the constant $h_\theta(x) \approx \theta_0$

[Figure: Price vs. Size with a nearly flat fit, the coefficients annotated as $\approx 0$]

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

Based on example by Andrew Ng

Page 36: Regularized Linear Regression

• Cost function:

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

• Fit by solving $\min_\theta J(\theta)$

• Gradient update ($\frac{\partial}{\partial \theta_0} J(\theta)$ has no regularization term; $\frac{\partial}{\partial \theta_j} J(\theta)$ does):

$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$

$\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right] \quad (j = 1 \dots d)$
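A hedged sketch of one such update step (excluding theta[0] from the penalty, as on the slide; the function name is illustrative):

import numpy as np

def ridge_gd_step(theta, X, y, alpha, lam):
    # theta_0 gets the plain gradient; theta_1..d also get + lam * theta_j.
    n = len(y)
    grad = X.T @ (X @ theta - y) / n
    grad[1:] += lam * theta[1:]   # no regularization on theta_0
    return theta - alpha * grad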

Page 37: Regularized Linear Regression

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$

$\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$

• We can rewrite the gradient step as:

$\theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$

Each step first shrinks $\theta_j$ by the factor $(1 - \alpha \lambda)$ and then applies the usual (unregularized) gradient update, which is why $L_2$ regularization is also known as weight decay.