Bayesian Linear Regression
Pattern Recognition 2016
Sandro Schönborn, University of Basel


Source: informatik.unibas.ch/fileadmin/Lectures/HS2016/pattern-recognition/... (figures: Bishop, PRML, 2006)

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 | BASEL


Outline

• Regression problem
  • Continuous label: no classes
  • Accessible Bayesian example
• Least squares regression
• Bayesian regression: weighted average of all models
• Uncertainty: Bayesian inference and subjective probability
• Outlook: kernel ridge regression and Gaussian processes

Motivation: Regression

Not all data inference problems are about classification. Sometimes we need to predict a continuous value (e.g. the price of a fish instead of its class).
• Machine learning problem now with continuous labels: regression

We did well with probabilistic methods. They deliver good and valuable results. The discriminative approach is simpler.
• Regression as a discriminative, probabilistic method

More than one solution is good. We want to average over all possible results and not select only the single best one.
• Regression is a tractable example of a Bayesian method

Regression

[Figure: two scatter plots over features $x_1$ and $x_2$; left panel: regression, right panel: classification.]

Regression: Formal Setup

• Data: $\boldsymbol{x} \in \mathbb{R}^d$; for now, standard vector-space data (feature vector)
• Labels: $y \in \mathbb{R}$; labels are continuous
• Training data: $D = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}$, known labels for our training data
• Goal: regression on test data
  • Predict a good label for a given datum $\boldsymbol{x}$: $\hat{y} = f(\boldsymbol{x})$
  • Machine learning problem: find a function $f$ to predict the label
• Learning/estimation on (limited) training data
• Prediction quality with respect to (unknown) test data

Linear Regression

• Standard method: linear least squares fit to data
  • Known in 1d from school ("Ausgleichsgerade", the line of best fit)
  • Known in n dimensions from basic lectures
• Linear model for the label variable $y$:
$$y = \boldsymbol{w}^T\boldsymbol{x}$$
• Training/learning with a dataset $D = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}$

How to find $\boldsymbol{w}, w_0$? How to measure the label/prediction error?

We use an old trick to keep it simple (absorb the bias into the weight vector):
$$\boldsymbol{w} := \begin{pmatrix} w_0 \\ \boldsymbol{w} \end{pmatrix}, \qquad \boldsymbol{x} := \begin{pmatrix} 1 \\ \boldsymbol{x} \end{pmatrix}$$

Least Squares Solution

The linear model should fit the training data optimally. The easiest loss function to minimize is the squared error:
$$L(y, \boldsymbol{x}, f) = \left(y - f(\boldsymbol{x})\right)^2$$

Training: find $\boldsymbol{w}, w_0$ such that the sum of the squared reconstruction errors over the training set is minimal:
$$\boldsymbol{w}, w_0 = \operatorname*{argmin}_{\boldsymbol{w}, w_0} \sum_i \left(y_i - \boldsymbol{w}^T\boldsymbol{x}_i\right)^2$$

Well-known solution: $\boldsymbol{w} = \left(\boldsymbol{X}\boldsymbol{X}^T\right)^{-1}\boldsymbol{X}\boldsymbol{y}$

Probabilistic Setup

In our probabilistic setup, we have a distribution of predictions given a data point: $P(y \mid \boldsymbol{x})$

• Similar to the posterior class probability with Bayes, but the label is now continuous: there are more than two values!
• The best single prediction to make depends on our risk function. Very often it is the expected value (e.g. for the squared-loss risk):
$$\hat{y} = E[y \mid \boldsymbol{x}]$$
• Direct posterior model: a discriminative method

Probabilistic Setup

We use a simple posterior model for the label given the data:
$$P(y \mid \boldsymbol{x}; \boldsymbol{w}) = \mathcal{N}\!\left(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2\right)$$

• Each observation is affected by a noise value $\varepsilon \sim \mathcal{N}(\varepsilon \mid 0, \sigma^2)$:
$$y = \boldsymbol{w}^T\boldsymbol{x} + \varepsilon$$
• The single best prediction of $y$ is standard linear regression:
$$\hat{y} = E[y] = \boldsymbol{w}^T\boldsymbol{x}$$
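The noise model above is easy to simulate; a minimal NumPy sketch (the weight vector and noise level are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up illustration values, not from the slides
w = np.array([1.0, 2.0, -0.5])   # weights; the first entry acts as the bias w0
sigma = 0.3                      # observation noise standard deviation
N = 1000

# Features with a leading 1 to absorb the bias, as introduced earlier
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])

# y = w^T x + eps with eps ~ N(0, sigma^2)
y = X @ w + rng.normal(0.0, sigma, size=N)

# The residual around the best single prediction E[y|x] = w^T x is exactly the noise
residual = y - X @ w
print(residual.std())  # close to sigma = 0.3
```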


Maximum Likelihood: Regression

• The discriminative probabilistic model can be trained by maximum-likelihood estimation
• The result is identical to the known least squares solution. Least squares usually corresponds to Gaussian noise assumptions
• Again: maximize the posterior of the data (discriminative likelihood):
$$\boldsymbol{w}, w_0 = \operatorname*{argmax}_{\boldsymbol{w}, w_0} P(Y \mid \boldsymbol{X}, \boldsymbol{w})$$
$$P(Y \mid \boldsymbol{X}) = \prod_i P(y_i \mid \boldsymbol{x}_i) = \prod_i \mathcal{N}\!\left(y_i \mid \boldsymbol{w}^T\boldsymbol{x}_i, \sigma^2\right)$$

Maximum Likelihood: Regression

$$\log P(Y \mid \boldsymbol{X}) = \sum_i \left[ -\frac{1}{2\sigma^2}\left(y_i - \boldsymbol{w}^T\boldsymbol{x}_i\right)^2 - \frac{1}{2}\log 2\pi - \log\sigma \right]$$

$$\frac{\partial}{\partial\boldsymbol{w}} \log P(Y \mid \boldsymbol{X}) = \sum_i \frac{1}{\sigma^2}\left(y_i - \boldsymbol{w}^T\boldsymbol{x}_i\right)\boldsymbol{x}_i^T \stackrel{!}{=} 0$$

$$\sum_i \left(y_i - \boldsymbol{w}^T\boldsymbol{x}_i\right)\boldsymbol{x}_i^T = 0$$

$$\boldsymbol{w}_{\mathrm{ML}} = \left(\sum_i \boldsymbol{x}_i\boldsymbol{x}_i^T\right)^{-1} \sum_i y_i\boldsymbol{x}_i$$

Data Matrix Notation

Using matrix notation, the result becomes more accessible:
$$\sum_i \boldsymbol{x}_i\boldsymbol{x}_i^T = \boldsymbol{X}\boldsymbol{X}^T, \qquad \sum_i y_i\boldsymbol{x}_i = \boldsymbol{X}\boldsymbol{y}$$
$$\boldsymbol{X} = \left(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\right), \qquad \boldsymbol{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$
$$\boldsymbol{w}_{\mathrm{ML}} = \left(\boldsymbol{X}\boldsymbol{X}^T\right)^{-1}\boldsymbol{X}\boldsymbol{y}$$

The standard least squares solution, via the pseudo-inverse of the matrix $\boldsymbol{X}$!
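The closed-form solution is easy to check numerically. A sketch with made-up data (note the slide convention that the samples are the columns of X):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: 3 features, 50 samples; columns of X are the samples x_i
N, d = 50, 3
X = rng.normal(size=(d, N))
w_true = np.array([0.5, -1.0, 2.0])
y = X.T @ w_true + rng.normal(0.0, 0.1, size=N)

# w_ML = (X X^T)^{-1} X y; solve the linear system instead of inverting explicitly
w_ml = np.linalg.solve(X @ X.T, X @ y)

# Agrees with NumPy's least squares routine (which expects samples as rows)
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)
print(np.allclose(w_ml, w_lstsq))  # True
```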


Shortcoming 1: Outliers

• Outliers affect results
  1. Least squares: outliers affect the squared loss massively
  2. Probabilistic: the Gaussian has very low probability for large deviations

Real problem: illumination estimation. [Figure: too dark with sunglasses vs. robust estimation.]

Least squares solutions tend to equalize all errors.

Shortcoming 2: Overfitting

• Too many parameters lead to undecidable models, or to models which can explain the data perfectly (overfitting)
• In general, we have multiple solutions which fit the data

[Figure: model too simple / model fits data / overfitting, too complex.] Illustration with fitting polynomials of degree $M$ (non-linear basis functions). Figs: Bishop PRML, 2006

Regularization

As a solution, we introduce prior assumptions about the solution $\boldsymbol{w}$. Actually, we make our prior assumptions explicit: you always have them.

We want to prefer small $\boldsymbol{w}$: the model should show a tendency towards lower influence of a feature when not enough data is available.

[Figure: desired regularization. Figs: Bishop PRML, 2006]

Regularized Regression: MAP

The natural way of dealing with priors in the probabilistic view is the Maximum-a-Posteriori (MAP) estimate:
$$\boldsymbol{w}_{\mathrm{MAP}} = \operatorname*{argmax}_{\boldsymbol{w}} P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})$$

The Gaussian prior is a very common choice; we prefer solutions with a small magnitude:
$$P(\boldsymbol{w}) = \mathcal{N}\!\left(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I}\right)$$

$$\boldsymbol{w}_{\mathrm{MAP}} = \operatorname*{argmax}_{\boldsymbol{w}} \prod_i \mathcal{N}\!\left(y_i \mid \boldsymbol{w}^T\boldsymbol{x}_i, \sigma^2\right) \mathcal{N}\!\left(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I}\right)$$

This will lead to regularized least squares.

MAP Estimate

$$\log\left[P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})\right] = \sum_i \left[ -\frac{1}{2\sigma^2}\left(y_i - \boldsymbol{w}^T\boldsymbol{x}_i\right)^2 - \frac{1}{2}\log 2\pi\sigma^2 \right] - \frac{1}{2\sigma_w^2}\left\lVert\boldsymbol{w}\right\rVert^2 - \frac{d}{2}\log 2\pi\sigma_w^2$$

$$\frac{\partial}{\partial\boldsymbol{w}} \log\left[P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})\right] = \frac{1}{\sigma^2}\sum_i \left(y_i - \boldsymbol{w}^T\boldsymbol{x}_i\right)\boldsymbol{x}_i^T - \frac{1}{\sigma_w^2}\boldsymbol{w}^T \stackrel{!}{=} 0$$

$$\boldsymbol{w}_{\mathrm{MAP}} = \left(\sum_i \boldsymbol{x}_i\boldsymbol{x}_i^T + \frac{\sigma^2}{\sigma_w^2}\boldsymbol{I}\right)^{-1} \sum_i y_i\boldsymbol{x}_i$$

$$\boldsymbol{w}_{\mathrm{MAP}} = \left(\boldsymbol{X}\boldsymbol{X}^T + \lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}\boldsymbol{y}, \qquad \lambda := \frac{\sigma^2}{\sigma_w^2}$$

Special name: ridge regression.
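The ridge closed form differs from plain least squares only by the $\lambda\boldsymbol{I}$ term; a sketch with made-up data and an illustrative $\lambda$, showing the shrinkage effect of the prior:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data; columns of X are samples
N, d = 30, 5
X = rng.normal(size=(d, N))
y = X.T @ rng.normal(size=d) + rng.normal(0.0, 0.2, size=N)

lam = 0.5  # lambda = sigma^2 / sigma_w^2, illustrative value

# w_MAP = (X X^T + lambda I)^{-1} X y
w_map = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# lambda -> 0 recovers the maximum likelihood / least squares solution
w_ml = np.linalg.solve(X @ X.T, X @ y)

# The Gaussian prior shrinks the solution towards zero
print(np.linalg.norm(w_map) < np.linalg.norm(w_ml))  # True
```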


Ridge Regression

• The parameter $\lambda$ needs to be adapted to the problem
  • Typically through cross-validation: optimization on test/validation data
  • Rarely through "real" prior knowledge

[Figure: regularization too weak / desired regularization / too strong. Figs: Bishop PRML, 2006]

Bayesian Linear Regression

We still only select a single solution. A probably better alternative would be to consider all of them, in a proper way of averaging.

• Compare to logistic regression with many decision planes: we discussed averaging only conceptually. How to actually do it?
• Conceptual framework: Bayesian inference. It defines the proper way of averaging: marginalization
• Bayesian linear regression is a nice application example which is still fully tractable and illustrates the concept very well. Bayesian methods tend to become intractable for more complex models

Bayesian Inference for Regression

Classification: average many possible decision planes.
Regression: average many possible regression lines.

[Figs: Bishop PRML, 2006]

Probabilistic Setup

• The MAP estimate can easily be extended to a full Bayesian treatment. Instead of taking only the maximum, we use the whole distribution of $\boldsymbol{w}$:
$$\boldsymbol{w}_{\mathrm{MAP}} = \operatorname*{argmax}_{\boldsymbol{w}} P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})$$
$$P(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) \propto P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})$$
$$P(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) = \frac{P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})}{\int P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})\, \mathrm{d}\boldsymbol{w}}$$

This is a distribution over values of $\boldsymbol{w}$. This interpretation makes $\boldsymbol{w}$ a random variable!

Posterior of the Parameter

• Calculation of the posterior of our parameter $\boldsymbol{w}$: $P(\boldsymbol{w} \mid \boldsymbol{X}, Y)$
• Application of Bayes' rule:
$$P(\boldsymbol{w} \mid \boldsymbol{X}, Y) = \frac{P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})}{\int P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})\, \mathrm{d}\boldsymbol{w}}$$

The normalization measures how likely the dataset is on average, considering all values of $\boldsymbol{w}$: the marginal likelihood $P(Y \mid \boldsymbol{X})$.

The prior $P(\boldsymbol{w})$ expresses the assumptions we hold about $\boldsymbol{w}$ before seeing data.

The likelihood $P(Y \mid \boldsymbol{X}, \boldsymbol{w})$ measures how likely the dataset is for a single value of $\boldsymbol{w}$.

The posterior $P(\boldsymbol{w} \mid \boldsymbol{X}, Y)$ expresses the certainty we have about a specific value of $\boldsymbol{w}$, considering the data and our prior assumptions.

Posterior of the Parameter

We now have the posterior distribution instead of a single best value. It contains our knowledge about the compatibility of all possible solutions with our data and assumptions.

• What is it good for? It expresses our certainty about all possible solutions, a "rating" for each solution. Single maximum? Peaked? Broad? Valuable information. System integration: down-stream methods can account for regression uncertainty
• What to do with it? We can use all this information to make more informed predictions. An analysis (e.g. of risk factors) of $\boldsymbol{w}$ has more information available

[Figure: prior $P(\boldsymbol{w})$ vs. posterior $P(\boldsymbol{w} \mid D)$]

Bayesian Inference

Training data $D = (\boldsymbol{X}, \boldsymbol{y})$:
$$P(\boldsymbol{w} \mid D) = \frac{1}{Z}\, P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w}) = \frac{1}{Z} \prod_i \mathcal{N}\!\left(y_i \mid \boldsymbol{w}^T\boldsymbol{x}_i, \sigma^2\right) \mathcal{N}\!\left(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I}\right)$$
$$= \frac{1}{Z'} \exp\!\left( -\frac{1}{2\sigma^2}\left\lVert\boldsymbol{X}^T\boldsymbol{w} - \boldsymbol{y}\right\rVert^2 - \frac{1}{2\sigma_w^2}\left\lVert\boldsymbol{w}\right\rVert^2 \right)$$

The posterior is again a Gaussian!
$$P(\boldsymbol{w} \mid D) = \mathcal{N}\!\left(\boldsymbol{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}\right), \qquad \boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}\boldsymbol{X}\boldsymbol{y}, \qquad \boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma^2}\boldsymbol{X}\boldsymbol{X}^T + \frac{1}{\sigma_w^2}\boldsymbol{I}$$
$$\boldsymbol{\mu} = \boldsymbol{w}_{\mathrm{MAP}}$$

Bishop, PRML, section 3.3.1, p. 152-156 (eq. 3.49-3.54), Springer 2006
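The posterior parameters translate directly into a few lines of NumPy; a sketch with made-up data and illustrative noise levels, checking that the posterior mean coincides with the MAP/ridge solution:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data (columns of X are samples) and illustrative noise levels
N, d = 40, 4
X = rng.normal(size=(d, N))
y = X.T @ rng.normal(size=d) + rng.normal(0.0, 0.3, size=N)
sigma, sigma_w = 0.3, 1.0

# Sigma^{-1} = X X^T / sigma^2 + I / sigma_w^2,  mu = Sigma X y / sigma^2
Sigma_inv = X @ X.T / sigma**2 + np.eye(d) / sigma_w**2
Sigma = np.linalg.inv(Sigma_inv)
mu = Sigma @ X @ y / sigma**2

# mu equals the MAP / ridge estimate with lambda = sigma^2 / sigma_w^2
lam = sigma**2 / sigma_w**2
w_map = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
print(np.allclose(mu, w_map))  # True
```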


Posterior of Linear Regression

$$P(\boldsymbol{w} \mid D) = \mathcal{N}\!\left(\boldsymbol{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}\right), \qquad \boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}\boldsymbol{X}\boldsymbol{y}, \qquad \boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma^2}\boldsymbol{X}\boldsymbol{X}^T + \frac{1}{\sigma_w^2}\boldsymbol{I}$$

[Figure: evolution of the posterior with growing training data: no data, N=1, N=2, N=19. Figs: Bishop PRML, 2006]

Predictive Distribution

How do we predict a label for a new data point? We now have very many solutions and know how well each one fits our training data and our prior assumptions.

• Prediction is probabilistic (a posterior for the prediction/classification)
• The prediction should include all our knowledge about possible solutions (it should "average" over parameter values): $P(y \mid \boldsymbol{x}, D)$
• We only have a prediction for a single value of $\boldsymbol{w}$: $P(y \mid \boldsymbol{x}, \boldsymbol{w})$
• Averaging should respect the different quality of each $\boldsymbol{w}$: $P(\boldsymbol{w} \mid D)$. Bad solutions should not contribute, while we want to focus on good ones

Predictive Distribution (II)

Example: polynomial fit (basis functions)
• Blue: data points
• Green line: generating process / ground truth
• Red line: best fit to the blue data points
• Shaded red: region of probable prediction

This tells us about the outcome's certainty!

Figs: Bishop PRML, 2006

Predictive Distribution: Calculation

The averaging method is called marginalization:
$$P(y \mid \boldsymbol{x}, D) = \int P(y \mid \boldsymbol{x}, \boldsymbol{w})\, P(\boldsymbol{w} \mid D)\, \mathrm{d}\boldsymbol{w} \qquad \text{(predictive distribution)}$$
$$P(y \mid \boldsymbol{x}, D) = \int \mathcal{N}\!\left(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2\right) \mathcal{N}\!\left(\boldsymbol{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}\right) \mathrm{d}\boldsymbol{w}$$
$$P(y \mid \boldsymbol{x}, D) = \mathcal{N}\!\left(y \mid \boldsymbol{\mu}^T\boldsymbol{x},\; \sigma^2 + \boldsymbol{x}^T\boldsymbol{\Sigma}\boldsymbol{x}\right), \qquad \boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}\boldsymbol{X}\boldsymbol{y}, \qquad \boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma^2}\boldsymbol{X}\boldsymbol{X}^T + \frac{1}{\sigma_w^2}\boldsymbol{I}$$

The expected/best prediction is still linear.

Bishop, PRML, section 3.3.2, p. 156 (eq. 3.57-3.59), Springer 2006
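The closed-form predictive mean and variance can be checked against a Monte Carlo average over posterior samples of $\boldsymbol{w}$; a sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up training data (columns of X are samples) and illustrative noise levels
N, d = 40, 3
X = rng.normal(size=(d, N))
y = X.T @ rng.normal(size=d) + rng.normal(0.0, 0.3, size=N)
sigma, sigma_w = 0.3, 1.0

Sigma = np.linalg.inv(X @ X.T / sigma**2 + np.eye(d) / sigma_w**2)
mu = Sigma @ X @ y / sigma**2

# Closed-form predictive distribution at a new point x
x_new = rng.normal(size=d)
pred_mean = mu @ x_new
pred_var = sigma**2 + x_new @ Sigma @ x_new

# Monte Carlo marginalization: sample w ~ P(w|D), then y = w^T x + eps
M = 200_000
w_samples = rng.multivariate_normal(mu, Sigma, size=M)
y_samples = w_samples @ x_new + rng.normal(0.0, sigma, size=M)
print(abs(y_samples.mean() - pred_mean), abs(y_samples.var() - pred_var))  # both small
```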


Predictive Distribution: Result

• The prediction mean is linear: $\boldsymbol{\mu}^T\boldsymbol{x}$
• The prediction variance is a quadratic function: $\sigma^2 + \boldsymbol{x}^T\boldsymbol{\Sigma}\boldsymbol{x}$

The prediction now includes a quality estimate together with the actual prediction!
• The quality is higher where we have more data
• The certainty never increases beyond our observations' uncertainty $\sigma$

Uncertainty

• We calculated many probabilities. How are they to be interpreted? They are sometimes contradictory: why does the distribution change when we have more data? Shouldn't there be a real distribution $P(\boldsymbol{w})$?
• Bayesian inference relies on a subjective perspective: probability is used to express our current knowledge. It can change when we learn or see more: with more data, we are more certain about our result.
• Not subjective in the sense that it is arbitrary! There are quantitative rules to follow mathematically
• Probability expresses an observer's certainty, often called belief

Subjectivity: there is no single, real underlying distribution. A probability distribution expresses our knowledge. It is different in different situations and for different observers, since they have different knowledge.

Bayesian Inference

Bayesian inference is the mathematical tool to calculate changes in certainty when the underlying knowledge changes through observations: belief dynamics, belief updates.

Evolution of beliefs by conditioning on data according to Bayes' rule:
$$P(x) \to P(x \mid D), \qquad P(x \mid D) = \frac{P(D \mid x)\, P(x)}{P(D)}$$
$$P(x) \to P(x \mid D_1) \to P(x \mid D_2) \to \cdots$$

Conditioning is done with a likelihood model: how can the data be explained?
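The belief-update chain can be demonstrated with the regression posterior itself: conditioning on the training points one at a time yields exactly the same posterior as conditioning on all of them at once. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data (columns of X are samples) and illustrative noise levels
d, N = 3, 20
X = rng.normal(size=(d, N))
y = X.T @ rng.normal(size=d) + rng.normal(0.0, 0.3, size=N)
sigma, sigma_w = 0.3, 1.0

# Start from the prior P(w) = N(0, sigma_w^2 I): precision I / sigma_w^2, mean 0
prec = np.eye(d) / sigma_w**2   # running Sigma^{-1}
b = np.zeros(d)                 # running Sigma^{-1} mu

# P(w) -> P(w|D1) -> P(w|D1,D2) -> ...: one Bayes update per observation
for i in range(N):
    x_i = X[:, i]
    prec += np.outer(x_i, x_i) / sigma**2
    b += y[i] * x_i / sigma**2

mu_seq = np.linalg.solve(prec, b)

# Batch posterior mean for comparison
Sigma_inv = X @ X.T / sigma**2 + np.eye(d) / sigma_w**2
mu_batch = np.linalg.solve(Sigma_inv, X @ y / sigma**2)
print(np.allclose(mu_seq, mu_batch))  # True
```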


Kernel Regression

A non-linear extension is possible and powerful:

• Kernels and sample-space expansion: kernel regression (expansion & kernel trick)
$$\boldsymbol{w}^T\boldsymbol{x} = \sum_i \alpha_i\, \boldsymbol{x}_i^T\boldsymbol{x} \qquad\Rightarrow\qquad y = \sum_i \alpha_i\, k(\boldsymbol{x}_i, \boldsymbol{x})$$
• With regularization (MAP with a Gaussian prior): kernel ridge regression. Least squares solution:
$$\boldsymbol{\alpha}^* = \left(\boldsymbol{K} + \lambda\boldsymbol{I}\right)^{-1}\boldsymbol{y}, \qquad K_{ij} = k(\boldsymbol{x}_i, \boldsymbol{x}_j)$$
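A minimal kernel ridge regression sketch; the RBF kernel and all hyperparameter values are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(6)

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Made-up 1d data from a non-linear target function
X = np.linspace(0, 2 * np.pi, 40)[:, None]
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, size=40)

lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # alpha* = (K + lambda I)^{-1} y

# Prediction: y(x) = sum_i alpha_i k(x_i, x)
X_test = np.linspace(0, 2 * np.pi, 100)[:, None]
y_pred = rbf_kernel(X_test, X) @ alpha
print(np.abs(y_pred - np.sin(X_test[:, 0])).max())  # small: the fit follows sin(x)
```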


Kernel Ridge Regression

• Non-linear fitting with a proper kernel

[Figure: KRR (kernel ridge regression) vs. GPR (Gaussian process regression), from scikit-learn.org, Jan Hendrik Metzen]

Gaussian Process

• $P(\boldsymbol{w})$ actually describes a distribution over functions: every single value of $\boldsymbol{w}$ defines a linear function:
$$f_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x}, \qquad P(\boldsymbol{w}) \to P(f_{\boldsymbol{w}})$$
• Gaussian process: a Gaussian distribution over functions. It directly models the distribution of our regression functions $f$. Gaussian: they have a mean $\mu(\boldsymbol{x})$ and a covariance $k(\boldsymbol{x}, \boldsymbol{x}')$:
$$f(\boldsymbol{x}) \sim GP\!\left(\mu(\boldsymbol{x}), k(\boldsymbol{x}, \boldsymbol{x}')\right)$$
• Covariance functions are essentially the same thing as a kernel: they specify a similarity measure between points $\boldsymbol{x}$ ("similar" → high correlation)
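That $P(\boldsymbol{w})$ induces a distribution over functions can be seen by sampling: each draw of $\boldsymbol{w}$ is one linear function, and for this linear model the induced covariance works out to $k(x, x') = \sigma_w^2 (1 + x x')$. A sketch in 1d ($\sigma_w$ chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

sigma_w = 1.0
xs = np.linspace(-3, 3, 7)

# Each sample w = (w0, w1) ~ N(0, sigma_w^2 I) defines one function f_w(x) = w0 + w1 x
W = rng.normal(0.0, sigma_w, size=(5, 2))
F = W[:, [0]] + W[:, [1]] * xs[None, :]   # 5 sampled functions evaluated on the grid

# Empirical covariance of f(x) over many draws vs. the induced kernel
W_many = rng.normal(0.0, sigma_w, size=(200_000, 2))
F_many = W_many[:, [0]] + W_many[:, [1]] * xs[None, :]
emp_cov = np.cov(F_many.T)
theory = sigma_w**2 * (1.0 + np.outer(xs, xs))
print(np.abs(emp_cov - theory).max())  # small
```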


Gaussian Process (II)

• Fully Bayesian treatment of many problems, thanks to the Gaussian structure: closed-form solutions are available (this usually needs Gaussian likelihoods as well)
• Due to the use of covariance functions (~kernels), this is also true for very complex non-linear models
• A very powerful framework:
  • Non-linear Bayesian regression for machine learning
  • E.g. full shape models which combine statistical approaches ("PCA") with more general assumptions, like "smoothness"

Gaussian Process: Shapes

Lüthi, Marcel, et al. "Gaussian Process Morphable Models." arXiv preprint arXiv:1603.07254 (2016).
http://shapemodelling.cs.unibas.ch/
https://www.futurelearn.com/courses/statistical-shape-modelling

Bayesian Model Selection

• Bayesian methods average over whole model classes $M$, e.g. over all $\boldsymbol{w}$ values for a given polynomial degree $M$
• The marginal likelihood captures the average fit of a model $M$ to given data $D$; it is the evidence for a given model class:
$$P(D \mid M) = \int P(D \mid \boldsymbol{w}, M)\, P(\boldsymbol{w} \mid M)\, \mathrm{d}\boldsymbol{w}$$
• Different models can be compared with respect to their marginal likelihoods: find models which fit a dataset well on average
• Model selection: select the best one
• Model "averaging": predict with a weighted average over all models
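For the linear-Gaussian model of this lecture, the evidence is available in closed form: $\boldsymbol{y} = \boldsymbol{X}^T\boldsymbol{w} + \boldsymbol{\varepsilon}$ with $\boldsymbol{w} \sim \mathcal{N}(0, \sigma_w^2\boldsymbol{I})$ makes $\boldsymbol{y}$ jointly Gaussian, $P(\boldsymbol{y} \mid \boldsymbol{X}) = \mathcal{N}(\boldsymbol{y} \mid 0, \sigma_w^2\boldsymbol{X}^T\boldsymbol{X} + \sigma^2\boldsymbol{I})$. A sketch with made-up data, cross-checked through the exact Bayes-rule identity $\log P(\boldsymbol{y}|\boldsymbol{X}) = \log P(\boldsymbol{y}|\boldsymbol{X},\boldsymbol{w}) + \log P(\boldsymbol{w}) - \log P(\boldsymbol{w}|\boldsymbol{y},\boldsymbol{X})$, which holds at any $\boldsymbol{w}$, here evaluated at $\boldsymbol{w} = \boldsymbol{\mu}$:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(8)

# Made-up data (columns of X are samples) and illustrative noise levels
d, N = 2, 10
X = rng.normal(size=(d, N))
y = X.T @ rng.normal(size=d) + rng.normal(0.0, 0.5, size=N)
sigma, sigma_w = 0.5, 1.0

# Closed-form evidence: y ~ N(0, sigma_w^2 X^T X + sigma^2 I)
cov_y = sigma_w**2 * X.T @ X + sigma**2 * np.eye(N)
log_evidence = multivariate_normal(mean=np.zeros(N), cov=cov_y).logpdf(y)

# Cross-check via Bayes rule at w = mu:
# log P(y|X) = log P(y|X,w) + log P(w) - log P(w|y,X)
Sigma = np.linalg.inv(X @ X.T / sigma**2 + np.eye(d) / sigma_w**2)
mu = Sigma @ X @ y / sigma**2
log_lik = norm(loc=X.T @ mu, scale=sigma).logpdf(y).sum()
log_prior = multivariate_normal(mean=np.zeros(d), cov=sigma_w**2 * np.eye(d)).logpdf(mu)
log_post = multivariate_normal(mean=mu, cov=Sigma).logpdf(mu)
print(np.isclose(log_evidence, log_lik + log_prior - log_post))  # True
```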


Bayesian Model Selection (II)

• The marginal likelihood is a normalized distribution over datasets: it measures the degree of fit to the data vs. the complexity of the model. The normalization has a natural regularization effect: in complex models, many datasets are very likely, because there is a suitable parameter which explains the data well (e.g. high-degree polynomials). But if there are many datasets with a high likelihood, the likelihood of an individual dataset is rather low, because the marginal likelihood is normalized.

[Figure: evidence over datasets for $M_1$ (simple model), $M_2$ (intermediate model), $M_3$ (complex model); the "area" under each curve is always 1. Figs: Bishop PRML, 2006]

Bayesian Model Selection (III)

• Evidence for the polynomial example:

[Figure: model evidence (log) as a function of the polynomial degree. Figs: Bishop PRML, 2006]

Summary: Regression

• Regression: machine learning with a continuous label
• Least squares regression: ML estimation. Corresponds to a Gaussian observation error
• Regularization: MAP estimation
  • Reduces overfitting
  • Regularized least squares: ridge regression
• Bayesian regression
  • Posterior distribution of regression models: $P(\boldsymbol{w} \mid D)$
  • Average all models for a prediction: predictive distribution $P(y \mid \boldsymbol{x}, D)$
  • Uncertainty treatment with Bayesian inference: belief updates

Summary: Probabilistic Methods (I)

[Diagram: methods arranged by task (classification vs. regression) and type (probabilistic vs. discriminative): Bayes classifier and naïve Bayes; logistic regression; Bayesian regression, least squares regression and Gaussian process; SVM, decision tree, ANN, perceptron as discriminative methods. Bold: part of this block; italic: not in this lecture.]

Summary: Probabilistic Methods (II)

• Build & construct a model according to idea and concept: $P(y \mid \boldsymbol{x}, \boldsymbol{w})$
• Estimate parameters: maximum likelihood
  • The idiomatic probabilistic way of learning
• Realize the shortcomings of the result due to a lack of data
• Estimation with prior knowledge: MAP estimation
  • MAP includes our knowledge about the problem into the estimation
• Full Bayesian treatment
  • Express certainty by considering all possible solutions
  • Weighted averaging: the weight is the degree of fit with the training data

Summary: Probabilistic Methods (III)

Overview of the three methods (naïve Bayes with bag-of-words, logistic regression, linear regression):

Model:
• Naïve Bayes: $P(y \mid \text{words}) \propto \prod_w P(w \mid y)\, P(y)$ with $P(w \mid y) = h_{w,y}$
• Logistic regression: $P(y \mid \boldsymbol{x}) = \dfrac{1}{1 + \exp\!\left(-\boldsymbol{w}^T\boldsymbol{x}\right)}$
• Linear regression: $P(y \mid \boldsymbol{x}) = \mathcal{N}\!\left(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2\right)$

ML estimate:
• Naïve Bayes: $h_w = \dfrac{N_w}{\sum_{w'} N_{w'}}$
• Logistic regression: $\sum_i \left(y_i - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i + w_0)\right)\boldsymbol{x}_i^T \stackrel{!}{=} 0$, solved by iterative reweighted least squares
• Linear regression: $\boldsymbol{w}_{\mathrm{ML}} = \left(\boldsymbol{X}\boldsymbol{X}^T\right)^{-1}\boldsymbol{X}\boldsymbol{y}$

Shortcoming of ML:
• Naïve Bayes: unseen words, zero counts
• Logistic regression: separable data, infinite certainty
• Linear regression: underdetermined solution & overfitting

Prior knowledge:
• Naïve Bayes: pseudocount, each word already seen once
• Logistic regression: small $\boldsymbol{w}$, minimal influence of a feature: $P(\boldsymbol{w}) = \mathcal{N}\!\left(\boldsymbol{w} \mid 0, \sigma^2\boldsymbol{I}\right)$
• Linear regression: small $\boldsymbol{w}$, minimal influence of a feature: $P(\boldsymbol{w}) = \mathcal{N}\!\left(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I}\right)$

MAP estimate:
• Naïve Bayes: $h_w = \dfrac{N_w + 1}{\sum_{w'} (N_{w'} + 1)}$
• Logistic regression: $\sum_i \left(y_i - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i + w_0)\right)\boldsymbol{x}_i^T - \dfrac{1}{\sigma^2}\boldsymbol{w}^T \stackrel{!}{=} 0$, solved by iterative reweighted least squares
• Linear regression: $\boldsymbol{w}_{\mathrm{MAP}} = \left(\boldsymbol{X}\boldsymbol{X}^T + \lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}\boldsymbol{y}$

Bayes ($P(\boldsymbol{w} \mid D)$, $P(y \mid \boldsymbol{x}, D)$):
• Naïve Bayes: (did not discuss) Latent Dirichlet Allocation
• Logistic regression: average the classifiers resulting from all possible classification hyperplanes
• Linear regression: average over all $\boldsymbol{w}$, weighted by their performance of explaining our training data: a Gaussian with linear mean and quadratic covariance