
UVA CS 4501-001 / 6501-007: Introduction to Machine Learning and Data Mining

Lecture 16: Generative vs. Discriminative / K-Nearest-Neighbor Classifier / LOOCV

Yanjun Qi / Jane, PhD
University of Virginia, Department of Computer Science


Where are we? → Five major sections of this course

• Regression (supervised)
• Classification (supervised)
• Unsupervised models
• Learning theory
• Graphical models


Where are we? → Three major sections for classification

• We can divide the large variety of classification approaches into roughly three major types:

1. Discriminative: directly estimate a decision rule/boundary
   – e.g., logistic regression, support vector machine, decision tree
2. Generative: build a generative statistical model
   – e.g., naïve Bayes classifier, Bayesian networks
3. Instance-based classifiers: use the observations directly (no model)
   – e.g., k-nearest neighbors


A Dataset for classification

• Data/points/instances/examples/samples/records: [rows]
• Features/attributes/dimensions/independent variables/covariates/predictors/regressors: [columns, except the last]
• Target/outcome/response/label/dependent variable: special column to be predicted [last column]


Output as Discrete Class Label

C1, C2, …, CL

Generative: $\arg\max_C P(C \mid X) = \arg\max_C P(X, C) = \arg\max_C P(X \mid C)\, P(C)$

Discriminative: model $P(C \mid X)$ directly, where $C = c_1, \cdots, c_L$


Multinomial Naïve Bayes as Stochastic Language Models

word     Model C1   Model C2
the      0.2        0.2
boy      0.01       0.0001
said     0.0001     0.03
likes    0.0001     0.02
black    0.0001     0.1
dog      0.0005     0.01
garden   0.01       0.0001


Scoring the sentence s = "the boy likes the dog" under Model C1:

P(s | C1) = 0.2 × 0.01 × 0.0001 × 0.2 × 0.0005   (multiply all five per-word terms)

Generative decision rule: label the sentence C2 when P(s | C2) P(C2) > P(s | C1) P(C1).

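A minimal sketch of this generative scoring in Python, using the word probabilities from the table above (the 0.5 priors are an assumption for illustration; the slide leaves P(C) unspecified):

```python
# Word-probability tables from the slide (multinomial naive Bayes "language models")
model_c1 = {"the": 0.2, "boy": 0.01, "said": 0.0001, "likes": 0.0001,
            "black": 0.0001, "dog": 0.0005, "garden": 0.01}
model_c2 = {"the": 0.2, "boy": 0.0001, "said": 0.03, "likes": 0.02,
            "black": 0.1, "dog": 0.01, "garden": 0.0001}

def score(sentence, model, prior=0.5):
    """P(s|C) P(C): multiply the per-word probabilities, times the class prior."""
    p = prior
    for word in sentence.split():
        p *= model[word]
    return p

s = "the boy likes the dog"
print(score(s, model_c1))  # 1e-11
print(score(s, model_c2))  # 4e-10 -> larger, so predict C2
```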

[Figure: logistic curve of P(C = 1 | X) against x, rising from 0 to 1; e.g., probability of disease]


Discriminative: logistic regression models a binary target variable coded 0/1.

Logistic function:

$$P(c = 1 \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

Logit function:

$$\ln\!\left[\frac{P(c = 1 \mid x)}{P(c = 0 \mid x)}\right] = \ln\!\left[\frac{P(c = 1 \mid x)}{1 - P(c = 1 \mid x)}\right] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

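A small numeric check of this logistic/logit pair (the α and β values are hypothetical, chosen only for illustration):

```python
import math

alpha, beta = -1.0, 2.0  # hypothetical coefficients

def p_c1(x):
    """Logistic function: P(c=1|x) = e^(alpha+beta*x) / (1 + e^(alpha+beta*x))."""
    z = alpha + beta * x
    return math.exp(z) / (1.0 + math.exp(z))

x = 0.7
p = p_c1(x)
logit = math.log(p / (1.0 - p))   # the log odds
print(logit, alpha + beta * x)    # equal up to float error: the logit is linear in x
```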


Binary Logistic Regression

In summary, logistic regression tells us two things at once:
• Transformed, the "log odds" (logit) are linear in x.
• The probability P(Y = 1 | x) follows the logistic distribution.

[Figure: left, ln[p/(1-p)] as a linear function of x; right, the S-shaped curve of P(Y = 1 | x) against x]

This means we model the target variable with a Bernoulli distribution whose parameter is p = P(y = 1 | x).

Odds = p / (1 − p)


Today: Relevant classifiers / KNN / LOOCV

✓ Logistic regression (cont.)
✓ Gaussian Naïve Bayes Classifier
✓ K-nearest neighbor
✓ LOOCV

   


Multinomial Logistic Regression Model

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}, \qquad k = 1, \ldots, K - 1$$

$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$$

The method directly models the posterior probabilities as the output of regression.

Note that the class boundaries are linear.

x is a p-dimensional input vector; β_k is a p-dimensional vector for each k; the total number of parameters is (K − 1)(p + 1).


MLE for Logistic Regression Training

Let's fit the logistic regression model for K = 2, i.e., the number of classes is 2.

Training set: (x_i, y_i), i = 1, …, N

Log-likelihood:

$$\begin{aligned} \ell(\beta) &= \sum_{i=1}^{N} \log \Pr(Y = y_i \mid X = x_i) \\ &= \sum_{i=1}^{N} \Big[ y_i \log \Pr(Y = 1 \mid X = x_i) + (1 - y_i) \log \Pr(Y = 0 \mid X = x_i) \Big] \\ &= \sum_{i=1}^{N} \Big[ y_i \log \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} + (1 - y_i) \log \frac{1}{1 + \exp(\beta^T x_i)} \Big] \\ &= \sum_{i=1}^{N} \Big[ y_i \beta^T x_i - \log\big(1 + \exp(\beta^T x_i)\big) \Big] \end{aligned}$$

We want to maximize the log-likelihood in order to estimate β.

x_i is a (p+1)-dimensional input vector with leading entry 1; β is a (p+1)-dimensional vector; y_i = 1 if C_i = 1 and y_i = 0 if C_i = 0.

For the Bernoulli distribution: $p(y \mid x) = p^y (1 - p)^{1 - y}$

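A direct transcription of the final expression above, as a sketch (assumes a design matrix X whose first column is all 1s):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))
```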


Newton-Raphson for LR (optional)

Setting the gradient to zero:

$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right) = 0$$

(p+1) non-linear equations to solve for (p+1) unknowns.

Solve by the Newton-Raphson method:

$$\beta^{new} \leftarrow \beta^{old} - \left[ \frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta^T} \right]^{-1} \frac{\partial \ell(\beta)}{\partial \beta}$$

where

$$\frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta^T} = -\sum_{i=1}^{N} x_i x_i^T \underbrace{\left( \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right)}_{p(x_i;\,\beta)} \underbrace{\left( \frac{1}{1 + \exp(\beta^T x_i)} \right)}_{1 - p(x_i;\,\beta)}$$


Newton-Raphson for LR (optional)

In matrix form:

$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right) = X^T (y - p)$$

$$\frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta^T} = -X^T W X$$

So the NR rule becomes:

$$\beta^{new} \leftarrow \beta^{old} + (X^T W X)^{-1} X^T (y - p)$$

where
• X : the N × (p+1) matrix of input vectors x_i^T
• y : the N × 1 vector of labels y_i
• p : the N × 1 vector of fitted probabilities, with i-th entry p(x_i; β^{old}) = exp(β^{old T} x_i) / (1 + exp(β^{old T} x_i))
• W : the N × N diagonal matrix with i-th diagonal entry p(x_i; β^{old}) (1 − p(x_i; β^{old}))


Newton-Raphson for LR…


• Newton-Raphson step, re-expressed:

$$\beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p) = (X^T W X)^{-1} X^T W \big( X \beta^{old} + W^{-1} (y - p) \big) = (X^T W X)^{-1} X^T W z$$

• Adjusted response:

$$z = X \beta^{old} + W^{-1} (y - p)$$

• Iteratively reweighted least squares (IRLS):

$$\beta^{new} \leftarrow \arg\min_{\beta}\; (z - X\beta)^T W (z - X\beta)$$

This re-expresses the Newton step as a weighted least squares step.

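A compact sketch of the Newton-Raphson/IRLS loop above (toy data with hypothetical values; not a production implementation):

```python
import numpy as np

def irls_logistic(X, y, n_iter=20):
    """Fit binary logistic regression: beta_new = beta_old + (X^T W X)^{-1} X^T (y - p)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities p(x_i; beta)
        w = p * (1.0 - p)                     # diagonal of W
        H = X.T @ (w[:, None] * X)            # X^T W X
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta

# toy usage
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y = (X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=100) > 0).astype(float)
print(irls_logistic(X, y))                    # roughly recovers the generating weights
```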

Today: Relevant classifiers / KNN / LOOCV

✓ Logistic regression (cont.)
✓ Gaussian Naïve Bayes Classifier
  – Gaussian distribution
  – Gaussian NBC
  – LDA, QDA
  – Discriminative vs. Generative
✓ K-nearest neighbor
✓ LOOCV

   


The Gaussian Distribution

Courtesy: http://research.microsoft.com/~cmbishop/PRML/index.htm

[Figure: Gaussian density surface, with the mean and covariance matrix indicated]


Multivariate Gaussian Distribution

• A multivariate Gaussian model: x ~ N(µ, Σ), where µ is the mean vector and Σ is the covariance matrix. For p = 2:

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \mathrm{var}(x_1) & \mathrm{cov}(x_1, x_2) \\ \mathrm{cov}(x_1, x_2) & \mathrm{var}(x_2) \end{pmatrix}$$

• The covariance matrix captures linear dependencies among the variables.


MLE Estimation for Multivariate Gaussian

• We can fit statistical models by maximizing the probability/likelihood of generating the observed samples:

$$L(x_1, \ldots, x_n \mid \Theta) = p(x_1 \mid \Theta) \cdots p(x_n \mid \Theta)$$

(the samples are assumed to be independent)

• In the Gaussian case, we simply set the mean and the variance to the sample mean and the sample variance:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

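A quick sketch of these MLE estimates on a sampled multivariate dataset (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 2.0],
                            cov=[[1.0, 0.5], [0.5, 2.0]], size=500)

mu_hat = X.mean(axis=0)                              # MLE of mu: the sample mean
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)   # MLE of Sigma (divide by n, not n-1)
print(mu_hat, Sigma_hat)                             # close to the generating parameters
```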

     


Probabilistic Interpretation of Linear Regression

• Let us assume that the target variable and the inputs are related by the equation

$$y_i = \theta^T x_i + \varepsilon_i$$

where ε is an error term of unmodeled effects or random noise.

• Now assume that ε follows a Gaussian N(0, σ); then we have:

$$p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)$$

• By the independence (among samples) assumption:

$$L(\theta) = \prod_{i=1}^{n} p(y_i \mid x_i; \theta) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{n} \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \theta^T x_i)^2}{2\sigma^2} \right)$$


Probabilistic Interpretation of Linear Regression (cont.)

• Hence the log-likelihood is:

$$\ell(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$

• Do you recognize the last term? Yes, it is:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$

• Thus, under the independence assumption, minimizing the residual sum of squares is equivalent to MLE of θ!

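A quick numeric check of this equivalence (hypothetical data): the least-squares fit is exactly the MLE of θ under the Gaussian noise model.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.c_[np.ones(50), rng.normal(size=50)]
theta_true = np.array([1.0, -3.0])
y = X @ theta_true + 0.1 * rng.normal(size=50)   # y_i = theta^T x_i + eps_i

# Minimizing J(theta) (least squares) == maximizing l(theta) (Gaussian MLE)
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                                  # close to theta_true
```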

Today: Relevant classifiers / KNN / LOOCV

✓ Logistic regression (cont.)
✓ Gaussian Naïve Bayes Classifier
  – Gaussian distribution
  – Gaussian NBC
  – LDA, QDA
  – Discriminative vs. Generative
✓ K-nearest neighbor
✓ LOOCV

   


Gaussian Naïve Bayes Classifier  


$$\arg\max_C P(C \mid X) = \arg\max_C P(X, C) = \arg\max_C P(X \mid C)\, P(C)$$

Naïve Bayes classifier factorization:

$$\begin{aligned} P(X \mid C) &= P(X_1, X_2, \cdots, X_p \mid C) \\ &= P(X_1 \mid X_2, \cdots, X_p, C)\, P(X_2, \cdots, X_p \mid C) \\ &= P(X_1 \mid C)\, P(X_2, \cdots, X_p \mid C) \\ &= P(X_1 \mid C)\, P(X_2 \mid C) \cdots P(X_p \mid C) \end{aligned}$$

Gaussian model for each continuous attribute:

$$\hat{P}(X_j \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\left( -\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2} \right)$$

µ_ji : mean (average) of the values of attribute X_j over examples for which C = c_i
σ_ji : standard deviation of the values of attribute X_j over examples for which C = c_i


Gaussian Naïve Bayes Classifier

• Continuous-valued input attributes
  – The conditional probability is modeled with the normal distribution:

$$\hat{P}(X_j \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\left( -\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2} \right)$$

  – Learning phase: for X = (X_1, ···, X_p) and C = c_1, ···, c_L, output p × L normal distributions and the priors P(C = c_i), i = 1, ···, L
  – Test phase: for an unseen instance X′ = (X′_1, ···, X′_p),
    • calculate the conditional probabilities with all the normal distributions
    • apply the MAP rule to make a decision

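A minimal Gaussian naive Bayes sketch following this learning/test recipe (toy implementation; no variance smoothing, so zero-variance attributes would break it):

```python
import numpy as np

class GaussianNB:
    def fit(self, X, y):
        """Learning phase: one (mean, std) per attribute per class, plus class priors."""
        self.classes = np.unique(y)
        self.mu = {c: X[y == c].mean(axis=0) for c in self.classes}
        self.sigma = {c: X[y == c].std(axis=0) for c in self.classes}
        self.prior = {c: np.mean(y == c) for c in self.classes}
        return self

    def predict(self, X):
        """Test phase: evaluate each class's Gaussians, multiply (naive), apply MAP."""
        preds = []
        for x in X:
            scores = {}
            for c in self.classes:
                density = np.exp(-(x - self.mu[c]) ** 2 / (2 * self.sigma[c] ** 2)) \
                          / (np.sqrt(2 * np.pi) * self.sigma[c])
                scores[c] = self.prior[c] * density.prod()
            preds.append(max(scores, key=scores.get))
        return np.array(preds)
```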


Naïve Gaussian means?


Naïve:

$$P(X_1, X_2, \cdots, X_p \mid C = c_i) = P(X_1 \mid C)\, P(X_2 \mid C) \cdots P(X_p \mid C) = \prod_{j} \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\left( -\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2} \right)$$

Not naïve: $P(X_1, X_2, \cdots, X_p \mid C)$ with a full covariance matrix.

Naïve means $\Sigma_j = \Lambda_j$, a diagonal matrix: each class's covariance matrix is diagonal.


Today: Relevant classifiers / KNN / LOOCV

✓ Logistic regression (cont.)
✓ Gaussian Naïve Bayes Classifier
  – Gaussian distribution
  – Gaussian NBC
  – LDA, QDA, RDA
  – Discriminative vs. Generative
✓ K-nearest neighbor
✓ LOOCV

   


If the covariance matrices are not the identity but are the same across classes → LDA (Linear Discriminant Analysis)

[Figure: contours for class k and class l under an identity covariance, and under a shared non-identity covariance]

Each class's covariance matrix is the same.


Optimal Classification – Note

$$\arg\max_k P(C_k \mid X) = \arg\max_k P(X, C_k) = \arg\max_k P(X \mid C_k)\, P(C_k)$$


→ The decision boundary between class k and class l, {x : δ_k(x) = δ_l(x)}, is linear.

$$\log \frac{P(C_k \mid X)}{P(C_l \mid X)} = \log \frac{P(X \mid C_k)}{P(X \mid C_l)} + \log \frac{P(C_k)}{P(C_l)}$$

Boundary points X: where P(C_k | X) = P(C_l | X), i.e., where the left-hand side above equals 0; this is a linear equation in x, so the boundary is a line.
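Filling in the step the slide alludes to: with Gaussian class-conditionals sharing one covariance Σ, the quadratic terms cancel and the log-ratio is linear in x:

$$\log \frac{P(C_k \mid X = x)}{P(C_l \mid X = x)} = x^T \Sigma^{-1} (\mu_k - \mu_l) - \tfrac{1}{2} (\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + \log \frac{P(C_k)}{P(C_l)}$$

Setting this to 0 gives a linear equation in x, hence a linear decision boundary.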


Visualization (three classes)


If the covariance matrices are neither the identity nor the same across classes → QDA (Quadratic Discriminant Analysis)


LDA on Expanded Basis

LDA with a quadratic basis versus QDA


Regularized  Discriminant  Analysis


Today: Relevant classifiers / KNN / LOOCV

✓ Logistic regression (cont.)
✓ Gaussian Naïve Bayes Classifier
  – Gaussian distribution
  – Gaussian NBC
  – LDA, QDA
  – Discriminative vs. Generative
✓ K-nearest neighbor
✓ LOOCV

   


LDA vs. Logistic Regression


Discriminative vs. Generative

[Figure: the generative approach fits a Gaussian to the height of each class; the discriminative approach fits Pr directly with logistic regression]


Discriminative vs. Generative

● Definitions
  ○ h_gen and h_dis: generative and discriminative classifiers
  ○ h_gen,inf and h_dis,inf: the same classifiers, but trained on the entire population (asymptotic classifiers)
  ○ As n → infinity, h_gen → h_gen,inf and h_dis → h_dis,inf

Ng, Andrew Y., and Michael I. Jordan. "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes." Advances in Neural Information Processing Systems 14 (2002): 841.

Discriminative vs. Generative

Proposition 1:

Proposition 2:

– p : number of dimensions
– n : number of observations
– ϵ : generalization error


Logistic Regression vs. NBC

Discriminative classifier (logistic regression):
– Smaller asymptotic error
– Slow convergence: needs a training set of size ~ O(p)

Generative classifier (naive Bayes):
– Larger asymptotic error
– Can handle missing data (EM)
– Fast convergence: needs a training set of size ~ O(lg(p))

[Figure: generalization error vs. size of training set; naive Bayes converges faster, logistic regression reaches a lower asymptote]


Xue, Jing-Hao, and D. Michael Titterington. "Comment on 'On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes'." Neural Processing Letters 28.3 (2008): 169-187.

[Figure: generalization error vs. size of training set, as revisited in this comment]

Logistic Regression vs. NBC

● Empirically, generative classifiers approach their asymptotic error faster than discriminative ones
  ○ Good for small training sets
  ○ Handle missing data well (EM)
● Empirically, discriminative classifiers have lower asymptotic error than generative ones
  ○ Good for larger training sets


Today: Generative vs. Discriminative / KNN / LOOCV

✓ Logistic regression (cont.)
✓ Gaussian Naïve Bayes Classifier
  – Gaussian distribution
  – Gaussian NBC
  – LDA, QDA
  – Discriminative vs. Generative
✓ K-nearest neighbor
✓ LOOCV

   


Nearest neighbor classifiers

• Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck

[Figure: a test sample; compute its distance to the training samples and choose the k "nearest" ones]


Nearest neighbor classifiers

Requires three inputs:
1. The set of stored training samples
2. A distance metric to compute the distance between samples
3. The value of k, i.e., the number of nearest neighbors to retrieve

[Figure: unknown record among the labeled training samples]


Nearest neighbor classifiers

To classify an unknown sample:
1. Compute its distance to the training records
2. Identify the k nearest neighbors
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

[Figure: unknown record and its nearest neighbors]


Definition of nearest neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor around a test point x]

The k-nearest neighbors of a sample x are the data points that have the k smallest distances to x.


1-nearest neighbor: the decision regions form a Voronoi diagram


Nearest neighbor classification

• Compute the distance between two points, for instance the Euclidean distance:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}$$

• Options for determining the class from the nearest-neighbor list:
  – Take a majority vote of the class labels among the k nearest neighbors
  – Weight the votes according to distance, e.g., weight factor w = 1/d²

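A short sketch of k-NN with both voting options described above (toy function; the epsilon guards against a zero distance):

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Classify x by majority (or 1/d^2-weighted) vote of its k nearest neighbors."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(d)[:k]                     # indices of the k smallest distances
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] ** 2 + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)
```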

Nearest neighbor classification

• Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points
  – If k is too large, the neighborhood may include points from other classes

[Figure: a test point X with neighborhoods of increasing size]


Nearest neighbor classification

• Scaling issues
  – Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
  – Example:
    • the height of a person may vary from 1.5 m to 1.8 m
    • the weight of a person may vary from 90 lb to 300 lb
    • the income of a person may vary from $10K to $1M

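A minimal illustration of rescaling before computing distances (hypothetical values using the ranges above):

```python
import numpy as np

# columns: height (m), weight (lb), income ($) -- wildly different scales
X = np.array([[1.6, 120.0,  20_000.0],
              [1.8, 280.0, 950_000.0],
              [1.7, 150.0,  45_000.0]])

# min-max scale each attribute to [0, 1] so no single attribute dominates the distance
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)
```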

Nearest neighbor classification…

• Problem with the Euclidean measure:
  – High-dimensional data: the curse of dimensionality
  – It can produce counter-intuitive results, e.g.:

1 1 1 1 1 1 1 1 1 1 1 0        1 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 1 1   vs.  0 0 0 0 0 0 0 0 0 0 0 1
d = 1.4142                     d = 1.4142

• One solution: normalize the vectors to unit length


Nearest neighbor classification

• The k-nearest neighbor classifier is a lazy learner
  – It does not build a model explicitly
  – Unlike eager learners such as decision tree induction and rule-based systems
  – Classifying unknown samples is relatively expensive
• The k-nearest neighbor classifier is a local model, vs. the global model of linear classifiers


Decision boundaries in global vs. local models

• linear regression: global, stable, can be inaccurate
• 15-nearest neighbor
• 1-nearest neighbor: local, accurate, unstable

What ultimately matters: GENERALIZATION


K-Nearest-Neighbours for Classification (2)

[Figure: decision regions for K = 1 and K = 3]


K-Nearest-Neighbours for Classification (3)

• K acts as a smoother
• As N → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions).


Today: Generative vs. Discriminative / KNN / LOOCV

✓ Logistic regression (cont.)
✓ Gaussian Naïve Bayes Classifier
  – Gaussian distribution
  – Gaussian NBC
  – LDA, QDA
  – Discriminative vs. Generative
✓ K-nearest neighbor
✓ LOOCV

   


Cross-validation (e.g., K = 3)

• k-fold cross-validation

[Figure: the dataset split into K folds; each fold serves once as the test set while the others train]



Common Splitting Strategies

• Leave-one-out (n-fold cross validation)


Leave-one-out cross validation

• Leave-one-out cross validation (LOOCV) is K-fold cross validation taken to its logical extreme, with K equal to n, the number of data points in the set.

• That means that, n separate times, the function approximator is trained on all the data except for one point, and a prediction is made for that point.

• As before, the average error is computed and used to evaluate the model.

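A minimal LOOCV sketch, reusing the knn_predict function sketched earlier (any classifier with the same train-on-the-rest/predict-one shape would do):

```python
import numpy as np

def loocv_error(X, y, k=1):
    """Leave-one-out CV: train on all points but one, predict the held-out point."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                 # leave point i out
        pred = knn_predict(X[mask], y[mask], X[i], k=k)
        errors += int(pred != y[i])
    return errors / len(X)                            # the average error
```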


CV-based Model Selection

We're trying to decide which algorithm to use.
• We train each machine and make a table…


Which kind of cross-validation?


Today Recap: Generative vs. Discriminative / KNN / LOOCV

✓ Logistic regression (cont.)
✓ Gaussian Naïve Bayes Classifier
  – Gaussian distribution
  – Gaussian NBC
  – LDA, QDA
  – Discriminative vs. Generative
✓ K-nearest neighbor
✓ LOOCV

   


References    

• Prof. Tan, Steinbach, Kumar's "Introduction to Data Mining" slides
• Prof. Andrew Moore's slides
• Prof. Eric Xing's slides
• Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2. No. 1. New York: Springer, 2009.
