

Fundamental Supervised Machine Learning Models
Kent Gauen and Dr. Xiao Wang

Department of Statistics, Purdue University, West Lafayette, IN

Introduction

- Commonplace tool for solving complex tasks such as natural language processing and image classification (de Freitas (2015), Ng, Hinton (2012))
- Significant tool in bioinformatics, biology, quality control, AI applications, data compression, and many more
- Deep learning outperforms people on "human" skills, such as the board game Go and some Atari games like Space Invaders
- Fill the gap of understanding between undergraduates in their research and the machine learning tools they use
- Investigate the performance of various models to distinguish their strengths and weaknesses

Basics

- Model: the system of weights and bias terms used to convert input into output
- Cost function (criterion): quantitatively measures the performance of the model, "how well it does"
- Training: the process of learning model weights and bias terms so that they minimize the cost function (see the gradient-descent sketch below)
- Generalization: how well a model performs on new data
- Over-fitting: low cost on the data used for training, but large cost on new data
- Regularization: discourages over-fitting
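As a concrete illustration of training, the sketch below fits a one-weight linear model by plain gradient descent on a mean-squared-error cost; the toy data, learning rate, and step count are invented for the example and are not from the poster.

```python
import numpy as np

# Toy 1-D data: y is roughly 2*x + 1 plus noise (invented for the example).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0        # model: y_hat = w * x + b (one weight, one bias term)
lr = 0.1               # learning rate (arbitrary choice)

for step in range(200):
    y_hat = w * x + b
    cost = np.mean((y_hat - y) ** 2)          # cost function (criterion)
    grad_w = np.mean(2 * (y_hat - y) * x)     # gradient of the cost w.r.t. w
    grad_b = np.mean(2 * (y_hat - y))         # gradient of the cost w.r.t. b
    w -= lr * grad_w                          # training: step downhill on the cost
    b -= lr * grad_b

print(f"learned w = {w:.2f}, b = {b:.2f}, final cost = {cost:.4f}")
```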

Data: Training, Cross Validation, Testing

The data set is split into 3 sets: training set, cross validation set, and testing set.
General data-set split: 80% / 10% / 10% (a split sketch follows below)
- Training set: tunes model parameters
- Cross validation set: regularizes the model to prevent over-fitting
- Testing set: tests the generalization of the model
Data pairs: $(\vec{x}_i, \vec{y}_i)$ for the $i$-th example out of $m$ total in the data set
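A minimal sketch of the 80% / 10% / 10% split, assuming the examples are already loaded as NumPy arrays X and Y; the array names, shapes, and random seed are placeholders, not the poster's code.

```python
import numpy as np

def split_dataset(X, Y, seed=0):
    """Shuffle and split (X, Y) into 80% training, 10% cross validation, 10% testing."""
    m = X.shape[0]
    idx = np.random.default_rng(seed).permutation(m)
    n_train, n_cv = int(0.8 * m), int(0.9 * m)
    train, cv, test = idx[:n_train], idx[n_train:n_cv], idx[n_cv:]
    return (X[train], Y[train]), (X[cv], Y[cv]), (X[test], Y[test])

# Example with placeholder data shaped like MNIST vectors.
X = np.random.rand(1000, 784)
Y = np.random.randint(0, 10, size=1000)
train_set, cv_set, test_set = split_dataset(X, Y)
print(train_set[0].shape, cv_set[0].shape, test_set[0].shape)  # (800, 784) (100, 784) (100, 784)
```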

MNIST Data set

- Little to no pre-processing required (thanks to Lecun et al. (1998))
- Straightforward task: what class is this digit?
- Non-trivial task: issues of scale and location invariance
- Common performance benchmark for new image classification methods

$\vec{x}_i$: a $1 \times 28^2$ (i.e. $1 \times 784$) input vector
$\vec{y}_i$: a $1 \times k$ label vector with $k = 10$, where $y_{i,j} \in \{0, 1\}$ and $\sum_{\forall j} y_{i,j} = 1$
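The sketch below shows how a 28 × 28 digit becomes a 1 × 784 input vector and how an integer class label becomes a one-hot 1 × 10 label vector; the image here is random placeholder data rather than an actual MNIST digit.

```python
import numpy as np

image = np.random.rand(28, 28)      # placeholder for one 28 x 28 MNIST digit
label = 7                           # placeholder class label

x_i = image.reshape(1, 28 * 28)     # 1 x 784 input vector

k = 10
y_i = np.zeros((1, k))              # 1 x 10 one-hot label vector
y_i[0, label] = 1                   # entries lie in {0, 1} and sum to 1

assert x_i.shape == (1, 784) and y_i.sum() == 1
```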

Acknowledgments

The research of the authors is supported by NSF grant DMS #1246818.

Neural Networks

[Figure: a feed-forward neural network with a 4-node input layer, two 5-node hidden layers, and a 2-node prediction layer.]

- Characterized by model structure and cost function
- Non-linear activation functions create the complex decision boundaries characteristic of neural network models
- $\sum = w_{0j} + w_{1j} x_1 + w_{2j} x_2 + \dots + w_{nj} x_n$ and $o_j = \phi\!\left(\sum\right)$, where $o_j$ is the output of the $j$-th node in a layer (see the sketch below)
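A minimal sketch of that node computation for one fully connected layer: each node j forms the weighted sum w_{0j} + Σ_p w_{pj} x_p and passes it through a non-linear activation φ. The tanh activation and the random weights are assumptions made for the example.

```python
import numpy as np

def layer_forward(x, W, b, phi=np.tanh):
    """One layer: o_j = phi(b_j + sum_p W[p, j] * x[p]) for every node j."""
    s = b + x @ W          # weighted sum (plus bias) for each node in the layer
    return phi(s)          # non-linear activation shapes the decision boundary

# Example sized like the diagram above: 4 inputs feeding a 5-node hidden layer.
x = np.random.rand(4)
W = 0.1 * np.random.randn(4, 5)
b = np.zeros(5)
o = layer_forward(x, W, b)
print(o.shape)             # (5,)
```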

Logistic Regression

[Figure: logistic regression drawn as a single-layer network: a 6-node input layer connected directly to a 4-node prediction layer, which feeds the cost.]

Model and Cost Function

Cross entropy:
$$\Theta(\vec{x}_i, \vec{y}_i) = -\sum_{i=1}^{m} \sum_{j=1}^{k} y_{i,j} \ln(\pi_{i,j}) + \lambda \lVert \theta_j \rVert^2$$

Softmax:
$$\Theta(\vec{x}_i) = \pi_{i,j} = \frac{e^{-\theta_j^T \vec{x}_i}}{\sum_j e^{-\theta_j^T \vec{x}_i}}$$

Sigmoid function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

- "Compresses" output to the interval [0, 1]
- Enables a probabilistic interpretation of discrete functions, such as "yes" or "no" in a binomial setting
- Foundation for classification and feature detection
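A sketch of the softmax and cross-entropy formulas above for a single example, written to follow the poster's sign convention with exp(-θ_j^T x_i); the weight initialization, the small stabilizing constants, and the value of λ are assumptions made for the example.

```python
import numpy as np

def softmax_probs(theta, x):
    """pi_j = exp(-theta_j . x) / sum_j exp(-theta_j . x), as in the formula above."""
    z = -(theta @ x)
    z = z - z.max()                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(pi, y, theta, lam=1e-3):
    """-sum_j y_j * ln(pi_j) plus an L2 penalty on the weights."""
    return -np.sum(y * np.log(pi + 1e-12)) + lam * np.sum(theta ** 2)

# Example: k = 10 classes, 784-dimensional input, as in the MNIST setup.
theta = 0.01 * np.random.randn(10, 784)
x = np.random.rand(784)
y = np.zeros(10)
y[3] = 1                               # one-hot label

pi = softmax_probs(theta, x)
print(cross_entropy(pi, y, theta))
```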

Convolutional Neural Network

[Figure: an input image is convolved with a filter to give a convolved image, which pooling then down-samples to a smaller pooled image (blocks A, B, C, D of the convolved image map to single entries of the pooled image).]

- A neural network which includes convolutional layers
- Filters (a 3 × 3 filter in the figure above) convolve across the input image
- A type of feature extraction, reducing the need for hand-engineered features
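A minimal sketch of one convolution followed by one 2 × 2 max-pooling step, written with plain NumPy loops for clarity; the image and filter values are placeholders, and, like most deep-learning libraries, the "convolution" is implemented as a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small filter across the image (valid mode, stride 1)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(image, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    H, W = image.shape
    H, W = H - H % size, W - W % size
    blocks = image[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

image = np.random.rand(28, 28)     # placeholder input image
kernel = np.random.randn(3, 3)     # 3 x 3 filter, as in the figure
pooled = max_pool(convolve2d(image, kernel))
print(pooled.shape)                # (13, 13)
```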

Support Vector Machine

[Figure: two classes of points in the (x1, x2) plane separated by a max-margin linear decision boundary.]
- Max-margin classifier: the "best" linear decision boundary
- Kernel trick: transforming data into a linearly separable space
- Hinge loss: allows for some misclassification if the data classes overlap

Hinge loss:
$$\Theta = \frac{1}{n} \sum_{i=1}^{n} \max\!\left(0,\; 1 - y_i \left(\vec{\theta} \cdot \vec{x}_i - b\right)\right) + \lambda \lVert \vec{\theta}_j \rVert^2$$

Kernel trick: $\vec{x}_i \rightarrow l_i$ and
$$f_j = \exp\!\left(-\frac{\lVert \vec{x}_i - l_j \rVert^2}{2\sigma^2}\right) \quad \text{[Gaussian]}$$
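A sketch of the hinge loss and Gaussian kernel features above, assuming labels y_i in {-1, +1}; the data, landmark choice, σ, and λ are placeholders for illustration.

```python
import numpy as np

def hinge_loss(theta, b, X, y, lam=1e-3):
    """(1/n) * sum_i max(0, 1 - y_i * (theta . x_i - b)) + lambda * ||theta||^2."""
    margins = y * (X @ theta - b)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + lam * np.sum(theta ** 2)

def gaussian_features(X, landmarks, sigma=1.0):
    """f_j = exp(-||x_i - l_j||^2 / (2 sigma^2)) for every example / landmark pair."""
    d2 = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Example: 100 two-dimensional points with labels in {-1, +1} and 5 landmarks.
X = np.random.randn(100, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
F = gaussian_features(X, landmarks=X[:5])   # kernel-trick feature map
theta = np.zeros(F.shape[1])
print(hinge_loss(theta, 0.0, F, y))
```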

Model Fitness Overview

[Figure: four panels (Logistic Regression, Support Vector Machine*, Multi-Layer Perceptron, Convolutional Neural Network), each plotting Log[Cost(x)] against the number of epochs for the training and cross validation sets.]

Results on MNIST

Model | Training Accuracy | Testing Accuracy | CV Accuracy | # Misclassified
Logit | 91.25% | 88.79% | 88.13% | 6683
SVM   | 99.00% | 96.69% | 96.29% | 1202
MLP   | 99.12% | 96.74% | 96.47% | 1119
CNN   | 99.99% | 98.74% | 98.69% | 307

Accuracy = (# correctly classified) / (# in data subset)
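For reference, a minimal sketch of how the accuracy and misclassification count in the table follow from the predicted and true labels; the label arrays here are random placeholders.

```python
import numpy as np

# Placeholder predicted and true labels for one data subset.
y_true = np.random.randint(0, 10, size=10000)
y_pred = np.random.randint(0, 10, size=10000)

accuracy = np.mean(y_pred == y_true)            # (# correctly classified) / (# in subset)
n_misclassified = int(np.sum(y_pred != y_true))
print(f"accuracy = {accuracy:.2%}, misclassified = {n_misclassified}")
```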

References

N. de Freitas. Machine learning, 2015. URL https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/.

G. Hinton. Neural networks for machine learning, June 2012. URL https://class.coursera.org/neuralnets-2012-001.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

A. Ng. Machine learning. URL https://www.coursera.org/learn/machine-learning.