
Page 1: MLHEP 2015: Introductory Lecture #2

MACHINE LEARNING IN HIGH ENERGY PHYSICS
LECTURE #2

Alex Rogozhnikov, 2015

Page 2: MLHEP 2015: Introductory Lecture #2

RECAPITULATION

- classification, regression
- kNN classifier and regressor
- ROC curve, ROC AUC

Page 3: MLHEP 2015: Introductory Lecture #2

OPTIMAL BAYESIAN CLASSIFIER

Given knowledge about the distributions, we can build the optimal classifier:

$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)}$$

But the distributions are complex and contain many parameters.

Page 4: MLHEP 2015: Introductory Lecture #2

QDA

QDA follows the generative approach: each class is modelled with its own multivariate Gaussian, and Bayes' rule gives the classifier.
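
A minimal sketch (not the lecture's own code) of this generative approach with scikit-learn's QuadraticDiscriminantAnalysis; the toy data below is made up for illustration.

import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# two synthetic classes, each roughly Gaussian
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),    # class 0
               rng.normal(2.0, 1.5, size=(100, 2))])   # class 1
y = np.array([0] * 100 + [1] * 100)

# QDA fits one multivariate Gaussian per class and applies Bayes' rule
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(qda.predict_proba(X[:5]))   # posterior probabilities p(y | x)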

Page 5: MLHEP 2015: Introductory Lecture #2

LOGISTIC REGRESSION

Decision function: $d(x) = \langle w, x \rangle + w_0$

Sharp rule: $\hat{y} = \mathrm{sgn}\, d(x)$

Page 6: MLHEP 2015: Introductory Lecture #2

LOGISTIC REGRESSION

Smooth rule:

$$d(x) = \langle w, x \rangle + w_0$$

$$p_{+1}(x) = \sigma(d(x)), \qquad p_{-1}(x) = \sigma(-d(x))$$

Optimizing weights $w, w_0$ to maximize the log-likelihood:

$$\mathcal{L} = \frac{1}{N} \sum_{i \in \text{events}} -\ln\big(p_{y_i}(x_i)\big) = \frac{1}{N} \sum_i L(x_i, y_i) \to \min$$

Page 7: MLHEP 2015: Introductory Lecture #2

LOGISTIC LOSS

Loss penalty for a single observation:

$$L(x_i, y_i) = -\ln\big(p_{y_i}(x_i)\big) = \begin{cases} \ln(1 + e^{-d(x_i)}), & y_i = +1 \\ \ln(1 + e^{+d(x_i)}), & y_i = -1 \end{cases}$$
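
A minimal numpy sketch (not from the slides) of the sigmoid and of this per-event loss, written compactly as ln(1 + exp(-y_i d(x_i))) for labels y_i in {-1, +1}:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(d, y):
    # ln(1 + e^{-d}) for y = +1, ln(1 + e^{+d}) for y = -1
    return np.log1p(np.exp(-y * d))

d = np.array([2.0, -1.0, 0.5])   # decision function values d(x_i)
y = np.array([+1, -1, -1])       # true labels
print(sigmoid(d))                # p_{+1}(x_i)
print(logistic_loss(d, y))       # per-event losses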

Page 8: MLHEP 2015: Introductory Lecture #2

GRADIENT DESCENT & STOCHASTIC OPTIMIZATION

Problem: find $w$ that minimizes $\mathcal{L}$.

$$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$$

$\eta$ is the step size (also `shrinkage`, `learning rate`).

Page 9: MLHEP 2015: Introductory Lecture #2

STOCHASTIC GRADIENT DESCENT

$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) \to \min$$

On each iteration make a step with respect to only one event:

1. take a random event $i$ from the training data
2. $w \leftarrow w - \eta \frac{\partial L(x_i, y_i)}{\partial w}$

Each iteration is done much faster, but the training process is less stable.
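
A rough numpy sketch of this procedure for logistic regression (my own illustration, assuming labels in {-1, +1}, a fixed learning rate and no regularization):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, eta=0.1, n_iter=10000, seed=0):
    # X: (N, d) features with a constant column appended, y: labels in {-1, +1}
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.randint(len(X))                       # 1. take a random event
        d = np.dot(w, X[i])
        grad = -y[i] * X[i] * sigmoid(-y[i] * d)      # gradient of ln(1 + e^{-y_i d(x_i)})
        w = w - eta * grad                            # 2. step against the gradient
    return w

# toy usage with made-up data
rng = np.random.RandomState(1)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])  # last column = intercept
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
print(sgd_logistic(X, y))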

Page 10: MLHEP 2015: Introductory Lecture #2

POLYNOMIAL DECISION RULE

$$d(x) = w_0 + \sum_i w_i x_i + \sum_{ij} w_{ij} x_i x_j$$

This is again a linear model; introduce new features

$$z = \{1\} \cup \{x_i\}_i \cup \{x_i x_j\}_{ij}, \qquad d(x) = \sum_i w_i z_i$$

and reuse logistic regression.

We can add $x_0 = 1$ as one more variable to the dataset and forget about the intercept:

$$d(x) = w_0 + \sum_{i=1}^{N} w_i x_i = \sum_{i=0}^{N} w_i x_i$$
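
One way to build such a polynomial rule in practice (a sketch, not the lecture's code) is to generate the products x_i x_j explicitly and reuse a linear classifier, e.g. with scikit-learn:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# degree=2 adds the constant 1, the x_i and all products x_i * x_j as new features z
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())

# toy data that is not linearly separable in the original features
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

model.fit(X, y)
print(model.score(X, y))   # the quadratic features make the classes (almost) separable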

Page 11: MLHEP 2015: Introductory Lecture #2

PROJECTING IN HIGHER DIMENSION SPACE

SVM with polynomial kernel (visualization)

After adding new features, classes may become separable.

Page 12: MLHEP 2015: Introductory Lecture #2

KERNEL TRICK

$P$ is a projection operator (which adds new features):

$$d(x) = \langle w, P(x) \rangle$$

Assume $w = \sum_i \alpha_i P(x_i)$ and look for the optimal $\alpha_i$:

$$d(x) = \sum_i \alpha_i \langle P(x_i), P(x) \rangle = \sum_i \alpha_i K(x_i, x)$$

We need only the kernel:

$$K(x, y) = \langle P(x), P(y) \rangle$$

Page 13: MLHEP 2015: Introductory Lecture #2

KERNEL TRICK

A popular kernel is the Gaussian Radial Basis Function:

$$K(x, y) = e^{-c\,||x - y||^2}$$

It corresponds to a projection into a Hilbert space.

Exercise: find a corresponding projection.

Page 14: MLHEP 2015: Introductory Lecture #2

SUPPORT VECTOR MACHINE

SVM selects the decision rule with the maximal possible margin.

Page 15: MLHEP 2015: Introductory Lecture #2

HINGE LOSS FUNCTION

SVM uses a different loss function (only the losses for signal are compared):
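
For intuition (a sketch, not from the slides), the hinge loss used by SVM next to the logistic loss, evaluated for a signal event y = +1 as a function of the decision function d(x):

import numpy as np

d = np.linspace(-3, 3, 7)            # decision function values for a y = +1 event
logistic = np.log1p(np.exp(-d))      # ln(1 + e^{-d})
hinge = np.maximum(0.0, 1.0 - d)     # max(0, 1 - d): exactly zero once the margin exceeds 1
for row in zip(d, logistic, hinge):
    print("d = %+.1f   logistic = %.3f   hinge = %.3f" % row)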

Page 16: MLHEP 2015: Introductory Lecture #2

SVM + RBF KERNEL

Page 17: MLHEP 2015: Introductory Lecture #2

SVM + RBF KERNEL
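
A minimal scikit-learn sketch of such a classifier (the dataset and parameter values are arbitrary, not those used for the plots):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# kernel='rbf' corresponds to K(x, y) = exp(-gamma * ||x - y||^2)
clf = SVC(kernel='rbf', gamma=1.0, C=1.0)
clf.fit(X, y)
print(clf.score(X, y))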

Page 18: MLHEP 2015: Introductory Lecture #2

OVERFITTING

kNN with k=1 gives ideal classification of the training data.

Page 19: MLHEP 2015: Introductory Lecture #2

OVERFITTING

Page 20: MLHEP 2015: Introductory Lecture #2

There are two definitions of overfitting, which often coincide.

DIFFERENCE-OVERFITTING
There is a significant difference in quality of predictions between train and test.

COMPLEXITY-OVERFITTING
The formula has too high complexity (e.g. too many parameters); increasing the number of parameters drives the quality down.

Page 21: MLHEP 2015: Introductory Lecture #2

MEASURING QUALITY
To get an unbiased estimate, one should test the formula on independent samples (and be sure that no information about the test sample was given to the algorithm during training).

In most cases, simply splitting the data into train and holdout is enough.

More approaches in the seminar.
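
A typical train/holdout split might look like this (a sketch; the data here is a placeholder and the quality metric is the ROC AUC from lecture #1):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))                   # placeholder features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression().fit(X_train, y_train)
# quality is measured on the holdout, which was never shown to the classifier
print(roc_auc_score(y_test, clf.decision_function(X_test)))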

Page 22: MLHEP 2015: Introductory Lecture #2

Difference-overfitting is inessential, provided that we measure quality on a holdout (though it is easy to check).

Complexity-overfitting is a problem: we need to test different parameters for optimality (more examples through the course).

Don't use distribution comparison to detect overfitting.

Page 23: MLHEP 2015: Introductory Lecture #2
Page 24: MLHEP 2015: Introductory Lecture #2

REGULARIZATION
When the number of weights is high, overfitting is very probable.

Add a regularization term to the loss function:

$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) + \mathcal{L}_{reg} \to \min$$

L2 regularization: $\mathcal{L}_{reg} = \alpha \sum_j |w_j|^2$

L1 regularization: $\mathcal{L}_{reg} = \beta \sum_j |w_j|$

L1 + L2 regularization: $\mathcal{L}_{reg} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$
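
A sketch of these penalty terms in numpy (the weights and the coefficients alpha, beta are arbitrary numbers for illustration):

import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])   # some weight vector
alpha, beta = 0.01, 0.1               # regularization strengths

l2_reg = alpha * np.sum(w ** 2)       # L2: alpha * sum_j |w_j|^2
l1_reg = beta * np.sum(np.abs(w))     # L1: beta  * sum_j |w_j|
l1_l2_reg = l2_reg + l1_reg           # L1 + L2 combination

print(l2_reg, l1_reg, l1_l2_reg)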

Page 25: MLHEP 2015: Introductory Lecture #2

L2, L1 REGULARIZATIONS

L2 regularization; L1 (solid), L1 + L2 (dashed)

Page 26: MLHEP 2015: Introductory Lecture #2

REGULARIZATIONS
L1 regularization encourages sparsity.

Page 27: MLHEP 2015: Introductory Lecture #2
Page 28: MLHEP 2015: Introductory Lecture #2

Lp REGULARIZATIONS

$$\mathcal{L}_p = \sum_i |w_i|^p$$

What is the expression for L0?

$$\mathcal{L}_0 = \sum_i [w_i \neq 0]$$

But nobody uses Lp with 0 < p < 1, not even L0. Why? Because it is not convex.

Page 29: MLHEP 2015: Introductory Lecture #2

LOGISTIC REGRESSION

- classifier based on a linear decision rule
- training is reduced to convex optimization
- other decision rules are achieved by adding new features
- stochastic optimization is used
- can handle > 1000 features, requires regularization
- no interaction between features

Page 30: MLHEP 2015: Introductory Lecture #2

[ARTIFICIAL] NEURAL NETWORKS
Based on our understanding of natural neural networks:

- neurons are organized in networks
- receptors activate some neurons, these neurons activate other neurons, etc.
- connections are via synapses

Page 31: MLHEP 2015: Introductory Lecture #2

STRUCTURE OF ARTIFICIAL FEED-FORWARD NETWORK

Page 32: MLHEP 2015: Introductory Lecture #2

ACTIVATION OF NEURON

Neuron states:

$$n = \begin{cases} 1, & \text{activated} \\ 0, & \text{not activated} \end{cases}$$

Let $n_i$ be the state of the $i$-th neuron and $w_i$ the weight of the connection between the $i$-th neuron and the output neuron:

$$n = \begin{cases} 1, & \sum_i w_i n_i > 0 \\ 0, & \text{otherwise} \end{cases}$$

Problem: find the set of weights that minimizes the error on the train dataset (discrete optimization).

Page 33: MLHEP 2015: Introductory Lecture #2

SMOOTH ACTIVATIONS: ONE HIDDEN LAYER

$$h_i = \sigma\Big(\sum_j w_{ij} x_j\Big)$$

$$y_i = \sigma\Big(\sum_j v_{ij} h_j\Big)$$
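
A sketch of this forward pass in numpy (the weights are random and the layer sizes are arbitrary, just to show the computation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
x = rng.normal(size=5)         # input features x_j
W = rng.normal(size=(3, 5))    # weights w_ij: 3 hidden neurons, 5 inputs
V = rng.normal(size=(2, 3))    # weights v_ij: 2 outputs, 3 hidden neurons

h = sigmoid(W.dot(x))          # h_i = sigma(sum_j w_ij x_j)
y = sigmoid(V.dot(h))          # y_i = sigma(sum_j v_ij h_j)
print(h, y)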

Page 34: MLHEP 2015: Introductory Lecture #2

VISUALIZATION OF NN

Page 35: MLHEP 2015: Introductory Lecture #2

NEURAL NETWORKS

- Powerful general-purpose algorithm for classification and regression
- Non-interpretable formula
- Optimization problem is non-convex, with local optima, and has many parameters
- Stochastic optimization speeds up the process and helps not to get caught in a local minimum
- Overfitting due to the large number of parameters: L1, L2 regularizations (and other tricks)

Page 36: MLHEP 2015: Introductory Lecture #2

x MINUTES BREAK

Page 37: MLHEP 2015: Introductory Lecture #2

DEEP LEARNING
The gradient diminishes as the number of hidden layers grows. Usually 1-2 hidden layers are used.

But modern ANNs for image recognition have 7-15 layers.

Page 38: MLHEP 2015: Introductory Lecture #2
Page 39: MLHEP 2015: Introductory Lecture #2

CONVOLUTIONAL NEURAL NETWORK

Page 40: MLHEP 2015: Introductory Lecture #2
Page 41: MLHEP 2015: Introductory Lecture #2

DECISION TREES
Example: predict outside play based on weather conditions.

Page 42: MLHEP 2015: Introductory Lecture #2
Page 43: MLHEP 2015: Introductory Lecture #2

DECISION TREES: IDEA

Page 44: MLHEP 2015: Introductory Lecture #2

DECISION TREES

Page 45: MLHEP 2015: Introductory Lecture #2

DECISION TREES

Page 46: MLHEP 2015: Introductory Lecture #2

DECISION TREE

- fast & intuitive prediction
- building the optimal decision tree is NP-complete
- building the tree from the root using greedy optimization:
  - each time we split one leaf, finding the optimal feature and threshold
  - need a criterion to select the best splitting (feature, threshold)

Page 47: MLHEP 2015: Introductory Lecture #2

SPLITTING CRITERIA

$$\text{TotalImpurity} = \sum_{\text{leaf}} \text{impurity(leaf)} \times \text{size(leaf)}$$

$$\text{Misclass.} = \min(p, 1 - p)$$
$$\text{Gini} = p(1 - p)$$
$$\text{Entropy} = -p \log p - (1 - p) \log(1 - p)$$
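
These impurities as functions of the signal fraction p in a leaf, as a small numpy sketch (the clipping just avoids log(0)):

import numpy as np

def misclassification(p):
    return np.minimum(p, 1 - p)

def gini(p):
    return p * (1 - p)

def entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0) at p = 0 or 1
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

p = np.linspace(0, 1, 5)
print(misclassification(p))
print(gini(p))
print(entropy(p))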

Page 48: MLHEP 2015: Introductory Lecture #2

SPLITTING CRITERIA
Why use Gini or entropy rather than misclassification?

Page 49: MLHEP 2015: Introductory Lecture #2

REGRESSION TREE
Greedy optimization (minimizing MSE):

$$\text{GlobalMSE} \sim \sum_i (y_i - \hat{y}_i)^2$$

Can be rewritten as:

$$\text{GlobalMSE} \sim \sum_{\text{leaf}} \text{MSE(leaf)} \times \text{size(leaf)}$$

MSE(leaf) is like the 'impurity' of the leaf:

$$\text{MSE(leaf)} = \frac{1}{\text{size(leaf)}} \sum_{i \in \text{leaf}} (y_i - \hat{y}_i)^2$$

Page 50: MLHEP 2015: Introductory Lecture #2
Page 51: MLHEP 2015: Introductory Lecture #2
Page 52: MLHEP 2015: Introductory Lecture #2
Page 53: MLHEP 2015: Introductory Lecture #2
Page 54: MLHEP 2015: Introductory Lecture #2
Page 55: MLHEP 2015: Introductory Lecture #2

In most cases, regression trees optimize MSE:

$$\text{GlobalMSE} \sim \sum_i (y_i - \hat{y}_i)^2$$

But other options also exist, e.g. MAE:

$$\text{GlobalMAE} \sim \sum_i |y_i - \hat{y}_i|$$

For MAE the optimal value in a leaf is the median, not the mean.
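
A quick numerical check of that last statement (toy numbers, brute-force scan over candidate leaf values):

import numpy as np

y_leaf = np.array([1.0, 2.0, 2.5, 3.0, 10.0])   # targets falling into one leaf

candidates = np.linspace(0, 12, 1201)           # possible leaf predictions y_hat
mse = [np.mean((y_leaf - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(y_leaf - c)) for c in candidates]

print(candidates[np.argmin(mse)], np.mean(y_leaf))     # ~3.7, the mean of the leaf
print(candidates[np.argmin(mae)], np.median(y_leaf))   # 2.5, the median of the leaf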

Page 56: MLHEP 2015: Introductory Lecture #2

DECISION TREES INSTABILITY
A little variation in the training dataset produces a different classification rule.

Page 57: MLHEP 2015: Introductory Lecture #2

PRE-STOPPING OF DECISION TREE
The tree keeps splitting until each event is correctly classified.

Page 58: MLHEP 2015: Introductory Lecture #2

PRE-STOPPING
We can stop the splitting process by imposing different restrictions:

- limit the depth of the tree
- set the minimal number of samples needed to split a leaf
- limit the minimal number of samples in a leaf
- more advanced: maximal number of leaves in the tree

Any combination of the rules above is possible.
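
In scikit-learn's decision tree these restrictions correspond roughly to the following parameters (a sketch; the values are arbitrary examples):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

clf = DecisionTreeClassifier(
    max_depth=5,            # limit the depth of the tree
    min_samples_split=20,   # minimal number of samples needed to split a leaf
    min_samples_leaf=10,    # minimal number of samples in a leaf
    max_leaf_nodes=32,      # maximal number of leaves in the tree
)
clf.fit(X, y)
print(clf.score(X, y))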

Page 59: MLHEP 2015: Introductory Lecture #2

[Figure: pre-stopping comparison; panels: no pre-pruning, max_depth, min # of samples in leaf, maximal number of leaves]

Page 60: MLHEP 2015: Introductory Lecture #2

POST-PRUNING
When the tree is already built, we can try to optimize it to simplify the formula.

Generally, this is much slower than pre-stopping.

Page 61: MLHEP 2015: Introductory Lecture #2
Page 62: MLHEP 2015: Introductory Lecture #2
Page 63: MLHEP 2015: Introductory Lecture #2

SUMMARY OF DECISION TREE

1. Very intuitive algorithm for regression and classification
2. Fast prediction
3. Scale-independent
4. Supports multiclassification

But

1. Training the optimal tree is NP-complete
2. Trained greedily by optimizing Gini index or entropy (fast!)
3. Non-stable
4. Uses only trivial conditions

Page 64: MLHEP 2015: Introductory Lecture #2

MISSING VALUES IN DECISION TREES

If the event being predicted lacks $x_1$, we use prior probabilities.

Page 65: MLHEP 2015: Introductory Lecture #2

FEATURE IMPORTANCES

Different approaches exist to measure the importance of a feature in the final model.

Importance of a feature ≈ quality provided by that feature.

Page 66: MLHEP 2015: Introductory Lecture #2

FEATURE IMPORTANCES

- tree: counting the number of splits made over this feature
- tree: counting the gain in purity (e.g. Gini); fast and adequate
- common recipe: train without one feature, compare quality on test with/without that feature; requires many evaluations
- common recipe: feature shuffling; take one column in the test dataset and shuffle it, then compare quality with/without shuffling (see the sketch below)
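
A sketch of the shuffling recipe (hand-rolled, with synthetic data; not the lecture's code):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
baseline = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

rng = np.random.RandomState(0)
for j in range(X_test.shape[1]):
    X_shuffled = X_test.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])   # destroy this feature's information
    score = roc_auc_score(y_test, clf.predict_proba(X_shuffled)[:, 1])
    print("feature %d: drop in AUC = %.3f" % (j, baseline - score))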

Page 67: MLHEP 2015: Introductory Lecture #2

THE END
Tomorrow: ensembles and boosting