statistical machine learning on large scale datakrishnarajpm.com/bigdata/pramod.pdf · a cousin-...

Statistical Machine Learning on Large Scale Data By Pramod N


Post on 11-Apr-2020


TRANSCRIPT

Page 1: Statistical Machine Learning on Large Scale Datakrishnarajpm.com/bigdata/pramod.pdf · A Cousin- LIBLINEAR A Library for Large Linear Classification The LIBLINEAR package includes

Statistical Machine Learning on Large Scale Data

By Pramod N


Questions!!? What? Why? Where? How? ML-201


Why would machine learning techniques work?

Data Data Data – Nature of Data

Modelling different kinds of problems to form relevant data

Size of Data available for Training


What?: Statistical Machine Learning

• Exploits the nature of data
• A solution for coping with uncertainty
• Data are noisy but follow an underlying distribution
• Employs basic probabilistic decision rules (Bayes' theorem)
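As a small illustration of the Bayes' theorem decision mentioned above, the sketch below turns class priors and likelihoods into posteriors and picks the most probable class. The class names, the numbers, and the function name `posterior` are made up for illustration and do not come from the slides.

```python
def posterior(priors, likelihoods):
    """Return P(class | x) for each class via Bayes' theorem.

    priors:      {class: P(class)}
    likelihoods: {class: P(x | class)} for one observed x
    """
    # Unnormalized posterior: P(x | class) * P(class)
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {c: joint[c] / evidence for c in joint}


# Hypothetical two-class example (illustrative values only).
priors = {"spam": 0.2, "ham": 0.8}
likelihoods = {"spam": 0.6, "ham": 0.1}  # P(observed word | class)

post = posterior(priors, likelihoods)
decision = max(post, key=post.get)  # class with the highest posterior
print(post, decision)
```

Even though "ham" has the larger prior here, the observation is so much more likely under "spam" that the posterior favours "spam".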


Why?: Use Statistical ML

• Models are mature
• Mathematical support to optimize
• Can handle multi-dimensionality efficiently


Types and Models

SVM - Support Vector Machines
• libSVM
• LIBLINEAR

GMM - Gaussian Mixture Models
• K-means
• ISODATA


Support Vector Machines

Mechanism

• Developed by Vladimir N. Vapnik as early as 1979
• Current implementations are based on Vapnik and Corinna Cortes' work in 1995
• Separation by hyperplanes
• Maximize the margin of classification


Classification

Choice of boundary


Are these really “equally valid”?

How?: SVMs

Hyperplane


Max Margin

How can we pick which is best?

Maximize the size of the margin.


Are these really “equally valid”?

Small Margin

Large Margin

Hyperplane


Support Vectors

Support Vectors are those input points (vectors) closest to the decision boundary

• They are vectors
• They "support" the decision hyperplane


Support Vectors

The decision hyperplane:

Decision and Margin Function:
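The formulas on this slide did not survive extraction. As an assumption about what was shown, the standard linear SVM formulation is given here, with weight vector $w$, bias $b$, and labels $y_i \in \{-1, +1\}$ (these symbols are the conventional ones, not recovered from the slide):

```latex
\begin{aligned}
\text{Decision hyperplane:} \quad & w^{\top} x + b = 0 \\
\text{Decision function:} \quad & f(x) = \operatorname{sign}\!\left(w^{\top} x + b\right) \\
\text{Support (margin) hyperplanes:} \quad & w^{\top} x + b = \pm 1 \\
\text{Margin constraint:} \quad & y_i \left(w^{\top} x_i + b\right) \ge 1 \quad \text{for all } i
\end{aligned}
```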


How do we represent the size of the margin in terms of w?

There must be at least one point that lies on each support hyperplane.

If not, we could define a larger-margin support hyperplane that does touch the nearest point(s).
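In the usual scaling, where the support hyperplanes are w.x + b = +1 and w.x + b = -1, the margin width works out to 2/||w||, so maximizing the margin is the same as minimizing ||w||. The sketch below checks this numerically; the weight vector, the two points, and the function names are hypothetical values chosen for illustration.

```python
import math


def norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(wi * wi for wi in w))


def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / norm(w)


def distance_to_hyperplane(w, b, x):
    """Unsigned distance from point x to the decision hyperplane w.x + b = 0."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm(w)


# Hypothetical 2-D example: w = (3, 4), b = 0, so ||w|| = 5.
w, b = (3.0, 4.0), 0.0
x_plus = (0.12, 0.16)     # lies on w.x + b = +1
x_minus = (-0.12, -0.16)  # lies on w.x + b = -1

total = distance_to_hyperplane(w, b, x_plus) + distance_to_hyperplane(w, b, x_minus)
print(margin_width(w), total)
```

The two points that touch the support hyperplanes are each 1/||w|| away from the decision hyperplane, which is why their distances add up to the margin width.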


Summary


Summary- Multi Class


libSVM

• An integrated software package for support vector classification
• Supports multi-class classification
• Both C++ and Java sources
• Python, R, MATLAB, Perl, Ruby, Weka, Common LISP, CLISP, Haskell, LabVIEW, and PHP interfaces
• C# .NET code and a CUDA extension are available


libSVM - command line options

svm-train [options] [input]

optimization finished, #iter = 87
nu = 0.471645
obj = -67.299458, rho = 0.203495
nSV = 88, nBSV = 72
Total nSV = 88

svm-predict [input] [model] [output]

Output labels are written to [output]

Accuracy = 83% (83/100) (classification)
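The two commands can also be driven from a script. The sketch below only builds the argument lists. It relies on LIBSVM's `-s` (SVM type) and `-t` (kernel type) options and on LIBSVM's own usage string, which places options before the training file. The helper names and the file names `train.txt`, `train.model`, `test.txt`, and `predictions.txt` are hypothetical.

```python
# Integer codes as listed in LIBSVM's option help.
SVM_TYPES = {"c-svc": 0, "nu-svc": 1, "one-class": 2, "epsilon-svr": 3, "nu-svr": 4}
KERNELS = {"linear": 0, "polynomial": 1, "rbf": 2, "sigmoid": 3}


def svm_train_command(train_file, model_file, svm_type="c-svc", kernel="rbf"):
    """Build the argument list for svm-train (options precede the input file)."""
    return [
        "svm-train",
        "-s", str(SVM_TYPES[svm_type]),
        "-t", str(KERNELS[kernel]),
        train_file,
        model_file,
    ]


def svm_predict_command(test_file, model_file, output_file):
    """Build the argument list for svm-predict."""
    return ["svm-predict", test_file, model_file, output_file]


train_cmd = svm_train_command("train.txt", "train.model", kernel="linear")
predict_cmd = svm_predict_command("test.txt", "train.model", "predictions.txt")
print(" ".join(train_cmd))
print(" ".join(predict_cmd))

# To actually run them (requires the LIBSVM binaries on PATH):
# import subprocess
# subprocess.run(train_cmd, check=True)
```

Keeping the commands as lists rather than a single string avoids shell quoting problems if they are later passed to `subprocess.run`.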


libSVM - Input Format

1 1:0 2:1 3:1 4:239 5:486 6:8 7:8 8:2 9:2 10:7 11:6

1 1:2 2:1 3:3 4: 5:486 6:8 7:8 8:2 9:2 10:7 11:6

1 1:0 2:1 3:5 4:1367 5:335 6:3 7:3 8:2 9:2 10:21 11:72

6 1:182 2:1 3:6 4:1511 5:2957 6:1 7:1 8:2 9:2 10:1 11:3

1 1:0 2:1 3:1 4:239 5:486 6:8 7:8 8:2 9:2 10:7 11:6

6 1:2 2:1 3:3 4: 5:486 6:8 7:8 8:2 9:2 10:7 11:6

1 1:0 2:1 3:5 4:1367 5:335 6:3 7:3 8:2 9:2 10:21 11:72

6 1:182 2:1 3:6 4:1511 5:2957 6:1 7:1 8:2 9:2 10:1 11:3

[label] [index1]:[value1] [index2]:[value2] ...
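A minimal sketch of reading one line in the `[label] [index]:[value] ...` format follows. The function name `parse_libsvm_line` is hypothetical; the sample line is the first record from the slide.

```python
def parse_libsvm_line(line):
    """Parse '[label] [index]:[value] ...' into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        # float("") raises ValueError, so an index with no value is rejected.
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line(
    "1 1:0 2:1 3:1 4:239 5:486 6:8 7:8 8:2 9:2 10:7 11:6"
)
print(label, features[4], len(features))
```

Storing features in a dictionary keyed by index suits the format, since indices that are absent from a line are simply treated as zero-valued (sparse) features. A record with an index but no value, such as `4:` in some of the sample rows above, fails with `ValueError` rather than being silently filled in.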


libSVM - Types

-s <int> Set type of SVM (default: 0)

0 = C-SVC
1 = nu-SVC
2 = one-class SVM
3 = epsilon-SVR
4 = nu-SVR

• C-SVC: binary classification
• One-class SVM: anomaly detection, high-dimensional distribution estimation
• ϵ-Support Vector Regression (ϵ-SVR), nu-SVR: regression


libSVM- kernels

-t <int> Set type of kernel function (default: 2)

○ 0 = linear: u'*v
○ 1 = polynomial: (gamma*u'*v + coef0)^degree
○ 2 = radial basis function: exp(-gamma*|u-v|^2)
○ 3 = sigmoid: tanh(gamma*u'*v + coef0)
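The four kernel formulas can be written out directly. This is a plain-Python sketch of the formulas as listed, not LIBSVM's implementation; the function names and the sample vectors are chosen for illustration.

```python
import math


def dot(u, v):
    """Inner product u'*v."""
    return sum(ui * vi for ui, vi in zip(u, v))


def linear(u, v):
    return dot(u, v)


def polynomial(u, v, gamma, coef0, degree):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf(u, v, gamma):
    # |u - v|^2 is the squared Euclidean distance.
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid(u, v, gamma, coef0):
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = (1.0, 2.0), (3.0, 4.0)
print(linear(u, v))
print(polynomial(u, v, gamma=1.0, coef0=1.0, degree=2))
print(rbf(u, v, gamma=0.5))
print(sigmoid(u, v, gamma=0.1, coef0=0.0))
```

Note that the RBF kernel of a point with itself is always 1 and decays towards 0 as the points move apart, with `gamma` controlling how fast.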


A Cousin - LIBLINEAR

A Library for Large Linear Classification

• The LIBLINEAR package includes a library and command-line tools for the learning task.
• libSVM and LIBLINEAR share similar usage as well as application program interfaces (APIs), hence ease of use.
• Models after training are quite different (in particular, LIBLINEAR stores w in the model, but libSVM does not).
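To make the model difference concrete: a kernel SVM model keeps the support vectors with their coefficients and evaluates sum_i coef_i * K(sv_i, x) - rho, while a linear model can collapse the same information into one weight vector w. The sketch below shows that the two agree for a linear kernel. The support vectors, coefficients, and rho are made-up numbers, not a real model file, and the function names are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def kernel_decision(support_vectors, coefs, rho, x, kernel=dot):
    """Support-vector form: sum_i coef_i * K(sv_i, x) - rho."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coefs)) - rho


def collapse_to_w(support_vectors, coefs):
    """Linear kernel only: w = sum_i coef_i * sv_i."""
    dim = len(support_vectors[0])
    return [sum(c * sv[d] for sv, c in zip(support_vectors, coefs)) for d in range(dim)]


# Illustrative values; coef_i plays the role of alpha_i * y_i.
svs = [(1.0, 2.0), (2.0, 0.0), (0.0, 1.0)]
coefs = [0.5, -0.25, -0.25]
rho = 0.1
x = (3.0, 1.0)

w = collapse_to_w(svs, coefs)
via_support_vectors = kernel_decision(svs, coefs, rho, x)
via_w = dot(w, x) - rho
print(w, via_support_vectors, via_w)
```

With w stored, prediction costs one dot product regardless of how many support vectors there were, which is one reason a linear-only library suits very large data sets.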


When to Use LIBLINEAR and libSVM

Number of instances << number of features: a linear kernel or LIBLINEAR is preferred

Both the number of instances and the number of features are large: LIBLINEAR is preferred

Number of instances >> number of features: non-linear kernels (e.g., RBF) are preferred


Where to find help?

http://www.csie.ntu.edu.tw/~cjlin/libsvm/

http://www.csie.ntu.edu.tw/~cjlin/liblinear/

Documentation, API, tools, datasets


GMM- Gaussian Mixture Models

Gaussian Distribution

Carl Friedrich Gauss introduced the normal distribution in 1809 as a way to rationalize the method of least squares.


GMM

[Figure: Gaussian densities for D=1 and D=2]


What is a Gaussian mixture model?

• Problem:

Given a set of data X = {x1, x2, ..., xN} drawn from an unknown distribution (probably a GMM), estimate the parameters θ of the GMM model that fits the data.

• Solution:

Maximize the likelihood p(X|θ) of the data with regard to the model parameters θ.
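Written out, and assuming the data points are independent draws from a mixture of K Gaussians, the quantity being maximized is the following. The symbols π_k (mixing weights), μ_k (means), Σ_k (covariances), and K are the conventional notation and are not taken from the slides.

```latex
p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

\log p(X \mid \theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k),
\qquad \theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}
```

The sum inside the logarithm is what prevents a closed-form maximizer and motivates an iterative method.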


The Expectation-Maximization algorithm

One of the most popular approaches to maximize the likelihood is to use the Expectation-Maximization (EM) algorithm.

• Basic ideas of the EM algorithm:

- Introduce a hidden variable such that its knowledge would simplify the maximization of the likelihood.

- At each iteration:

• E-Step: Estimate the distribution of the hidden variable given the data and the current value of the parameters.

• M-Step: Maximize the joint distribution of the data and the hidden variable.


The EM for the GMM (graphical view 1)

Hidden variable: for each point, which Gaussian generated it?


The EM for the GMM (graphical view 2)

E-Step: for each point, estimate the probability that each Gaussian generated it.


The EM for the GMM (graphical view 3)

M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data (and the hidden variable).
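The E-step and M-step can be sketched for a one-dimensional mixture in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the data, the initial parameters, the fixed iteration count, and the small variance floor are all choices made for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given parameters."""
    k = len(means)
    for _ in range(n_iter):
        # E-step: probability that each Gaussian generated each point.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate the parameters from those soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)  # effective number of points in component j
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
            weights[j] = nj / len(data)
    return means, variances, weights


# Illustrative data: two well-separated groups around 1 and 5.
data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.2, 4.8, 5.1]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, variances, weights)
```

The hidden variable here is the component that generated each point; the `resp` rows are its estimated distribution for each point.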


ISODATA

What is ISODATA?

Iterative Self-Organizing Data Analysis Technique


Properties

• ISODATA is a method of semi-unsupervised classification
• Don't need to know the number of clusters
• Algorithm splits and merges clusters
• User defines threshold values for parameters
• Algorithm runs for many iterations until the threshold is reached


How does ISODATA work?

• Cluster centers are randomly placed and pixels are assigned to the nearest center (shortest-distance-to-center method)
• The standard deviation within each cluster and the distances between cluster centers are calculated
• Clusters are split if one or more standard deviations are greater than the user-defined threshold
• Clusters are merged if the distance between them is less than the user-defined threshold
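The steps above can be sketched as a single pass over one-dimensional values. This is a deliberately simplified illustration, not the full ISODATA procedure: it takes the current centers as input instead of placing them randomly, splits a cluster into two centers one standard deviation either side of its mean, and merges neighbouring centers at their midpoint. The function name, parameter names, and sample values are hypothetical.

```python
import statistics


def isodata_step(points, centers, split_std, merge_dist, min_size=1):
    """One simplified 1-D ISODATA pass: assign, update, split, merge."""
    # Assign each point to its nearest center.
    clusters = {i: [] for i in range(len(centers))}
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)

    new_centers = []
    for members in clusters.values():
        if not members or len(members) < min_size:
            continue  # drop empty or too-small clusters
        mean = statistics.fmean(members)
        std = statistics.pstdev(members)
        if std > split_std and len(members) >= 2 * min_size:
            # Split: replace the center with two centers on either side of the mean.
            new_centers.extend([mean - std, mean + std])
        else:
            new_centers.append(mean)

    # Merge: fuse neighbouring centers that are closer than merge_dist.
    new_centers.sort()
    merged = []
    for c in new_centers:
        if merged and c - merged[-1] < merge_dist:
            merged[-1] = (merged[-1] + c) / 2.0
        else:
            merged.append(c)
    return merged


points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
first = isodata_step(points, [5.0], split_std=1.0, merge_dist=1.0)
second = isodata_step(points, first, split_std=1.0, merge_dist=1.0)
print(first, second)
```

Starting from a single center, the first pass splits the overly spread cluster in two, and the second pass settles the centers on the two groups, which shows why the number of clusters does not have to be known in advance.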


Visualize


ISODATA- Continued

A second iteration is performed with the new cluster centers

Further iterations are performed until:

• the average inter-center distance falls below the user-defined threshold,
• the average change in the inter-center distance between iterations is less than a threshold, or
• the maximum number of iterations is reached


ALSO!!! Highlights

Clusters associated with fewer than the user-specified minimum number of pixels are eliminated

Lone pixels are either put back in the pool for reclassification, or ignored as “unclassifiable”


Drawbacks of ISODATA

May be time consuming if data is very unstructured

Algorithm can spiral out of control leaving only one class


Advantages of ISODATA

Don’t need to know much about the data beforehand

Little user effort required

ISODATA is very effective at identifying spectral clusters in data


Other Tools

Apache Mahout - Scalable machine learning library
• Improved clustering and classification algorithms

R - with Hadoop plugin
• Implementation of most of the standard algorithms


Challenges

• Sparse Learning in High Dimensions
• Semi-Supervised Learning
• Computation and Risk
• Structured Prediction
• Heavily dependent on PAST!!


Future Scope

Many areas can benefit from statistical learning

More parallel implementations of algorithms
• On massively parallel platforms
• Distributed platforms
• GPU computing


Thank You