Statistical Machine Learning on Large Scale
Data
By
Pramod N
Questions!!? What? Why? Where? How? ML-201
Why would machine learning techniques work?
Data Data Data – Nature of Data
Modelling of different kinds of problems to form relevant data
Size of Data available for Training
What?: Statistical Machine Learning
Exploits the nature of data
A solution to cope with uncertainty
Data is noisy but falls under a distribution
Employs basic probabilistic decisions (Bayes' theorem)
Why?: Use Statistical ML
Models are mature
Mathematical support for optimization
Can handle multi-dimensional data efficiently
Types and Models
SVM (Support Vector Machines): libSVM, LIBLINEAR
GMM (Gaussian Mixture Models): K-means, ISODATA
Support Vector Machines: Mechanism
Developed by Vladimir N. Vapnik as early as 1979; current implementations are based on Corinna Cortes and Vapnik's work in 1995.
Separation by hyperplanes; maximize the margin of classification.
Classification
Choice of boundary
Are these really “equally valid”?
How?: SVMs
[Figure: separating hyperplane with maximum margin]
How can we pick which is best?
Maximize the size of the margin.
[Figure: small margin vs. large margin around the separating hyperplane]
Support Vectors
Support Vectors are those input points (vectors) closest to the decision boundary
They are vectors; they "support" the decision hyperplane.
Support Vectors
The decision hyperplane: $w^\top x + b = 0$
Decision and margin functions: $f(x) = \operatorname{sign}(w^\top x + b)$, with margin (support) hyperplanes $w^\top x + b = \pm 1$
How do we represent the size of the margin in terms of w? There must be at least one point that lies on each support hyperplane.
If not, we could define a larger-margin support hyperplane that does touch the nearest point(s).
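With the support hyperplanes written as $w^\top x + b = \pm 1$ (the standard convention, consistent with the decision function above), the margin width follows directly:

\[
\text{margin} = \frac{(w^\top x_+ + b) - (w^\top x_- + b)}{\|w\|} = \frac{2}{\|w\|},
\]

so maximizing the margin is equivalent to minimizing $\tfrac{1}{2}\|w\|^2$ subject to $y_i (w^\top x_i + b) \ge 1$ for every training pair $(x_i, y_i)$.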
Summary
Summary- Multi Class
libSVM
An integrated software package for support vector classification
Supports multi-class classification
Both C++ and Java sources
Python, R, MATLAB, Perl, Ruby, Weka, Common LISP, CLISP, Haskell, LabVIEW, and PHP interfaces; C# .NET code and a CUDA extension are available.
libSVM- command line options
svm-train [options] [training file]
Example output:
optimization finished, #iter = 87
nu = 0.471645
obj = -67.299458, rho = 0.203495
nSV = 88, nBSV = 72
Total nSV = 88

svm-predict [test file] [model file] [output file]
Predicted labels are written to the output file.
Example output:
Accuracy = 83% (83/100) (classification)
libSVM- Input Format
[label] [index1]:[value1] [index2]:[value2] ...

Example rows:
1 1:0 2:1 3:1 4:239 5:486 6:8 7:8 8:2 9:2 10:7 11:6
1 1:2 2:1 3:3 4: 5:486 6:8 7:8 8:2 9:2 10:7 11:6
1 1:0 2:1 3:5 4:1367 5:335 6:3 7:3 8:2 9:2 10:21 11:72
6 1:182 2:1 3:6 4:1511 5:2957 6:1 7:1 8:2 9:2 10:1 11:3
1 1:0 2:1 3:1 4:239 5:486 6:8 7:8 8:2 9:2 10:7 11:6
6 1:2 2:1 3:3 4: 5:486 6:8 7:8 8:2 9:2 10:7 11:6
1 1:0 2:1 3:5 4:1367 5:335 6:3 7:3 8:2 9:2 10:21 11:72
6 1:182 2:1 3:6 4:1511 5:2957 6:1 7:1 8:2 9:2 10:1 11:3
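The same train/predict flow is available through libSVM's Python bindings. A minimal sketch, assuming the official libsvm Python package is installed and that train.txt and test.txt (placeholder file names) use the format above:

# Minimal libSVM Python sketch; file names are placeholders.
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

y, x = svm_read_problem('train.txt')    # labels and sparse feature dicts
yt, xt = svm_read_problem('test.txt')

# -s 0: C-SVC, -t 2: RBF kernel, -c 1: cost parameter C
model = svm_train(y, x, '-s 0 -t 2 -c 1')

# p_acc holds (accuracy, mean squared error, squared correlation)
p_label, p_acc, p_val = svm_predict(yt, xt, model)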
libSVM- Types
-s <int>: set type of SVM (default: 0)
0 = C-SVC
1 = nu-SVC
2 = one-class SVM
3 = epsilon-SVR
4 = nu-SVR
C-SVC: binary (and multi-class) classification
One-class SVM: anomaly detection, estimating the support of a high-dimensional distribution
epsilon-SVR, nu-SVR: support vector regression
libSVM- kernels
-t <int>: set type of kernel function (default: 2)
○ 0 = linear: u'*v
○ 1 = polynomial: (gamma*u'*v + coef0)^degree
○ 2 = radial basis function: exp(-gamma*|u-v|^2)
○ 3 = sigmoid: tanh(gamma*u'*v + coef0)
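The kernel is selected through the same parameter string passed to svm-train (or svm_train in Python). A small sketch; the gamma, degree, and cost values below are arbitrary placeholders:

from libsvm.svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('train.txt')                 # placeholder file name
rbf = svm_train(y, x, '-s 0 -t 2 -g 0.5 -c 1')       # RBF kernel
poly = svm_train(y, x, '-s 0 -t 1 -d 3 -g 1 -r 1')   # degree-3 polynomial kernel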
A Cousin: LIBLINEAR
A Library for Large Linear Classification
The LIBLINEAR package includes a library and command-line tools for the learning task.
libSVM and LIBLINEAR share similar usage as well as application programming interfaces (APIs), and hence ease of use.
The models after training are quite different (in particular, LIBLINEAR stores w in the model, but libSVM does not).
When to Use LIBLINEAR and libSVM
Number of instances << number of features: a linear kernel (or LIBLINEAR) is preferred
Both the number of instances and the number of features are large: LIBLINEAR is preferred
Number of instances >> number of features: non-linear kernels (e.g., RBF) are preferred
Where to find help?
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
http://www.csie.ntu.edu.tw/~cjlin/liblinear/
Documentation, APIs, tools, datasets
GMM- Gaussian Mixture Models
Gaussian Distribution
Carl Friedrich Gauss introduced the normal distribution in 1809 as a way to rationalize the method of least squares
GMM
[Figure: example Gaussian mixture densities for D=1 and D=2]
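Written out in standard notation, a GMM with $K$ components is a weighted sum of Gaussian densities:

\[
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0,
\]

where $\mathcal{N}(x \mid \mu, \Sigma)$ is the multivariate normal density and the parameters are $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$.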
What is a Gaussian mixture model?
• Problem:
Given a set of data X = {x_1, x_2, ..., x_N} drawn from an unknown distribution (probably a GMM), estimate the parameters θ of the GMM that fits the data.
• Solution:
Maximize the likelihood p(X | θ) of the data with respect to the model parameters.
The Expectation-Maximization algorithm
One of the most popular approaches to maximizing the likelihood is to use the Expectation-Maximization (EM) algorithm.
• Basic ideas of the EM algorithm:
- Introduce a hidden variable such that its knowledge would simplify the maximization of the likelihood.
- At each iteration:
• E-Step: Estimate the distribution of the hidden variable given the data and the current value of the parameters.
• M-Step: Maximize the joint distribution of the data and the hidden variable.
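For a GMM, both steps have closed forms (the standard update equations):

\[
\text{E-step:}\quad
\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
\]

\[
\text{M-step:}\quad
N_k = \sum_{n} \gamma_{nk}, \quad
\mu_k = \frac{1}{N_k} \sum_{n} \gamma_{nk}\, x_n, \quad
\Sigma_k = \frac{1}{N_k} \sum_{n} \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^\top, \quad
\pi_k = \frac{N_k}{N}
\]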
The EM for the GMM (graphical view 1)
Hidden variable: for each point, which Gaussian generated it?
The EM for the GMM (graphical view 2)
E-Step: for each point, estimate the probability that each Gaussian generated it.
The EM for the GMM (graphical view 3)
M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data (and the hidden variable).
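A compact NumPy sketch of this E-step/M-step loop, written to mirror the description above; initialization and stopping rules are simplified, and this is a minimal sketch rather than the code used in the talk:

import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """Fit a K-component full-covariance GMM to X (shape N x D) with EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]                  # means: K random points
    sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)   # shared initial covariance
    pi = np.full(K, 1.0 / K)                                 # uniform mixing weights

    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] proportional to pi_k * N(x_n | mu_k, sigma_k)
        gamma = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(sigma[k])
            norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(sigma[k]))
            gamma[:, k] = pi[k] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: closed-form updates for weights, means, and covariances
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, sigma

# Usage: pi, mu, sigma = em_gmm(data, K=3)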
ISODATA
What is ISODATA? Iterative Self-Organizing Data Analysis Technique
Properties:
ISODATA is a method of semi-unsupervised classification
No need to know the number of clusters in advance
The algorithm splits and merges clusters
The user defines threshold values for the parameters
The algorithm runs for many iterations until a threshold is reached
How does ISODATA work?
Cluster centers are randomly placed, and pixels are assigned to the nearest center.
The standard deviation within each cluster and the distance between cluster centers are calculated.
A cluster is split if one or more of its standard deviations is greater than the user-defined threshold.
Clusters are merged if the distance between them is less than the user-defined threshold.
Visualize
ISODATA- Continued
A second iteration is performed with the new cluster centers.
Further iterations are performed until:
- the average inter-center distance falls below the user-defined threshold,
- the average change in the inter-center distance between iterations is less than a threshold, or
- the maximum number of iterations is reached.
ALSO!!! Highlights
Clusters associated with fewer than the user-specified minimum number of pixels are eliminated
Lone pixels are either put back in the pool for reclassification, or ignored as “unclassifiable”
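A small NumPy sketch of the assign/eliminate/split/merge loop described above; thresholds and the exact split/merge rules vary between implementations, so treat this as one illustrative reading rather than the canonical algorithm:

import numpy as np

def isodata(X, k_init=5, n_iter=20, min_size=10,
            split_std=1.0, merge_dist=0.5, seed=0):
    """Toy ISODATA: assign, drop tiny clusters, split wide ones, merge close ones."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k_init, replace=False)]

    for _ in range(n_iter):
        # Assign every point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)

        # Recompute centers; drop clusters below min_size; split clusters whose
        # largest per-dimension standard deviation exceeds split_std.
        new_centers = []
        for k in range(len(centers)):
            pts = X[labels == k]
            if len(pts) < min_size:
                continue                      # eliminate under-populated clusters
            c, s = pts.mean(axis=0), pts.std(axis=0)
            j = s.argmax()
            if s[j] > split_std:
                e = np.eye(X.shape[1])[j]     # split along the widest dimension
                new_centers += [c + 0.5 * s[j] * e, c - 0.5 * s[j] * e]
            else:
                new_centers.append(c)
        centers = np.array(new_centers)

        # Merge any pair of centers closer than merge_dist.
        merged, used = [], set()
        for i in range(len(centers)):
            if i in used:
                continue
            for j in range(i + 1, len(centers)):
                if j not in used and np.linalg.norm(centers[i] - centers[j]) < merge_dist:
                    merged.append((centers[i] + centers[j]) / 2.0)
                    used.add(j)
                    break
            else:
                merged.append(centers[i])
        centers = np.array(merged)

    return centers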
Drawbacks of ISODATA
May be time-consuming if the data is very unstructured
The algorithm can spiral out of control, leaving only one class
Advantages of ISODATA
Don’t need to know much about the data beforehand
Little user effort is required
ISODATA is very effective at identifying spectral clusters in data
Other Tools
Apache Mahout: a scalable machine learning library with improved clustering and classification algorithms
R with a Hadoop plugin: implementations of most of the standard algorithms
Challenges
Sparse Learning in High Dimensions
Semi-Supervised Learning
Computation and Risk
Structured Prediction
Heavily dependent on the PAST!!
Future Scope
Many areas can benefit from statistical learning
More parallel implementations of algorithms: on massively parallel platforms, on distributed platforms, and with GPU computing
Thank You