Statistical Machine Learning on Large Scale
Data
By
Pramod N
Questions!!? What? Why? Where? How? ML-201
Why would machine learning techniques work?
Data Data Data – Nature of Data
Modelling of different kinds of problems to form relevant data
Size of Data available for Training
What?: Statistical Machine Learning
Exploits the nature of data
A solution to cope with uncertainty
Data is noisy but falls under a distribution
Employs basic probabilistic decisions (Bayes' theorem)
Why?: Use Statistical ML
Models are mature
Mathematical support for optimization
Can handle multi-dimensional data efficiently
Types and Models
SVM (Support Vector Machines): libSVM, LIBLINEAR
GMM (Gaussian Mixture Models): K-means, ISODATA
Support Vector Machines: Mechanism
Developed by Vladimir N. Vapnik as early as 1979; current implementations are based on Corinna Cortes and Vapnik's work in 1995.
Separation by hyperplanes; maximize the margin of classification.
Classification
Choice of boundary
Are these really “equally valid”?
How?: SVMs
[Figure: separating hyperplane with maximum margin]
How can we pick which is best?
Maximize the size of the margin.
[Figure: small margin vs. large margin around the separating hyperplane]
Support Vectors
Support Vectors are those input points (vectors) closest to the decision boundary
They are vectors; they "support" the decision hyperplane.
Support Vectors
The decision hyperplane: $w^\top x + b = 0$
Decision and margin functions: $f(x) = \operatorname{sign}(w^\top x + b)$, with margin (support) hyperplanes $w^\top x + b = \pm 1$
How do we represent the size of the margin in terms of w? There must be at least one point that lies on each support hyperplane.
If not, we could define a larger-margin support hyperplane that does touch the nearest point(s).
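With the support hyperplanes written as $w^\top x + b = \pm 1$ (the standard convention, consistent with the decision function above), the margin width follows directly:

\[
\text{margin} = \frac{(w^\top x_+ + b) - (w^\top x_- + b)}{\|w\|} = \frac{2}{\|w\|},
\]

so maximizing the margin is equivalent to minimizing $\tfrac{1}{2}\|w\|^2$ subject to $y_i (w^\top x_i + b) \ge 1$ for every training pair $(x_i, y_i)$.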
Summary
Summary- Multi Class
libSVM
An integrated software package for support vector classification
Supports multi-class classification
Both C++ and Java sources
Python, R, MATLAB, Perl, Ruby, Weka, Common LISP, CLISP, Haskell, LabVIEW, and PHP interfaces; C# .NET code and a CUDA extension are available.
libSVM- command line options
svm-train [options] [training file]
Example output:
optimization finished, #iter = 87
nu = 0.471645
obj = -67.299458, rho = 0.203495
nSV = 88, nBSV = 72
Total nSV = 88

svm-predict [test file] [model file] [output file]
Predicted labels are written to the output file.
Example output:
Accuracy = 83% (83/100) (classification)
libSVM- Input Format
[label] [index1]:[value1] [index2]:[value2] ...

Example rows:
1 1:0 2:1 3:1 4:239 5:486 6:8 7:8 8:2 9:2 10:7 11:6
1 1:2 2:1 3:3 4: 5:486 6:8 7:8 8:2 9:2 10:7 11:6
1 1:0 2:1 3:5 4:1367 5:335 6:3 7:3 8:2 9:2 10:21 11:72
6 1:182 2:1 3:6 4:1511 5:2957 6:1 7:1 8:2 9:2 10:1 11:3
1 1:0 2:1 3:1 4:239 5:486 6:8 7:8 8:2 9:2 10:7 11:6
6 1:2 2:1 3:3 4: 5:486 6:8 7:8 8:2 9:2 10:7 11:6
1 1:0 2:1 3:5 4:1367 5:335 6:3 7:3 8:2 9:2 10:21 11:72
6 1:182 2:1 3:6 4:1511 5:2957 6:1 7:1 8:2 9:2 10:1 11:3
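The same train/predict flow is available through libSVM's Python bindings. A minimal sketch, assuming the official libsvm Python package is installed and that train.txt and test.txt (placeholder file names) use the format above:

# Minimal libSVM Python sketch; file names are placeholders.
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

y, x = svm_read_problem('train.txt')    # labels and sparse feature dicts
yt, xt = svm_read_problem('test.txt')

# -s 0: C-SVC, -t 2: RBF kernel, -c 1: cost parameter C
model = svm_train(y, x, '-s 0 -t 2 -c 1')

# p_acc holds (accuracy, mean squared error, squared correlation)
p_label, p_acc, p_val = svm_predict(yt, xt, model)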
libSVM- Types
-s <int>: set type of SVM (default: 0)
0 = C-SVC
1 = nu-SVC
2 = one-class SVM
3 = epsilon-SVR
4 = nu-SVR
C-SVC: binary (and multi-class) classification
One-class SVM: anomaly detection, estimating the support of a high-dimensional distribution
epsilon-SVR, nu-SVR: support vector regression
libSVM- kernels
-t <int>: set type of kernel function (default: 2)
○ 0 = linear: u'*v
○ 1 = polynomial: (gamma*u'*v + coef0)^degree
○ 2 = radial basis function: exp(-gamma*|u-v|^2)
○ 3 = sigmoid: tanh(gamma*u'*v + coef0)
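The kernel is selected through the same parameter string passed to svm-train (or svm_train in Python). A small sketch; the gamma, degree, and cost values below are arbitrary placeholders:

from libsvm.svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('train.txt')                 # placeholder file name
rbf = svm_train(y, x, '-s 0 -t 2 -g 0.5 -c 1')       # RBF kernel
poly = svm_train(y, x, '-s 0 -t 1 -d 3 -g 1 -r 1')   # degree-3 polynomial kernel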
A Cousin: LIBLINEAR
A Library for Large Linear Classification
The LIBLINEAR package includes a library and command-line tools for the learning task.
libSVM and LIBLINEAR share similar usage as well as application programming interfaces (APIs), and hence ease of use.
The models after training are quite different (in particular, LIBLINEAR stores w in the model, but libSVM does not).
When to Use LIBLINEAR and libSVM
Number of instances << number of features: a linear kernel (or LIBLINEAR) is preferred
Both the number of instances and the number of features are large: LIBLINEAR is preferred
Number of instances >> number of features: non-linear kernels (e.g., RBF) are preferred
Where to find help?
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
http://www.csie.ntu.edu.tw/~cjlin/liblinear/
Documentation, APIs, tools, datasets
GMM- Gaussian Mixture Models
Gaussian Distribution
Carl Friedrich Gauss introduced the normal distribution in 1809 as a way to rationalize the method of least squares
GMM
[Figure: example Gaussian mixture densities for D=1 and D=2]
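Written out in standard notation, a GMM with $K$ components is a weighted sum of Gaussian densities:

\[
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0,
\]

where $\mathcal{N}(x \mid \mu, \Sigma)$ is the multivariate normal density and the parameters are $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$.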
What is a Gaussian mixture model?
• Problem:
Given a set of data X = {x_1, x_2, ..., x_N} drawn from an unknown distribution (probably a GMM), estimate the parameters θ of the GMM that fits the data.
• Solution:
Maximize the likelihood p(X | θ) of the data with respect to the model parameters.
The Expectation-Maximization algorithm
One of the most popular approaches to maximizing the likelihood is to use the Expectation-Maximization (EM) algorithm.
• Basic ideas of the EM algorithm:
- Introduce a hidden variable such that its knowledge would simplify the maximization of the likelihood.
- At each iteration:
• E-Step: Estimate the distribution of the hidden variable given the data and the current value of the parameters.
• M-Step: Maximize the joint distribution of the data and the hidden variable.
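For a GMM, both steps have closed forms (the standard update equations):

\[
\text{E-step:}\quad
\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
\]

\[
\text{M-step:}\quad
N_k = \sum_{n} \gamma_{nk}, \quad
\mu_k = \frac{1}{N_k} \sum_{n} \gamma_{nk}\, x_n, \quad
\Sigma_k = \frac{1}{N_k} \sum_{n} \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^\top, \quad
\pi_k = \frac{N_k}{N}
\]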
The EM for the GMM (graphical view 1)
Hidden variable: for each point, which Gaussian generated it?
The EM for the GMM (graphical view 2)
E-Step: for each point, estimate the probability that each Gaussian generated it.
The EM for the GMM (graphical view 3)
M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data (and the hidden variable).
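A compact NumPy sketch of this E-step/M-step loop, written to mirror the description above; initialization and stopping rules are simplified, and this is a minimal sketch rather than the code used in the talk:

import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """Fit a K-component full-covariance GMM to X (shape N x D) with EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]                  # means: K random points
    sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)   # shared initial covariance
    pi = np.full(K, 1.0 / K)                                 # uniform mixing weights

    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] proportional to pi_k * N(x_n | mu_k, sigma_k)
        gamma = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(sigma[k])
            norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(sigma[k]))
            gamma[:, k] = pi[k] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: closed-form updates for weights, means, and covariances
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, sigma

# Usage: pi, mu, sigma = em_gmm(data, K=3)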
ISODATA
What is ISODATA? Iterative Self-Organizing Data Analysis Technique
Properties:
ISODATA is a method of semi-unsupervised classification
No need to know the number of clusters in advance
The algorithm splits and merges clusters
The user defines threshold values for the parameters
The algorithm runs for many iterations until a threshold is reached
How does ISODATA work?
Cluster centers are randomly placed, and pixels are assigned to the nearest center.
The standard deviation within each cluster and the distance between cluster centers are calculated.
A cluster is split if one or more of its standard deviations is greater than the user-defined threshold.
Clusters are merged if the distance between them is less than the user-defined threshold.
Visualize
ISODATA- Continued
A second iteration is performed with the new cluster centers.
Further iterations are performed until:
- the average inter-center distance falls below the user-defined threshold,
- the average change in the inter-center distance between iterations is less than a threshold, or
- the maximum number of iterations is reached.
ALSO!!! Highlights
Clusters associated with fewer than the user-specified minimum number of pixels are eliminated
Lone pixels are either put back in the pool for reclassification, or ignored as “unclassifiable”
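A small NumPy sketch of the assign/eliminate/split/merge loop described above; thresholds and the exact split/merge rules vary between implementations, so treat this as one illustrative reading rather than the canonical algorithm:

import numpy as np

def isodata(X, k_init=5, n_iter=20, min_size=10,
            split_std=1.0, merge_dist=0.5, seed=0):
    """Toy ISODATA: assign, drop tiny clusters, split wide ones, merge close ones."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k_init, replace=False)]

    for _ in range(n_iter):
        # Assign every point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)

        # Recompute centers; drop clusters below min_size; split clusters whose
        # largest per-dimension standard deviation exceeds split_std.
        new_centers = []
        for k in range(len(centers)):
            pts = X[labels == k]
            if len(pts) < min_size:
                continue                      # eliminate under-populated clusters
            c, s = pts.mean(axis=0), pts.std(axis=0)
            j = s.argmax()
            if s[j] > split_std:
                e = np.eye(X.shape[1])[j]     # split along the widest dimension
                new_centers += [c + 0.5 * s[j] * e, c - 0.5 * s[j] * e]
            else:
                new_centers.append(c)
        centers = np.array(new_centers)

        # Merge any pair of centers closer than merge_dist.
        merged, used = [], set()
        for i in range(len(centers)):
            if i in used:
                continue
            for j in range(i + 1, len(centers)):
                if j not in used and np.linalg.norm(centers[i] - centers[j]) < merge_dist:
                    merged.append((centers[i] + centers[j]) / 2.0)
                    used.add(j)
                    break
            else:
                merged.append(centers[i])
        centers = np.array(merged)

    return centers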
Drawbacks of ISODATA
May be time-consuming if the data is very unstructured
The algorithm can spiral out of control, leaving only one class
Advantages of ISODATA
Don’t need to know much about the data beforehand
Little user effort is required
ISODATA is very effective at identifying spectral clusters in data
Other Tools
Apache Mahout: a scalable machine learning library with improved clustering and classification algorithms
R with a Hadoop plugin: implementations of most of the standard algorithms
Challenges
Sparse Learning in High Dimensions
Semi-Supervised Learning
Computation and Risk
Structured Prediction
Heavily dependent on the PAST!!
Future Scope
Many areas can benefit from statistical learning
More parallel implementations of algorithms: on massively parallel platforms, on distributed platforms, and with GPU computing
Thank You