silicon valleycodecamp2013

Something about Data
Sanjeev Mishra, Chris Bedford

Upload: sanjeev-mishra

Post on 27-Jan-2015

Category:

Technology


DESCRIPTION

A presentation on machine learning, both supervised and unsupervised, with examples of machine learning that we see every day.

TRANSCRIPT

Page 1: Silicon valleycodecamp2013

Something about Data

Sanjeev Mishra Chris Bedford

Page 2: Silicon valleycodecamp2013

Acknowledgement

● Bing for free images
● Machine Learning in Action (Peter Harrington)
● Wikipedia

Page 3: Silicon valleycodecamp2013

Did you know that?

Page 4: Silicon valleycodecamp2013

What about these?

Page 5: Silicon valleycodecamp2013

What about these?

Page 6: Silicon valleycodecamp2013

I guess you have heard of

● Siri or Google Now
● IBM Watson
● IBM Deep Blue
● Google Translate
● WolframAlpha

Page 7: Silicon valleycodecamp2013

The Big Picture

Page 8: Silicon valleycodecamp2013

What is Learning

Definition: The acquisition of knowledge or skills through experience, study, or by being taught.

[Diagram labels: knowledge, reasoning, deduction]

Page 9: Silicon valleycodecamp2013

What is Machine Learning

Field of study that gives computers the ability to learn without being explicitly programmed

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E

Page 10: Silicon valleycodecamp2013

Data Mining

● Computational process of discovering patterns in large data sets
  ○ Structured or unstructured data
  ○ Patterns must be: valid, novel, potentially useful, understandable
    ■ 80% of customers who buy cheese and milk also buy bread, and 5% of customers buy all of them together
    ■ Correlation among variables: positive or negative
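The cheese, milk and bread pattern above reads naturally as an association rule: the 80% figure is usually called the rule's confidence and the 5% figure its support. The sketch below shows how those two numbers could be computed. The baskets are invented for illustration, and the function names `support` and `confidence` are hypothetical, not taken from the talk.

```python
# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk", "bread", "eggs"},
    {"cheese", "milk"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"cheese", "milk", "bread"},
    {"eggs"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)


# "Customers who buy cheese and milk also buy bread"
rule_support = support({"cheese", "milk", "bread"}, baskets)
rule_confidence = confidence({"cheese", "milk"}, {"bread"}, baskets)
print(f"support = {rule_support:.3f}, confidence = {rule_confidence:.3f}")
# prints: support = 0.375, confidence = 0.750
```

In this toy data 3 of the 8 baskets contain all three items, and 3 of the 4 baskets with cheese and milk also contain bread.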

Page 11: Silicon valleycodecamp2013

Types of Machine Learning

Unsupervised: learn the patterns in data
● no training
● face detection in a set of images
● group objects based on some similarity
● clustering (nominal data)
● density estimation (numeric data)

Supervised: predict or forecast something
● training
● recognize a face in a set of images
● given an object, predict the type
● classification (nominal data)
● regression or curve fitting (numeric data)

Page 12: Silicon valleycodecamp2013

Clustering

Page 13: Silicon valleycodecamp2013

Clustering using k-Means

● Input
  ○ M (set of points)
  ○ k (number of clusters)
● Output
  ○ k cluster centroids $c_1, \dots, c_k$ ($c_i$ is the centroid of all $x_j \in S_i$)
● Approach
  ○ Minimizing the squared error function

$$J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2$$

where $\lVert x_j - c_i \rVert^2$ is a chosen distance measure between a data point $x_j$ and its cluster centre $c_i$, and $J$ is an indicator of the distance of the n data points from their respective cluster centres.
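A short derivation of why the mean is the natural choice of cluster centre, using the slide's notation ($S_i$ for the points assigned to cluster $i$, $c_i$ for its centre). It assumes the distance measure is squared Euclidean distance; this is a standard result and is not shown on the slide itself.

```latex
J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2

% Holding the assignments S_i fixed, set the gradient with respect to c_i to zero:
\frac{\partial J}{\partial c_i}
  = -2 \sum_{x_j \in S_i} (x_j - c_i) = 0
\quad\Longrightarrow\quad
c_i = \frac{1}{\lvert S_i \rvert} \sum_{x_j \in S_i} x_j
```

So, for a fixed assignment of points to clusters, moving each centre to the mean of its points is the step that minimizes $J$.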

Page 14: Silicon valleycodecamp2013

k-Means

create k points for starting centroids (random)
while any point has changed cluster assignment
    for every point in our dataset:
        for every centroid
            calculate the distance between the centroid and point
        assign the point to the cluster with the lowest distance
    for every cluster
        calculate the mean of the points in that cluster
        assign the centroid to the mean
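The pseudocode above can be sketched in plain Python as follows. This is a minimal illustration, not the demo code from the talk: it assumes squared Euclidean distance, starts from k randomly chosen data points (one of several possible initializations), and keeps a centroid unchanged if its cluster ever becomes empty. The sample points and the helper names are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0):
    rng = random.Random(seed)
    # Assumption: use k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    assignment = [None] * len(points)
    changed = True
    while changed:  # loop until no point changes cluster
        changed = False
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            if nearest != assignment[i]:
                assignment[i] = nearest
                changed = True
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, assignment


def sse(points, centroids, assignment):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment = k_means(points, k=2)
total_error = sse(points, centroids, assignment)
print(centroids, assignment, total_error)
```

The cluster labels (0 or 1) depend on the random start, so only the grouping is meaningful, not which group receives which label.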

Clustering Demo

Page 15: Silicon valleycodecamp2013

k-Means

Pros
● Easy to implement
● Fast on small datasets

Cons
● A priori knowledge of K
● Slow on very large datasets
● Sensitive to outliers
● Can converge to local minima
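A small sketch of the outlier sensitivity listed above: because each centroid is a mean, a single distant point can drag it far away from the bulk of its cluster. The points below are invented for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


# Hypothetical cluster, with and without one extreme point.
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
with_outlier = cluster + [(100.0, 100.0)]

clean_center = centroid(cluster)
shifted_center = centroid(with_outlier)
print(clean_center)    # (2.0, 2.0)
print(shifted_center)  # (26.5, 26.5)
```

The shifted centroid is no longer close to any of the three original points.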

Page 16: Silicon valleycodecamp2013

k-Means (wrong k)

[Figures: the same data clustered with K = 4 and with K = 3]

Page 17: Silicon valleycodecamp2013

Improving K-means

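● Bisecting K-means
  ○ Start with all points in a single cluster
  ○ Choose the cluster with the largest SSE (sum of squared errors)
  ○ Split it in two, and repeat until there are k clusters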
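A minimal sketch of the bisecting idea, assuming squared Euclidean distance and a plain 2-means split of whichever cluster currently has the largest SSE. The helper names and sample points are hypothetical, and practical implementations often try several splits and keep the best one, which this sketch does not do.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def cluster_sse(points):
    """Sum of squared distances from each point to the cluster's own centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, max_iter=100):
    """Split `points` into two groups with plain 2-means; returns the last valid split."""
    centers = rng.sample(points, 2)
    left, right = [], []
    for _ in range(max_iter):
        new_left = [p for p in points if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        new_right = [p for p in points if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not new_left or not new_right:
            break  # degenerate split: keep the previous one
        left, right = new_left, new_right
        new_centers = [centroid(left), centroid(right)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return left, right


def bisecting_k_means(points, k, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)  # cluster with the largest SSE
        if len(worst) < 2:
            break
        left, right = two_means(worst, rng)
        if not left or not right:
            break  # could not split (for example, all points identical)
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters


# Hypothetical data: three loose groups of three distinct points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
          (1.0, 8.0), (0.5, 9.0), (1.5, 8.5)]
clusters = bisecting_k_means(points, k=3)
for c in clusters:
    print(sorted(c))
```

Each split can only lower the total SSE, since every half is measured against its own centroid rather than the parent's.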

Page 18: Silicon valleycodecamp2013

Supervised Learning: Linear Regression

Attempts to find a mathematical (linear) function that can approximate the relationship between a set of one or more input variables and what is called a response variable.

Example: A web site for amusement park X

* Interested in offering ride coupons
* Rides have height requirements
* Avoid issuing coupons for a ride if the user is too short
* Most users sign up from Facebook, so we have their ages.
* So: we use age to predict height.
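The amusement park example above could be sketched with an ordinary least-squares line. Everything concrete here is an assumption made for illustration: the age and height values are invented, the 120 cm requirement is hypothetical, and `fit_line` and `should_offer_coupon` are not names from the talk.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training data: age in years, height in centimetres.
ages = [4, 6, 8, 10, 12, 14]
heights_cm = [102, 115, 128, 139, 150, 163]

slope, intercept = fit_line(ages, heights_cm)

MIN_RIDE_HEIGHT_CM = 120  # hypothetical height requirement for one ride


def should_offer_coupon(age):
    """Offer the coupon only if the predicted height meets the requirement."""
    predicted_height = slope * age + intercept
    return predicted_height >= MIN_RIDE_HEIGHT_CM


print(should_offer_coupon(5), should_offer_coupon(9))
```

Age is the single input variable and height is the response variable; the coupon decision is just a threshold applied to the prediction.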

Page 19: Silicon valleycodecamp2013

Supervised Learning: Linear Regression

Page 20: Silicon valleycodecamp2013

Supervised Learning: Linear Regression

Page 21: Silicon valleycodecamp2013

Supervised Learning: Linear Regression

A more complex data set: two input variables.

sqFt,bathrooms,priceInThousands
1200,1,750
1250,2,900
2000,2.5,1500
1800,2,1200
1000,1.5,700
1800,3,1400
1100,1.5,800
2200,3,1700
1250,1.5,850
1300,2,1100

Our previous example had a one-dimensional set of input variables; now we have a two-dimensional set: for each two-tuple consisting of numBathrooms and squareFeet we have the selling price of the corresponding home. From this training data, we create a model that fits a “plane of best fit”. Given a new two-tuple [ numBathrooms-x, squareFeet-y ], our model predicts the point on the plane that denotes the most likely selling price for a house with those attributes.
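As a rough illustration of fitting such a plane, the sketch below solves the least-squares normal equations for the ten houses listed on this slide. It is only one way to do the fit and makes no claim about how any particular statistics package does it internally. The function names are hypothetical, and the query house (1600 square feet, 3 bathrooms) is an arbitrary choice.

```python
import csv
import io

# The training data from the slide.
CSV_TEXT = """sqFt,bathrooms,priceInThousands
1200,1,750
1250,2,900
2000,2.5,1500
1800,2,1200
1000,1.5,700
1800,3,1400
1100,1.5,800
2200,3,1700
1250,1.5,850
1300,2,1100
"""


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_plane(rows):
    """rows holds (x1, x2, y) tuples; returns [intercept, b1, b2] minimizing squared error."""
    design = [[1.0, x1, x2] for x1, x2, _ in rows]
    targets = [y for _, _, y in rows]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def predict_price(coeffs, sq_ft, bathrooms):
    return coeffs[0] + coeffs[1] * sq_ft + coeffs[2] * bathrooms


rows = [(float(r["sqFt"]), float(r["bathrooms"]), float(r["priceInThousands"]))
        for r in csv.DictReader(io.StringIO(CSV_TEXT))]
coeffs = fit_plane(rows)
print("intercept, per-sqFt, per-bathroom:", coeffs)
print("predicted price (thousands):", predict_price(coeffs, 1600, 3))
```

The three coefficients define the plane: an intercept plus one slope per input variable.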

FOR SALE

Page 22: Silicon valleycodecamp2013

Supervised Learning: Linear Regression

For a one-dimensional set of input variables we had a line of best fit; for a two-dimensional set, we have a plane of best fit. Here’s what our plane looks like.

Page 23: Silicon valleycodecamp2013

Why Use R ?

Many data scientists use R, due to

- extensive, well-tested libraries of statistical and mathematical functions
- math-friendly syntax
- excellent support for charting and plotting functions
- active user community to provide support

R skills are valuable for big data engineers, since:

- data scientists we work with will often develop their models using R
- significant effort is required to translate such models to Java, C++, etc.

So it is useful not only to understand R, but also to be able to invoke R from your own programming language.

Page 24: Silicon valleycodecamp2013

R code for 2 dimensional model

values <- read.csv(filePath)
model <- lm(priceInThousands ~ sqFt + bathrooms, data=values)

# predict new value
## set up 'data frame'
newdata <- data.frame(sqFt=1600, bathrooms=3)
## invoke prediction function
predict(model, newdata)

Notes on the code:

- read.csv(filePath): the csv file is in the same format we saw in the intro slide on linear regression.
- lm: R’s linear model creation function.
- priceInThousands: the response variable.
- sqFt + bathrooms: the input (independent) variables.
- predict(model, newdata): predicts the most likely selling price using the model ‘model’ and the data frame that wraps the variables sqFt (1600) and bathrooms (3).

Page 25: Silicon valleycodecamp2013

Calling R from Java

import org.rosuda.JRI.REXP;
import org.rosuda.JRI.Rengine;

// Note: the triple-quoted strings, $-interpolation and println below are
// Groovy syntax, so this class must be compiled as Groovy on the JVM.
class RegressionModelExecutor {

    // Current R session (only one per JVM,
    // since rJava is not multi-threaded).
    Rengine rengine = null

    RegressionModelExecutor(String inputDataPath) {
        String[] engineArgs = new String[1];
        engineArgs[0] = "--vanilla";
        rengine = new Rengine(engineArgs, false, null);

        // Keep every R statement on a single line: evaluateScript()
        // sends the script to R one line at a time.
        String script = """
            values = read.csv('$inputDataPath')
            newModel.lm = lm(priceInThousands ~ sqFt + bathrooms, data=values)"""
        evaluateScript(script)    // initialize model
    }

    public void shutdown() {
        rengine.end();
    }

    // Apply model 'newModel.lm' to predict price of a house
    // with given values for squareFeet and numBathrooms.
    public double predictInstance(int sqft, float baths) {
        rengine.eval("newdata = data.frame(sqFt=$sqft, bathrooms=$baths)")
        REXP result = rengine.eval("predict(newModel.lm, newdata)")
        return result.asDouble()
    }

    // Evaluate block of R expressions, taking into account
    // the fact that Rengine only executes one statement at
    // a time. Unconditionally dumps out lines before executing
    // the script so that if anything goes wrong we can copy
    // paste the constructed output (scriptLines) directly
    // into an R session.
    public void evaluateScript(String scriptLines) {
        println("evaluating: \n$scriptLines")
        for (String line : scriptLines.split("\n")) {
            rengine.eval(line)
        }
    }
}

Page 26: Silicon valleycodecamp2013

Calling R from Java

More detailed article on R/Java:

http://buildlackey.com/integrating-r-and-java-with-jrirjava-a-jni-based-bridge/

Page 27: Silicon valleycodecamp2013

How is linear regression machine learning?

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E

Page 28: Silicon valleycodecamp2013

Supervised Learning: Linear Regression

LEARN MORE:

KHAN ACADEMY https://www.khanacademy.org/

COURSERA: Coding the Matrix Course (Linear Algebra) http://www.youtube.com/watch?v=IWugXcWpfoM

MIT Open Courseware Linear Algebra Course http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/index.htm

Page 29: Silicon valleycodecamp2013

Software and Tools

● Apache Mahout (http://mahout.apache.org/): Java, Apache
● http://prediction.io/ (Machine learning server)
● Weka (http://www.cs.waikato.ac.nz/ml/weka/): Java, GPL
● OpenNLP (http://opennlp.apache.org/): Java, Apache
● Stanford NLP (http://nlp.stanford.edu/software/): Java, GPL
● Scikit-learn (http://scikit-learn.org/stable/): Python, BSD
● mlpy (http://mlpy.sourceforge.net/): Python, GPL
● NLTK (http://nltk.org/): Python, Apache
● http://www.alchemyapi.com/

Tools

R, Matlab, Octave

http://mloss.org/software/
http://sourceforge.net/directory/science-engineering/ai/machinelearning/os:linux/freshness:recently-updated/

Page 30: Silicon valleycodecamp2013

Courses and other materials

● Coursera (http://www.coursera.org/):
  ○ machine learning
  ○ natural language processing
  ○ neural networks
● Udacity (https://www.udacity.com/courses)
  ○ artificial intelligence
● http://cs229.stanford.edu/materials.html
● http://www.ai.mit.edu/courses/6.867-f03/lectures.html
● wikipedia.org