Silicon Valley Code Camp 2013

Something about Data

Sanjeev Mishra Chris Bedford

Acknowledgement

● Bing for free images
● Machine Learning in Action (Peter Harrington)
● Wikipedia

Did you know that?

What about these?

I guess you have heard of

● Siri or Google Now
● IBM Watson
● IBM Deep Blue
● Google Translate
● WolframAlpha

The Big Picture

What is Learning?

Definition: The acquisition of knowledge or skills through experience, study, or by being taught.

[Diagram: knowledge, reasoning, deduction]

What is Machine Learning?

"Field of study that gives computers the ability to learn without being explicitly programmed." (attributed to Arthur Samuel)

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Tom Mitchell)

Data Mining

● Computational process of discovering patterns in large data sets
  ○ Structured or unstructured data
  ○ Patterns must be: valid, novel, potentially useful, understandable
    ■ 80% of customers who buy cheese and milk also buy bread, and 5% of customers buy all of them together
    ■ Correlation among variables: positive or negative
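The cheese, milk, and bread pattern above is an association rule: the 80% figure is usually called the rule's confidence and the 5% figure its support. A minimal sketch of how the two numbers could be computed, assuming a tiny made-up list of shopping baskets (the baskets and the function names `support` and `confidence` are illustrative, not from the talk):

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical baskets, invented purely for illustration.
baskets = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk", "bread"},
    {"cheese", "milk"},
    {"milk", "bread"},
    {"eggs"},
]

rule_support = support(baskets, {"cheese", "milk", "bread"})
rule_confidence = confidence(baskets, {"cheese", "milk"}, {"bread"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is typically reported only when both numbers clear chosen thresholds, which is one way to make "valid" and "potentially useful" concrete.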

Types of Machine Learning

Unsupervised: learn the patterns in data
● no labeled training data
● face detection in a set of images
● group objects based on some similarity
● clustering (nominal data)
● density estimation (numeric data)

Supervised: predict or forecast something
● labeled training data
● recognize a face in a set of images
● given an object, predict the type
● classification (nominal data)
● regression or curve fitting (numeric data)

Clustering

Clustering using k-Means
● Input
  ○ M (set of points)
  ○ k (number of clusters)
● Output
  ○ k cluster centroids $c_1, \dots, c_k$ ($c_i$ is the centroid of all $x_j \in S_i$)
● Approach
  ○ Minimize the squared error function

    $$J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2$$

    where $\lVert x_j - c_i \rVert^2$ is a chosen distance measure between a data point $x_j$ and its cluster centre $c_i$, and $J$ is an indicator of the distance of the n data points from their respective cluster centres.

k-Means

create k points for starting centroids (random)
while any point has changed cluster assignment:
    for every point in our dataset:
        for every centroid:
            calculate the distance between the centroid and the point
        assign the point to the cluster with the lowest distance
    for every cluster:
        calculate the mean of the points in that cluster
        assign the centroid to the mean
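The pseudocode above can be sketched in plain Python. This is a minimal illustration, not the speakers' implementation: it assumes Euclidean distance, seeds the centroids with k randomly chosen data points (one common way to pick "random" starting centroids), and uses a hypothetical six-point data set.

```python
import math
import random


def kmeans(points, k, seed=0):
    """Cluster `points` (tuples of numbers) into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points rather than arbitrary coordinates.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignments = [None] * len(points)
    changed = True
    while changed:  # loop while any point has changed cluster assignment
        changed = False
        # Assignment step: attach every point to its nearest centroid.
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            if assignments[i] != nearest:
                assignments[i] = nearest
                changed = True
        # Update step: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave the centroid in place if its cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Which group ends up as cluster 0 depends on the random seed, so code that consumes the labels should compare memberships rather than label values.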

Clustering Demo

http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html


k-Means

Pros
● Easy to implement
● Fast on small datasets

Cons
● A priori knowledge of k required
● Slow on very large datasets
● Sensitive to outliers
● Can converge to local minima

k-Means (wrong k)

[Figures: clustering results for K = 4 and K = 3]

Improving K-means

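● Bisecting K-means
  ○ Start with all points in a single cluster
  ○ Choose the cluster with the largest SSE (sum of squared errors)
  ○ Split it in two with k-means (k = 2), and repeat until there are k clusters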
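A rough sketch of bisecting k-means in plain Python follows. It is an illustration under stated assumptions rather than a reference implementation: the 2-way split runs a fixed number of k-means rounds, starts from two randomly chosen data points, and uses Euclidean distance; the helper names and the sample points are made up.

```python
import math
import random


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(math.dist(p, center) ** 2 for p in points)


def two_means(points, rng, rounds=20):
    """Split `points` in two with a fixed number of k-means rounds (k = 2)."""
    centers = rng.sample(points, 2)  # assumption: seed with two data points
    groups = [[], []]
    for _ in range(rounds):
        groups = [[], []]
        for p in points:
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            groups[nearer].append(p)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return groups


def bisecting_kmeans(points, k, seed=0):
    """Return a list of k clusters (lists of points), or fewer if no split is possible."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Choose the cluster with the largest SSE among those that can be split.
        candidates = [i for i, c in enumerate(clusters) if len(c) >= 2]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: sse(clusters[i]))
        left, right = two_means(clusters[worst], rng)
        if not left or not right:
            break  # degenerate split (for example duplicate points); stop early
        clusters[worst:worst + 1] = [left, right]
    return clusters


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
for cluster in bisecting_kmeans(points, k=3):
    print(cluster)
```

Because every step only ever splits one existing cluster, a bad random start affects a single split rather than the whole partition, which is the motivation for using it to reduce the local-minimum problem.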

Supervised Learning: Linear Regression

Attempts to find a mathematical (linear) function that can approximate the relationship between a set of one or more input variables and what is called a response variable.

Example: A web site for amusement park X

* Interested in offering ride coupons
* Rides have height requirements
* Avoid issuing coupons for a ride if the user is too short
* Most users sign up from Facebook, so we have their ages
* So: we use age to predict height
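The age-to-height idea can be sketched as a one-variable least-squares fit. Everything concrete below is an assumption made for illustration: the (age, height) pairs are invented, the 120 cm ride requirement is hypothetical, and `fit_line` and `tall_enough` are names chosen for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up (age in years, height in cm) pairs, for illustration only.
ages = [4, 6, 8, 10, 12]
heights_cm = [102, 115, 128, 138, 150]

intercept, slope = fit_line(ages, heights_cm)

MIN_RIDE_HEIGHT_CM = 120  # hypothetical ride requirement


def tall_enough(age):
    """Predict height from age and compare it with the ride requirement."""
    return intercept + slope * age >= MIN_RIDE_HEIGHT_CM


print(tall_enough(5), tall_enough(9))
```

A straight line is only a reasonable approximation over the age range of the training data; heights stop growing with age, so predictions for adults would need a different model or a cap.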


A more complex data set: two input variables.

sqFt,bathrooms,priceInThousands
1200,1,750
1250,2,900
2000,2.5,1500
1800,2,1200
1000,1.5,700
1800,3,1400
1100,1.5,800
2200,3,1700
1250,1.5,850
1300,2,1100

Our previous example had a one-dimensional set of input variables; now we have a two-dimensional set: for each two-tuple consisting of numBathrooms and squareFeet we have the selling price of the corresponding home. From this training data, we create a model that fits a "plane of best fit". Given a new two-tuple [numBathrooms (x), squareFeet (y)], our model predicts the point on the plane that denotes the most likely selling price for a house with those attributes.
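One way to compute such a plane without a statistics library is to solve the normal equations for the three coefficients. The sketch below uses the ten training rows from the data set above; the helper names `solve`, `fit_plane`, and `predict` are assumptions of this example, and a production system would more likely rely on a tested library routine.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_plane(rows):
    """Least-squares fit of price = b0 + b1*sqFt + b2*bathrooms via the normal equations."""
    xs = [[1.0, sq_ft, baths] for sq_ft, baths, _ in rows]
    ys = [price for _, _, price in rows]
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(3)] for i in range(3)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(3)]
    return solve(xtx, xty)


# Training rows from the data set above: (sqFt, bathrooms, priceInThousands).
houses = [
    (1200, 1, 750), (1250, 2, 900), (2000, 2.5, 1500), (1800, 2, 1200),
    (1000, 1.5, 700), (1800, 3, 1400), (1100, 1.5, 800), (2200, 3, 1700),
    (1250, 1.5, 850), (1300, 2, 1100),
]

b0, b1, b2 = fit_plane(houses)


def predict(sq_ft, baths):
    """Price (in thousands) read off the fitted plane."""
    return b0 + b1 * sq_ft + b2 * baths


print(predict(1600, 3))
```

The coefficients b1 and b2 are the slopes of the plane along the sqFt and bathrooms axes, and b0 is where the plane meets the price axis.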



For a one-dimensional set of input variables we had a line of best fit; for a two-dimensional set we have a plane of best fit. Here's what our plane looks like:

http://www.youtube.com/watch?v=S2xDeO9dJ7U

Why Use R ?

Many data scientists use R, due to

- extensive, well-tested libraries of statistical and mathematical functions
- math-friendly syntax
- excellent support for charting and plotting functions
- active user community to provide support

R skills are valuable for big data engineers, since:

- data scientists we work with will often develop their models using R
- significant effort is required to translate such models to Java, C++, etc.

So: useful not only to understand R, but also to be able to invoke R from your native language

R code for 2 dimensional model

values <- read.csv(filePath)
model <- lm(priceInThousands ~ sqFt + bathrooms, data=values)

# predict new value
## set up 'data frame'
newdata <- data.frame(sqFt=1600, bathrooms=3)
## invoke prediction function
predict(model, newdata)

Notes on the code above:

- The CSV file is in the same format we saw in the intro slide on linear regression.
- lm is R's linear model creation function.
- In the formula priceInThousands ~ sqFt + bathrooms, priceInThousands is the response variable; sqFt and bathrooms are the input (independent) variables.
- predict returns the most likely selling price using model 'model' and the data frame that wraps the variables sqFt (1600) and bathrooms (3).

Calling R from Java

import org.rosuda.JRI.REXP
import org.rosuda.JRI.Rengine

// Note: this listing uses Groovy syntax (triple-quoted strings, $variable
// interpolation, optional semicolons) on top of the JRI Java API.
class RegressionModelExecutor {

    // Current R session (only one per JVM,
    // since rJava is not multi-threaded).
    Rengine rengine = null

    RegressionModelExecutor(String inputDataPath) {
        String[] engineArgs = new String[1]
        engineArgs[0] = "--vanilla"
        rengine = new Rengine(engineArgs, false, null)

        // Each R statement stays on a single line, because
        // evaluateScript feeds the script to R line by line.
        String script = """
            values = read.csv('$inputDataPath')
            newModel.lm = lm(priceInThousands ~ sqFt + bathrooms, data=values)
        """
        evaluateScript(script)    // initialize model
    }

    void shutdown() {
        rengine.end()
    }

    // Apply model 'newModel.lm' to predict price of a house
    // with given values for squareFeet and numBathrooms.
    double predictInstance(int sqft, float baths) {
        rengine.eval("newdata = data.frame(sqFt=$sqft, bathrooms=$baths)")
        REXP result = rengine.eval("predict(newModel.lm, newdata)")
        return result.asDouble()
    }

    // Evaluate block of R expressions, taking into account
    // the fact that Rengine only executes one statement at
    // a time. Unconditionally dumps out lines before executing
    // the script so that if anything goes wrong we can copy
    // paste the constructed output (scriptLines) directly
    // into an R session.
    void evaluateScript(String scriptLines) {
        println("evaluating: \n$scriptLines")
        for (String line : scriptLines.split("\n")) {
            if (line.trim()) {    // skip blank lines
                rengine.eval(line.trim())
            }
        }
    }
}

Calling R from Java

More detailed article on R/Java:

http://buildlackey.com/integrating-r-and-java-with-jrirjava-a-jni-based-bridge/

How is linear regression machine learning?

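A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

For the housing model, T is predicting a selling price, E is the training data (rows of sqFt, bathrooms, and price), and P is how closely the predicted prices match the actual ones.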


LEARN MORE:

KHAN ACADEMY https://www.khanacademy.org/

COURSERA: Coding the Matrix Course (Linear Algebra) http://www.youtube.com/watch?v=IWugXcWpfoM

MIT Open Courseware Linear Algebra Course http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/index.htm


Software and Tools

● Apache Mahout (http://mahout.apache.org/): Java, Apache
● http://prediction.io/ (Machine learning server)
● Weka (http://www.cs.waikato.ac.nz/ml/weka/): Java, GPL
● OpenNLP (http://opennlp.apache.org/): Java, Apache
● Stanford NLP (http://nlp.stanford.edu/software/): Java, GPL
● Scikit-learn (http://scikit-learn.org/stable/): Python, BSD
● mlpy (http://mlpy.sourceforge.net/): Python, GPL
● NLTK (http://nltk.org/): Python, Apache
● http://www.alchemyapi.com/

Tools

R, Matlab, Octave

http://mloss.org/software/
http://sourceforge.net/directory/science-engineering/ai/machinelearning/os:linux/freshness:recently-updated/


Courses and other materials

● Coursera (http://www.coursera.org/):
  ○ machine learning
  ○ natural language processing
  ○ neural networks
● Udacity (https://www.udacity.com/courses)
  ○ artificial intelligence
● http://cs229.stanford.edu/materials.html
● http://www.ai.mit.edu/courses/6.867-f03/lectures.html
● wikipedia.org


