Introduction to Machine Learning with Apache Spark! Spark Meetup, 12.03.2015, Marko Velić PhD

Upload: marko-velic

Post on 28-Jul-2015


Lecturer

• 2014 – PhD in machine learning, Faculty of Organisation and Informatics, Varazdin, UNIZG

• A dozen papers and projects, and two patents pending, in machine learning

• Work experience:
  • 2015 – Data Lab: consulting, „Data Science” and machine learning for some of the biggest companies (both Croatian and global)
  • Currently establishing the Big Data department at Styria group
  • 2013-2015 – University Computing Centre, head of the data analysis department
  • 2007-2013 – CEO of a small development company
  • Since 2011 – Lecturer at Algebra University (C++, ML etc.)

• Interests: artificial intelligence, machine learning, computer vision, deep learning

Survey – Your experience with ML?

• Used/developed in commercial projects

• Used/developed in academia

• Trying it out on my own

• Have never used it

• Have never heard of it

How do they do it?

Content

• What is AI?

• What is ML?

• Learning types

• Variable types

• Spark MLlib and ML

• Naive Bayes

• Model testing

• Demo

• Where to learn ML? What’s next?

What is AI?

AI

Heuristics

Rules + Logic

Fuzzy Logic

Machine Learning

What is ML?

Information Theory

Statistics, Probability, Mathematics

Software Engineering

Learning types

• Supervised
  • Class is known
  • Learning from experience

• Unsupervised
  • Class is unknown
  • Grouping (searching for) similar points

Terminology

Synonyms in Croatian                           Synonyms in English
Opservacija, podatak                           Observation, Data instance, Example, Data Sample, Point
Klasa, zavisna varijabla, ciljna varijabla     Class, Dependent variable, Goal, Outcome
Varijabla, značajka, atribut, nezavisna var.   Variable, Feature, Attribute, Independent var.
Prenaučenost, pretreniranost modela            Model Overfitting
Kontinuirane, kvantitativne varijable          Continuous, Numeric, Quantitative
Diskretne, kvalitativne varijable              Discrete, Qualitative
Klasifikacija, raspoznavanje, razvrstavanje    Classification
Grupiranje, klasteriranje                      Clustering
Anotirani, označeni podaci                     Annotated, Labelled Dataset (Points)

Data/Variable Types

Type        Scale     Possible operations
Discrete    Nominal   = , <>
Discrete    Ordinal   > , < , >= , <=
Continuous  Interval  + , -
Continuous  Ratio     * , /

Why is this important?

• Descriptive statistics

• Preprocessing techniques

• Choosing the ML method/algorithm

• Testing methodologies

• Results interpretation

More on this: https://www.youtube.com/watch?v=YFC2KUmEebc (David Mease, Google Tech Talks 2007)
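As a rough illustration of why the measurement scale matters for descriptive statistics, the sketch below picks a different summary for each scale. The columns and their values are hypothetical and are not taken from the talk; only Python's standard library is used.

```python
from statistics import mean, mode

# Hypothetical toy columns, one per measurement scale (not from the talk).
eye_colour = ["brown", "blue", "brown", "green"]            # nominal
headache = ["No", "Moderate", "Strong", "Moderate", "No"]   # ordinal
temperature_c = [36.6, 38.2, 37.1, 39.0]                    # interval
income = [1200.0, 800.0, 2400.0]                            # ratio

# Nominal: only equality is defined, so the mode is the natural summary.
nominal_summary = mode(eye_colour)

# Ordinal: values can be ranked, so a median (middle element) makes sense.
rank = {"No": 0, "Moderate": 1, "Strong": 2}
ordered = sorted(headache, key=rank.get)
ordinal_summary = ordered[len(ordered) // 2]

# Interval: differences are meaningful, so the mean is usable.
interval_summary = mean(temperature_c)

# Ratio: a true zero exists, so statements like "three times as much" are valid.
ratio_summary = max(income) / min(income)
```

Taking the mean of an ordinal or nominal column would run without error once the values are encoded as numbers, which is exactly why the scale has to be tracked by the analyst rather than by the code.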

Spark

• MLlib
  • Longer development
  • Lots of developers and methods
  • Well tested

• ML
  • New
  • Should make ML in Spark easier
  • Support for the entire ML „pipeline”
  • Alpha
  • Bugs?

Spark – ML methods (MLlib)

• Data types

• Basic statistics
  • summary statistics
  • correlations
  • stratified sampling
  • hypothesis testing
  • random data generation

• Classification and regression
  • linear models (SVMs, logistic regression, linear regression)
  • naive Bayes
  • decision trees
  • ensembles of trees (Random Forests and Gradient-Boosted Trees)

• Collaborative filtering
  • alternating least squares (ALS)

• Clustering
  • k-means

• Dimensionality reduction
  • singular value decomposition (SVD)
  • principal component analysis (PCA)

• Feature extraction and transformation

• Optimization (developer)
  • stochastic gradient descent
  • limited-memory BFGS (L-BFGS)

Naive Bayes

Chills  Runny Nose  Headache  Fever  Flu?
Yes     No          Moderate  Yes    No
Yes     Yes         No        No     Yes
Yes     No          Strong    Yes    Yes
No      Yes         Moderate  Yes    Yes
No      No          No        No     No
No      Yes         Strong    Yes    Yes
No      Yes         Strong    No     No
Yes     Yes         Moderate  Yes    Yes

What about the next patient? Symptoms:

Chills  Runny Nose  Headache  Fever  Flu?
Yes     No          Moderate  No     ?

Calculation 1/2

Condition                      Probability  Condition                     Probability
P(Flu=Yes)                     0.625        P(Flu=No)                     0.375
P(Chills=Yes|Flu=Yes)          0.6          P(Chills=Yes|Flu=No)          0.333
P(Chills=No|Flu=Yes)           0.4          P(Chills=No|Flu=No)           0.667
P(Runny Nose=Yes|Flu=Yes)      0.8          P(Runny Nose=Yes|Flu=No)      0.333
P(Runny Nose=No|Flu=Yes)       0.2          P(Runny Nose=No|Flu=No)       0.667
P(Headache=Moderate|Flu=Yes)   0.4          P(Headache=Moderate|Flu=No)   0.333
P(Headache=No|Flu=Yes)         0.2          P(Headache=No|Flu=No)         0.333
P(Headache=Strong|Flu=Yes)     0.4          P(Headache=Strong|Flu=No)     0.333
P(Fever=Yes|Flu=Yes)           0.8          P(Fever=Yes|Flu=No)           0.333
P(Fever=No|Flu=Yes)            0.2          P(Fever=No|Flu=No)            0.667

\[ P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)} \]
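The priors and conditional probabilities on this slide can be reproduced by simple counting over the eight training rows. The sketch below is one minimal way to do that in plain Python; the function name `fit_naive_bayes` and the dictionary layout are illustrative choices, not part of the talk or of Spark's API, and no smoothing is applied.

```python
from collections import Counter, defaultdict

FEATURES = ["Chills", "Runny Nose", "Headache", "Fever"]

# The eight labelled rows from the slide; the last column is the class (Flu?).
ROWS = [
    ("Yes", "No", "Moderate", "Yes", "No"),
    ("Yes", "Yes", "No", "No", "Yes"),
    ("Yes", "No", "Strong", "Yes", "Yes"),
    ("No", "Yes", "Moderate", "Yes", "Yes"),
    ("No", "No", "No", "No", "No"),
    ("No", "Yes", "Strong", "Yes", "Yes"),
    ("No", "Yes", "Strong", "No", "No"),
    ("Yes", "Yes", "Moderate", "Yes", "Yes"),
]


def fit_naive_bayes(rows):
    """Estimate P(class) and P(feature=value | class) by relative frequency."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (feature, class) -> counts per value
    for row in rows:
        label = row[-1]
        # zip stops after the four feature columns, so the label is skipped.
        for feature, value in zip(FEATURES, row):
            value_counts[(feature, label)][value] += 1

    priors = {label: n / len(rows) for label, n in class_counts.items()}
    likelihoods = {
        (feature, value, label): count / class_counts[label]
        for (feature, label), counter in value_counts.items()
        for value, count in counter.items()
    }
    return priors, likelihoods


priors, likelihoods = fit_naive_bayes(ROWS)
# priors["Yes"] is 5/8 and likelihoods[("Chills", "Yes", "Yes")] is 3/5,
# matching the first two rows of the probability table.
```

Because nothing is smoothed, a feature value that never occurs together with a class simply has no entry; practical implementations usually add a smoothing term to avoid zero probabilities.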

Calculation 2/2

• For the patient:

Chills  Runny Nose  Headache  Fever  Flu?
Yes     No          Moderate  No     ?

• Just multiply:

• P(Flu=Yes) × P(Chills=Yes|Flu=Yes) × P(Runny Nose=No|Flu=Yes) × P(Headache=Moderate|Flu=Yes) × P(Fever=No|Flu=Yes) = ?

• P(Flu=No) × P(Chills=Yes|Flu=No) × P(Runny Nose=No|Flu=No) × P(Headache=Moderate|Flu=No) × P(Fever=No|Flu=No) = ?

Example source: https://www.youtube.com/watch?v=ZAfarappAO0
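The two products can be evaluated directly from the probability table. The sketch below hard-codes the relevant table entries (using exact thirds instead of the rounded 0.333 and 0.667) and then normalizes the two scores so they sum to one; the variable names are illustrative. Under these values the score for Flu=Yes comes out at about 0.006 and the score for Flu=No at about 0.0185, so the model would predict Flu=No.

```python
from math import prod

# Priors and the conditional probabilities needed for the new patient
# (Chills=Yes, Runny Nose=No, Headache=Moderate, Fever=No).
PRIOR = {"Yes": 0.625, "No": 0.375}
LIKELIHOOD = {
    "Yes": {"Chills=Yes": 0.6, "Runny Nose=No": 0.2,
            "Headache=Moderate": 0.4, "Fever=No": 0.2},
    "No": {"Chills=Yes": 1 / 3, "Runny Nose=No": 2 / 3,
           "Headache=Moderate": 1 / 3, "Fever=No": 2 / 3},
}

# Unnormalized score per class: prior times the product of the likelihoods.
scores = {
    label: PRIOR[label] * prod(LIKELIHOOD[label].values())
    for label in PRIOR
}

# Dividing by the sum plays the role of P(E) in Bayes' theorem.
total = sum(scores.values())
posterior = {label: score / total for label, score in scores.items()}

prediction = max(posterior, key=posterior.get)
```

The normalization step is optional for classification, since the larger unnormalized score already decides the class; it only matters when a probability estimate (here roughly 0.24 for Flu=Yes and 0.76 for Flu=No) is needed.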

Model testing – confusion matrix and error types

                      Predicted Positive (P’)  Predicted Negative (N’)
Actual Positive (P)   True Positive (TP)       False Negative (FN)
Actual Negative (N)   False Positive (FP)      True Negative (TN)

Model testing – success/accuracy measures

• Classification Accuracy: (TP+TN)/(TP+TN+FP+FN)

• Sensitivity: TP/P = TP/(TP+FN)

• Specificity: TN/N = TN/(TN+FP)

• Positive Predictive Value (PPV): TP/P’ = TP/(TP+FP)

• Negative Predictive Value (NPV): TN/N’ = TN/(TN+FN)
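The five measures above translate directly into code. The following is a minimal sketch; the function name and the confusion-matrix counts in the usage line are made up for illustration and do not come from the demo.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the measures listed above from the four confusion-matrix cells."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # TP / P
        "specificity": tn / (tn + fp),   # TN / N
        "ppv": tp / (tp + fp),           # TP / P'
        "npv": tn / (tn + fn),           # TN / N'
    }


# Hypothetical counts: 50 actual positives, 50 actual negatives.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
```

The function does not guard against empty rows or columns of the confusion matrix; if a model never predicts the positive class, for example, the PPV denominator is zero and the call raises `ZeroDivisionError`.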

Why ML in Spark?

• MLlib (and ML) based on Spark

• Speed comes from Spark (distributed learning, in memory, fault tolerance etc...)

• Lots of algorithms

• API is simple to use

• Various languages (Scala, Java, Python)

• Open source community (very active)

• Simple integration with other Spark components, e.g. Spark Streaming and „online” learning

• Spark ecosystem for the entire „pipeline”

Source: "MLlib: Spark's Machine Learning Library" by Ameet Talwalkar at AMPCamp 5 - http://www.slideshare.net/jeykottalam/mllib

Features

• Always starting with a „table”

• Rows are data points

• Columns are variables/features

• Dense – All fields are filled

• Sparse – Only „non-zero” data

• Feature hashing

• John likes to watch movies.

• Mary likes movies too.

• John also likes football.

„John likes to watch movies. Mary likes too. John also likes to watch football games.”

Dictionary: {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

Matrix: [[1 2 1 1 1 0 0 0 1 1] [1 1 1 1 0 1 1 1 0 0]]

Sources: http://en.wikipedia.org/wiki/Feature_hashing and http://stats.stackexchange.com/questions/73325/understanding-feature-hashing
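The dictionary and matrix above can be rebuilt with a few lines of Python, and the same tokens can be fed through the hashing trick to get fixed-length vectors without storing a dictionary. This is a sketch under stated assumptions: the two-document split follows the two rows of the matrix, the regular-expression tokenizer is a simplification, and `zlib.crc32` with 8 buckets is an arbitrary stand-in for a hash function, not the one Spark uses.

```python
import re
import zlib

# Two documents, corresponding to the two rows of the matrix on the slide.
DOCS = [
    "John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.",
]

# Vocabulary in the order given by the dictionary on the slide.
VOCAB = ["John", "likes", "to", "watch", "movies",
         "also", "football", "games", "Mary", "too"]


def tokenize(text):
    """Split text into alphabetic tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocab]


def hashed_features(text, n_buckets=8):
    """Hashing trick: map each token to a bucket index instead of a dictionary slot."""
    vector = [0] * n_buckets
    for token in tokenize(text):
        index = zlib.crc32(token.encode("utf-8")) % n_buckets
        vector[index] += 1
    return vector


matrix = [bag_of_words(doc, VOCAB) for doc in DOCS]
hashed = [hashed_features(doc) for doc in DOCS]
```

With hashing, the vector length is fixed in advance and unseen words need no dictionary update, at the cost of occasional collisions where two different words share a bucket.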

Spark Demo – Sentiment Analysis

• Annotated dataset of business news in the Croatian language

• Source: icapital.hr

• Small dataset (500)
  • We do not expect spectacular results

• Three classes
  • Positive
  • Negative
  • Neutral?

Natural Language Processing / Text Mining

• Preprocessing
  • Stemming
  • Lemmatization

• Features
  • Bag of Words, n-grams
  • TF(t) (Term Frequency) = Occurrences of term t in document / Total number of terms in document
  • IDF(t) (Inverse Document Frequency) = log(Total number of documents / Documents containing t)
  • Linguistic variables...
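The TF and IDF definitions above can be written out as follows. This is a minimal sketch: the three-document corpus is invented, the natural logarithm is an assumption (the slide does not state the base), and the functions follow the unsmoothed formulas exactly as given, which library implementations often modify.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: log(total documents / documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    """Weight of a term in one document, relative to the whole corpus."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical, already tokenized mini-corpus (not the demo dataset).
CORPUS = [
    ["profit", "rises", "sharply"],
    ["profit", "falls"],
    ["market", "flat"],
]

weight = tf_idf("profit", CORPUS[0], CORPUS)
```

A term that appears in every document gets an IDF of log(1) = 0 and so carries no weight, while `idf` as written raises `ZeroDivisionError` for a term that appears in no document, which is one reason smoothed variants are common.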

NLP in Croatia

• FFZG
  • Free components
  • http://nlp.ffzg.hr

• FER
  • Text Mining Add-On for Orange
  • https://bitbucket.org/biolab/orange-text/src

• FOI – www.foi.hr

• Someone else?

Typical ML/NLP workflow (Orange)

Most of this we can do in Spark, soon all of it (ML „Pipelines”)...

Where to learn ML?

• Croatian universities
  • FER, FOI, PMF, Algebra, FFZG for NLP etc.

• By yourself – Internet
  • Papers, books, blogs
  • MOOCs (Coursera, edX etc.)
  • Famous: https://www.coursera.org/course/ml

• Prerequisites (besides programming):
  • https://www.khanacademy.org/math/differential-calculus
  • https://www.khanacademy.org/math/linear-algebra
  • https://www.khanacademy.org/math/probability
  • https://www.coursera.org/course/matrix
  • https://www.coursera.org/learn/calculus1

• Great resource for Spark: http://ampcamp.berkeley.edu/

Next lectures?

• Entropy and variable importance?

• Methods
  • Linear regression and optimization (gradient descent)
  • Logistic regression
  • Decision trees (Random Forests)
  • Unsupervised learning
  • Collaborative filtering
  • Neural networks (not in Spark, for now)
  • ...

• Model testing (sampling, measures, ROC curve...)

• ML tips & tricks (regularization, overfitting etc.)

• ...
