
Advanced Data Science: Regression

Idan Schwartz

Titanic - Machine Learning from Disaster

• On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

• One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

• Question: what sorts of people were likely to survive?

Data cleaning

• To apply Machine Learning models, data must be converted to a tabular form.

• This is usually the most time-consuming and difficult part of the whole workflow.

Environment - Python

• We will work with Python.

• Fun to code!
  • Clean syntax: no semicolons, dynamic typing (no static type declarations)
  • Loads of built-in data structures: dict = {}, list = []
  • Easy to iterate: [f(x) for x in d] (see the short snippet after this list)
  • Supports OO and functional programming

• Awesome libraries – Python has all the libraries you need.
  • Use the Anaconda distribution for an easy start – it includes over 100 of the most popular Python packages for data science.
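For example, a minimal snippet (the names and values here are made up) showing the dict and list syntax and a comprehension from the bullets above:

# dict and list literals, plus a list comprehension (illustrative only)
ages = {'Alice': 29, 'Bob': 41}              # dict = {}
names = [name for name in ages]              # iterating a dict yields its keys
squares = [x * x for x in [1, 2, 3]]         # list = []
print(names, squares)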

Data Handling Tools - pandas

• Python's version of Excel.

• Easy to read and manipulate data
  • Column insertion and deletion
  • Merging and joining
  • Aggregation (group-by engine) – see the sketch after this list

• Easy statistics
  • One command to get all basic statistics (df.describe)
  • Time-series functionality

• Easy plotting.

• Other resources for working with pandas:
  • http://pandas.pydata.org/pandas-docs/stable/tutorials.html
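As a rough sketch of the describe/group-by points above (the toy table below is made up; the real Titanic data is loaded on the next slide):

import pandas as pd

# Tiny made-up table just to illustrate describe() and the group-by engine
df_toy = pd.DataFrame({'Pclass': [1, 1, 3, 3], 'Age': [38.0, 35.0, 22.0, 27.0]})
print(df_toy.describe())                        # one command for the basic statistics
print(df_toy.groupby('Pclass')['Age'].mean())   # aggregation: mean age per class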

Data observation

• Read the training data and look at the results:

import pandas as pd

df = pd.read_csv("data/train.csv")
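To look at the results, one might (for example) inspect the first rows and the column summary:

print(df.head())    # first five observations
df.info()           # column types and non-null counts (prints directly)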

Features – (or 𝑋)

• Each row is an observation.

• Each column tells us something about each of our observations, like their name, sex or age.

• These columns are called the features of our dataset.

• Most features have complete data for every observation, like the Survived feature, while some, like Age, have missing values.

Types of features

• There are usually three types of variables:

  • Numerical variables
    • Age, SibSp, Parch, etc.

  • Categorical variables
    • Pclass, Sex, Embarked

  • Variables with text inside them
    • Name

scikit-learn

• Collection of machine learning algorithms and tools in Python.

• Built on:
  • NumPy – multi-dimensional arrays and matrices (ndarray)
  • SciPy – key algorithms and functions core to Python's scientific computing capabilities
  • matplotlib – plotting

• Used in academia and industry (Spotify, bit.ly, Evernote).

• http://scikit-learn.org/stable/

Processing features - categorical data

• Using scikit-learn Preprocessing library:

• LabelEncoder – convert the categorical data to integer labels
  • transform([1, 1, 2, 6]) to [0, 0, 1, 2]

• Other option – use pandas: pandas.get_dummies()

• OneHotEncoder – convert to one-hot encoding (if order is not important); see the sketch below

• Example:
  • Fit on the following dataset: [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
  • transform([[0, 1, 1]]) gives [[1., 0., 0., 1., 0., 0., 1., 0., 0.]]
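On the Titanic data this might look roughly as follows (a sketch; the new column name Sex_label and the Embarked_* dummy names are my own choices, and Sex/Embarked are taken as the categorical columns of interest):

from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.read_csv("data/train.csv")

# LabelEncoder: Sex -> integer labels (e.g. female/male -> 0/1)
df['Sex_label'] = LabelEncoder().fit_transform(df['Sex'])

# pandas alternative producing one-hot columns (order carries no meaning)
df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)
print(df[['Sex', 'Sex_label', 'Embarked_C', 'Embarked_Q', 'Embarked_S']].head())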

Processing features - text

• Bag of words (BOW) – convert text documents to vectors of word counts (in scikit-learn use CountVectorizer)

• Stop words are words which are filtered out before or after processing of natural-language data (stop word lists can be obtained from NLTK)

tf-idf

• Weighting the counts, so that frequent tokens get lower weight (inverse document frequency)

• $tf(t, d)$ means the term frequency
  • Simplest is to use the raw frequency of a term in a document, i.e. the number of times that term $t$ occurs in document $d$: $|\{t \in d\}|$

• $idf(t, D)$ means the inverse document frequency
  • A measure of how much information the word provides, that is, whether the term is common or rare across all documents
  • $idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$
  • $N = |D|$
  • $|\{d \in D : t \in d\}|$ – the number of documents in which the term $t$ appears

tf-idf

• The tf-idf feature: $\mathrm{tfidf}(t, d, D) = tf(t, d) \cdot idf(t, D)$

• tf-idf performs better than raw counts most of the time

• In scikit-learn, use TfidfVectorizer (a short sketch follows)
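A minimal sketch (the two toy documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]   # toy corpus
vectorizer = TfidfVectorizer(stop_words='english')
X_text = vectorizer.fit_transform(docs)      # sparse matrix of tf-idf weights
print(vectorizer.vocabulary_)                # vocabulary learned from the corpus
print(X_text.toarray())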

Neural-network-based models

• Words which share common contexts in the corpus are located in close proximity to one another in the vector space.

• Packages:

  • Word2vec
    • https://radimrehurek.com/gensim/models/word2vec.html

  • Doc2vec (for documents instead of words)
    • https://radimrehurek.com/gensim/models/doc2vec.html

  • fastText
    • https://github.com/facebookresearch/fastText

• More details in next lectures.

Natural Language Toolkit (NLTK)

• Over 50 corpora and lexical resources

• Text processing libraries

• Example: adding part-of-speech (POS) tags

• You should also check TextBlob for text processing; it aims to be simpler.

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]

Cleaning the data

• The Ticket and Cabin features have many missing values and so can't add much value to our analysis.

• To handle this we will drop them from the DataFrame to preserve the integrity of our dataset.

df = df.drop(['Ticket','Cabin'], axis=1)

What about names?

• Is there a difference between: Miss, Mrs? Master, Mr?

• Stemming – reduce a word to the part that is common to all its inflected variants (use NLTK)

• Examples:
  • cats, catlike, "catty" etc., based on the root "cat"
  • waits, waited, waiting, based on the root "wait"

• Other things to consider:
  • Is the specific name really relevant? (a sketch for pulling out just the title follows below)

• Do capital letters carry information?

• Punctuation symbols?
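One possible way (a hypothetical sketch; Title is my own column name) to pull just the title out of the Name column is a small regular expression:

import pandas as pd

df = pd.read_csv("data/train.csv")
# Names look like "Braund, Mr. Owen Harris"; grab the word just before the dot
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print(df['Title'].value_counts())            # Mr, Miss, Mrs, Master, ...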

NaN values

• Naively, just drop any observation containing a NaN value
  • In pandas use df = df.dropna()

• Better – fill with the mean, median, or most frequent value (a sketch of the simple options follows)
  • Check the Imputer class in scikit-learn

• Even better – predict the missing values using the other features
  • For instance, Sex, family size, class, etc. can tell us the Age

• There are many intelligent ways to handle uncertainty in data
  • Relevant course – Uncertainty in Databases by Benny Kimelfeld
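A sketch of the simple options mentioned above (median fill with pandas, or the scikit-learn Imputer):

# Simple option: fill missing ages with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())

# scikit-learn alternative (Imputer in older releases, SimpleImputer in newer ones)
from sklearn.preprocessing import Imputer
df[['Age']] = Imputer(strategy='median').fit_transform(df[['Age']])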

Labels – (or 𝑌)

• Labels are the data we want to predict given the features.

• In our case, we want to predict the Survived column

• Single column, binary values

• Survived – 1, Not survived – 0

Types of labels

• Classification problems:

  • Single column, binary values (e.g. 1/0)
    • Review is positive or negative

  • Single column, multiple values (e.g. 1/2/3/…/n) – multi-class

  • Multiple columns, binary values (e.g. 1/0, 1/0) – multi-label
    • Review is on kitchen/books/movies

• Regression problems:

  • Single column, real values (e.g. 214.25)
    • What is the salary of a 40-year-old man?

  • Multiple columns, real values (e.g. 214.25, 335.5)

Reminder: Regression

• Regression predicts a continuous-valued output for a given input.

• Data: observed pairs $(x_i, y_i)$, $i = 1, \dots, n$

• Goal: learn $f: X \to Y$, where $Y \mid X = f(x) + \epsilon$ and $\epsilon \sim N(0, \sigma)$ is noise

Reminder: Linear Regression

• Assumes a linear relationship between inputs and outputs

• Data: observed pairs $(x_i, y_i)$

• Therefore: $f(x) = w_0 + \sum_j w_j x_j$

• ⇒ for a single feature: $f(x) = w_0 + w_1 x$

• Now, you want to predict the commute time for a new person who lives 1.1 miles from campus.
  • Reading the prediction off the fitted line at 1.1 miles gives roughly 23 minutes.

How can we find this line?

• Define:
  • $x_i$: input, distance from campus
  • $y_i$: output, commute time

• We want to predict y for an unknown x

• Assume:
  • In general, assume $y = f(x) + \epsilon$
  • For 1-D linear regression, assume $f(x) = w_0 + w_1 x$

• We want to learn the parameters w

We can learn w from the observed data by maximizing the conditional likelihood.

• Maximizing the conditional likelihood is equivalent to minimizing the least-squares error.
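A tiny sketch of fitting w0 and w1 by least squares (the commute-time numbers below are made up for illustration):

import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 3.0])        # distance from campus (made-up)
y = np.array([12.0, 18.0, 24.0, 30.0, 44.0])   # commute time in minutes (made-up)

# Least-squares fit of f(x) = w0 + w1 * x
A = np.column_stack([np.ones_like(x), x])
w, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(w)                       # [w0, w1]
print(w[0] + w[1] * 1.1)       # prediction for someone living 1.1 miles from campus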

Classification

• We've seen in the course:
  • Naïve Bayes
  • Decision trees
  • Logistic regression

Logistic regression is a discriminative approach to classification.

• Discriminative: directly estimates P(Y|X)

• Only concerned with discriminating (differentiating) between classes Y

• In contrast, naïve Bayes is a generative classifier
  • Estimates P(Y) and P(X|Y) and uses Bayes' rule to calculate P(Y|X)

• Explains how data are generated, given class label Y

• Both logistic regression and naïve Bayes use their estimates of P(Y|X) to assign a class to an input X—the difference is in how they arrive at these estimates, and their assumptions.

• Logistic regression doesn’t use the naïve Bayes assumption for training.

Assumption of logistic regression

• Consider learning $f: X \to Y$, where
  • $X$ is a vector of real-valued features, $\langle X_1, \dots, X_n \rangle$
  • $Y$ is boolean
  • Assume all $X_i$ are conditionally independent given $Y$
  • Model $P(X_i \mid Y = y_k)$ as Gaussian $N(\mu_{ik}, \sigma_i)$
  • Model $P(Y)$ as Bernoulli$(\pi)$

• What does that imply about the form of $P(Y \mid X)$?

• By applying Bayes' rule, we get the sigmoid function.

The logistic function


Logistic regression models probabilities with the logistic function.
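For reference, the standard form of this model (consistent with the decision boundary below) is

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(-\left(w_0 + \sum_i w_i x_i\right)\right)}$$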

• Want to predict $Y = 1$ for $X$ when $P(Y = 1 \mid X) \geq 0.5$

• The decision boundary is $w_0 + \sum_i w_i x_i = 0$ ($Y = 1$ on one side, $Y = 0$ on the other)

Maximize the conditional likelihood to find the weights w = [w0,w1,…,wd].

How can we optimize this function?

• Concave

• No closed-form solution for w

• Calculate the gradient: $\frac{\partial \ell(W)}{\partial w_i} = \sum_l x_i^l \left( Y^l - P(Y^l = 1 \mid X^l, W) \right)$

Gradient descent can optimize differentiable functions.

• Suppose you have a differentiable function f(x)

• Gradient descent:
  • Choose a starting point $x^{(0)}$
  • Repeat until no change:
    $x^{(t+1)} \leftarrow x^{(t)} - \eta \, \nabla f(x^{(t)})$
    (updated value for the optimum ← previous value for the optimum, minus the step size times the gradient of f evaluated at the current x)

Here is the trajectory of gradient descent on a quadratic function.

How does step size affect the result?

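A minimal sketch of these steps on a quadratic function (the function, starting point, and step size are arbitrary choices; try different eta values to see the effect of the step size):

# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
def grad(x):
    return 2.0 * (x - 3.0)

x = 10.0        # starting point x(0)
eta = 0.1       # step size
for _ in range(100):
    x_new = x - eta * grad(x)        # x(t+1) = x(t) - eta * grad f(x(t))
    if abs(x_new - x) < 1e-8:        # stop when there is (almost) no change
        break
    x = x_new
print(x)        # close to the minimum at x = 3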

Regularization

• There are no constraints on the search space of 𝑤.

• It might cause the algorithm to over-fit over the training examples.

• We want to add a simple prior on the w parameters.

MAP instead of MLE

• Prior: $w \sim N\left(0, \frac{1}{2\lambda}\right)$

How it affects our GD step

• $w_i \leftarrow w_i - 2\eta\lambda w_i + \eta \sum_l x_i^l \left( Y^l - P(Y^l = 1 \mid X^l, W) \right)$

• The regularization term pulls the weights toward zero.

Why small weights

• We bias the weights to be small.

• ⇒ Simpler model: the sigmoid function is "less sure" (its probabilities stay further from 0 and 1).

$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-wx}}$

(Plot of $P(x)$ against $x$: smaller weights give a flatter, less sure curve.)

Back to our problem – visualizing the data

• Graph of how many survived

• Important to see if data is skewed.

df.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart)

Visualizing – understanding the data

df.Age[df.Pclass == 1].plot(kind='kde')
df.Age[df.Pclass == 2].plot(kind='kde')
df.Age[df.Pclass == 3].plot(kind='kde')

df.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
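A self-contained version of these plots might look roughly like this (the transparency value and legend labels are arbitrary choices):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/train.csv")

plt.figure()
df.Survived.value_counts().plot(kind='bar', alpha=0.55)   # how many survived (0/1)

plt.figure()
for pclass in (1, 2, 3):
    df.Age[df.Pclass == pclass].plot(kind='kde')          # age distribution per class
plt.legend(('1st class', '2nd class', '3rd class'))

plt.figure()
df.Embarked.value_counts().plot(kind='bar', alpha=0.55)   # port of embarkation counts
plt.show()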

Who Survived?

• Women and upper-class passengers were more likely to survive.

• Understanding the most basic relationships is essential for building a more insightful model.

Classification

• It's very simple to run classification algorithms in scikit-learn.

• Simple call:

• You can replace LogisticRegression with any classification class.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1.0, penalty='l2', random_state=None)

clf.fit(X[train], y[train])

print(clf.score(X[test], y[test]))

Popular classification modules

How to choose?

• Also use GridSearchCV to tune model hyperparameters (a rough sketch follows)
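For example, a sketch of tuning the regularization strength C of LogisticRegression (the parameter grid is just an illustration, and X, y are the features and labels prepared earlier):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}          # candidate regularization strengths
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)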

Use pipelines to easily combine techniques

pipeline = Pipeline([

('bow', CountVectorizer(analyzer=split_into_lemmas)),  # strings to token integer counts (split_into_lemmas is a user-defined tokenizer function)

('tfidf', TfidfTransformer()), # integer counts to weighted TF-IDF scores

('classifier', MultinomialNB()), # train on TF-IDF vectors w/ Naive Bayes classifier

])

Pickle

• Use pickle to save model after training

import cPickle  # Python 2; in Python 3 use the pickle module instead

with open('survival_classifier.pkl', 'wb') as clf_file:
    cPickle.dump(clf, clf_file)
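Loading the model back later is the mirror image (same file name as above):

with open('survival_classifier.pkl', 'rb') as clf_file:
    clf = cPickle.load(clf_file)
print(clf.predict(X[test]))      # reuse the trained classifier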

Evaluation - Cross validation

• We split the data into two different parts, a training set and a validation set.

KFolds

• Split dataset into k consecutive folds

• Each fold is used as a validation set once, while the k - 1 remaining folds form the training set (see the sketch below).
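A small sketch with scikit-learn's KFold (k = 5 here is an arbitrary choice; X, y, and clf are assumed to be the numpy feature matrix, the labels, and the classifier from the earlier slides):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    clf.fit(X[train_idx], y[train_idx])            # train on the k-1 folds
    print(clf.score(X[test_idx], y[test_idx]))     # validate on the held-out fold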

Skewed data sets

• Skewed datasets appear in classification problems
  • When one class is over-represented in the data set

• Example: fraud detection (90% of activity is normal)

Cross validation – skewed datasets

• Use stratified splitting (StratifiedKFold in scikit-learn), so that each fold preserves the class proportions.

Model in one command

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline,  # steps to convert raw messages into models
                         X,         # training data
                         y,         # training labels
                         cv=10,     # split data randomly into 10 parts: 9 for training, 1 for scoring
                         scoring='accuracy',  # which scoring metric?
                         n_jobs=-1,  # -1 = use all cores = faster
                         )
print(scores)

Evaluation Metrics

• How are we going to evaluate our results?
  • What is the evaluation metric or objective?

• Basic method: accuracy – the fraction of predictions that are correct

• Problem – accuracy is a bad metric for skewed datasets

• We should use AUC; more on this in the next classes.
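As a sketch, both metrics are a single call in scikit-learn (assuming clf, X[test], and y[test] from the earlier slides):

from sklearn.metrics import accuracy_score, roc_auc_score

y_pred = clf.predict(X[test])                  # hard 0/1 predictions
y_scores = clf.predict_proba(X[test])[:, 1]    # predicted probability of class 1

print(accuracy_score(y[test], y_pred))         # can look good even on skewed data
print(roc_auc_score(y[test], y_scores))        # AUC is more informative on skewed data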
