Classification with Naïve Bayes: A Deep Dive into Apache Mahout

Copyright 2011 Cloudera Inc. All rights reserved.

Today’s speaker – Josh Patterson

Twitter: @jpatanooga

• Master’s Thesis: self-organizing mesh networks

– Published in IAAI-09: TinyTermite: A Secure Routing Algorithm

• Conceived, built, and led Hadoop integration for the openPDC project at TVA (Smartgrid stuff)

– Led small team which designed classification techniques for time series and Map Reduce

– Open source work at

• Now: Solutions Architect at Cloudera


What is Classification?

• Supervised Learning

• We give the system a set of instances to learn from

• System builds knowledge of some structure

– Learns “concepts”

• System can then classify new instances

Supervised vs Unsupervised Learning

• Supervised

– Give system examples/instances of multiple concepts

– System learns “concepts”

– More “hands on”

– Example: Naïve Bayes, Neural Nets

• Unsupervised

– Uses unlabeled data

– Builds joint density model

– Example: k-means clustering

Naïve Bayes

• Called Naïve Bayes because it’s based on “Bayes’ Rule” and “naively” assumes independence given the label

– It is only valid to multiply probabilities when the events are independent

– Simplistic assumption in real life

– Despite the name, Naïve Bayes works well on actual datasets
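Written out as a sketch, with H standing for the label (hypothesis) and E_n for the n-th piece of evidence (these symbols are shorthand chosen here for illustration), the independence assumption turns Bayes' Rule into a simple product:

```latex
\Pr[H \mid E] \;=\; \frac{\Pr[H] \,\prod_{n} \Pr[E_n \mid H]}{\Pr[E]}
```

The denominator Pr[E] is the same for every label, so when only the most likely label is needed it can be ignored and the numerators compared directly.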

Naïve Bayes Classifier

• Simple probabilistic classifier based on

– applying Bayes’ theorem (from Bayesian statistics)

– strong (naive) independence assumptions.

– A more descriptive term for the underlying probability model would be “independent feature model”.

Naïve Bayes Classifier (2)

• Assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.

– Example:

• A fruit may be considered to be an apple if it is red, round, and about 4" in diameter.

• Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
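The apple example can be sketched in a few lines of Python. This is only an illustration: the priors and per-feature likelihoods below are made-up numbers, and the names `posterior`, `priors`, and `likelihoods` are hypothetical, not part of Mahout.

```python
from math import prod

def posterior(priors, likelihoods, observed):
    """Return Pr[label | observed features] under the naive assumption.

    Each feature contributes an independent factor Pr[feature | label];
    the scores are then normalized so they sum to 1 across labels.
    """
    scores = {
        label: priors[label] * prod(likelihoods[label][f] for f in observed)
        for label in priors
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# Made-up probabilities, for illustration only.
priors = {"apple": 0.5, "other": 0.5}
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9, "about_4in": 0.7},
    "other": {"red": 0.3, "round": 0.4, "about_4in": 0.2},
}

result = posterior(priors, likelihoods, ["red", "round", "about_4in"])
best_label = max(result, key=result.get)
```

With these invented numbers the "apple" label wins because every feature contributes its own factor, regardless of whether redness and roundness are actually correlated.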

A Little Bit o’ Theory

Condensing Meaning

• To train our system we need

– Total number of input training instances (count)

– Counts of tuples:

• {attribute_n, outcome_o, value_m}

– Total counts of each outcome_o

• {outcome-count}

• To calculate each Pr[E_n|H]

– ({attribute_n, outcome_o, value_m} / {outcome-count})
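A minimal sketch of this counting step, assuming each training instance is a dictionary of attribute values plus an outcome. The tiny dataset and the function names `train` and `conditional` are hypothetical and chosen only for illustration; no smoothing is applied, so an unseen tuple gets probability 0.

```python
from collections import Counter

def train(instances):
    """Count what the slide lists: total instances, tuples, and outcomes."""
    tuple_counts = Counter()    # {attribute, outcome, value} -> count
    outcome_counts = Counter()  # outcome -> count
    for attributes, outcome in instances:
        outcome_counts[outcome] += 1
        for attribute, value in attributes.items():
            tuple_counts[(attribute, outcome, value)] += 1
    return len(instances), tuple_counts, outcome_counts

def conditional(tuple_counts, outcome_counts, attribute, value, outcome):
    """Pr[E_n | H] = {attribute, outcome, value} count / outcome count."""
    return tuple_counts[(attribute, outcome, value)] / outcome_counts[outcome]

# Hypothetical toy training set.
instances = [
    ({"outlook": "sunny", "windy": "false"}, "no"),
    ({"outlook": "sunny", "windy": "true"}, "no"),
    ({"outlook": "overcast", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
    ({"outlook": "sunny", "windy": "false"}, "yes"),
]

total, tuple_counts, outcome_counts = train(instances)
p_sunny_given_no = conditional(tuple_counts, outcome_counts, "outlook", "sunny", "no")
```

Only counts are stored, which is what makes the training step easy to spread across many machines: counts from separate chunks of data can simply be added together.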

…From the Vapor of That Last Big Equation

A Real Example From Witten, et al

Enter Apache Mahout

• What is it?

– Apache Mahout is a scalable machine learning library that supports large data sets

• What Are the Major Algorithm Types?

– Classification

– Recommendation

– Clustering


Mahout Algorithms

| Size of Dataset | Mahout Algo | Execution Model | Characteristics |
|---|---|---|---|
| Small | SGD | Sequential | Uses all types of predictor vars |
| Medium | Naïve Bayes / Complementary Naïve Bayes | Parallel | Prefers text, high training cost |
| Large | Random Forest | Parallel | Uses all types of predictor vars, high training cost |

Naïve Bayes and Text

• Naive Bayes does not model text well.

– “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”


– Mahout makes some modifications based on TF-IDF scoring (next slide)

• Includes two other pre-processing steps, common for information retrieval but not for Naive Bayes classification

High Level Algorithm

• For each feature (word) in each doc:

– Calc: “Weight Normalized Tf-Idf”

• For a given feature in a label, this is the Tf-idf calculated using standard idf multiplied by the Weight Normalized Tf

– We calculate the sum of W-N-Tf-idf for all the features in a label, called Sigma_k, and set alpha_i = 1.0

Weight = Log [ ( W-N-Tf-Idf + alpha_i ) / ( Sigma_k + N ) ]
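The weight formula can be sketched directly in Python. The slide does not define N; the sketch below takes it as a parameter and, as an assumption, sets it to the number of distinct features. The function name `feature_weight` and the scores are hypothetical.

```python
import math

def feature_weight(wn_tfidf, sigma_k, n, alpha_i=1.0):
    """Log [ (W-N-Tf-Idf + alpha_i) / (Sigma_k + N) ] for one feature in one label."""
    return math.log((wn_tfidf + alpha_i) / (sigma_k + n))

# Hypothetical W-N-Tf-Idf scores for the features of a single label.
label_scores = {"hadoop": 2.5, "mahout": 1.5, "bayes": 0.0}

sigma_k = sum(label_scores.values())  # sum of W-N-Tf-Idf over the label's features
n = len(label_scores)                 # assumption: N = number of distinct features

weights = {
    feature: feature_weight(score, sigma_k, n)
    for feature, score in label_scores.items()
}
```

Because alpha_i is added to every score, a feature with a score of 0.0 still gets a finite (if strongly negative) weight instead of the logarithm of zero.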

BayesDriver Training Workflow


1. BayesFeatureDriver

• Compute parts of TF-IDF via Term-Doc-Count, WordFreq, and

2. BayesTfIdfDriver

• Calc the TF-IDF of each feature/word in each label

3. BayesWeightSummerDriver

• Take TF-IDF and calc trainer weights

4. BayesThetaNormalizerDriver

• Calcs norm factor SigmaWij for each complement class

Naïve Bayes Training MapReduce Workflow in Mahout

Logical Classification Process

1. Gather, Clean, and Examine the Training Data

– Really get to know your data!

2. Train the Classifier, allowing the system to “Learn” the “Concepts”

– But not “overfit” to this specific training data set

3. Classify New Unseen Instances

– With Naïve Bayes we’ll calculate the probabilities of each class with respect to this instance

How Is Classification Done?

• Sequentially or via Map Reduce


– Creates ClassifierContext

– For Each File in Dir

• For Each Line

– Break line into map of tokens

– Feed array of words to Classifier engine for new classification/label

– Collect classifications as output
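The sequential loop above can be sketched in Python as follows. This is not Mahout's implementation: `classify_directory` and `keyword_classifier` are hypothetical names, tokenization is plain whitespace splitting, and the classifier is a stand-in callable where a trained model would go.

```python
import tempfile
from pathlib import Path

def classify_directory(directory, classify):
    """For each file in the directory and each line in the file,
    tokenize the line and collect (file name, label) pairs."""
    results = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()  # simple whitespace tokenization
                if tokens:             # skip blank lines
                    results.append((path.name, classify(tokens)))
    return results

def keyword_classifier(tokens):
    """Stand-in for a trained classifier (hypothetical rule)."""
    return "hadoop" if "mapreduce" in tokens else "other"

# Usage: classify the lines of one small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text(
        "mapreduce jobs run in parallel\nhello world\n", encoding="utf-8"
    )
    labels = classify_directory(tmp, keyword_classifier)
```

Each line is classified independently of every other line, which is why the same logic can also run as a MapReduce job.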

A Quick Note About Training Data…

• Your classifier can only be as good as the training data lets it be…

– If you don’t do good data prep, everything will perform poorly

– Data collection and pre-processing takes the bulk of the time

Enough Math, Run the Code

• Download and install Mahout


• Run 20Newsgroups Example


– Uses Naïve Bayes Classification

– Download and extract 20news-bydate.tar.gz from the 20newsgroups dataset

Generate Test and Train Dataset

Training Dataset:

mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-train \
  -o examples/bin/work/20news-bydate/bayes-train-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8

Test Dataset:

mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-test \
  -o examples/bin/work/20news-bydate/bayes-test-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8

Train and Test Classifier

Train:

$MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input/bayes-train-input \
  -o newsmodel \
  -type bayes \
  -ng 3 \
  -source hdfs

Test:

$MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input \
  -type bayes \
  -ng 3 \
  -source hdfs \
  -method mapreduce

Other Use Cases

• Predictive Analytics

– You’ll hear this term a lot in the field, especially in the context of SAS

• General Supervised Learning Classification

– We can recognize a lot of things with practice

• And lots of tuning!

• Document Classification

• Sentiment Analysis

• We’re Hiring!

• Cloudera’s Distro of Apache Hadoop:


• Resources

– “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”
