Random Forest Using Apache Mahout

CS 267 : Data Mining Presentation Guided by : Dr. Tran -Gaurav Kasliwal

Upload: gaurav-kasliwal

Post on 26-Jan-2015

DESCRIPTION

Random Forest Model using Apache Mahout

TRANSCRIPT

Page 1: Random forest using apache mahout

CS 267 : Data Mining Presentation

Guided by : Dr. Tran

-Gaurav Kasliwal

Page 2: Random forest using apache mahout

Outline

RandomForest Model

Mahout Overview

RandomForest using Mahout

Problem Description

Working Environment

Data Preparation

ML Model Generation

Demo

Using Gini Index

Page 3: Random forest using apache mahout

RandomForest Model

Random forests are an ensemble learning method for classification that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.
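The "mode of the classes" rule can be illustrated with a small sketch. This is not Mahout's implementation; the function name `forest_predict` and the sample votes are hypothetical and only show how majority voting over tree outputs works.

```python
from collections import Counter


def forest_predict(tree_predictions):
    """Return the most common class among the individual tree predictions."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a one-element list: [(label, count)]
    return counts.most_common(1)[0][0]


# Hypothetical outputs of five trees for a single record
votes = [2, 0, 2, 1, 2]
print(forest_predict(votes))
```

Here three of the five trees vote for class 2, so the forest outputs class 2.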

Developed by Leo Breiman and Adele Cutler.

Page 4: Random forest using apache mahout

Mahout

Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.

Scalable to large data sets

Page 5: Random forest using apache mahout

RandomForest using Mahout

Generate a file descriptor for the dataset.

Run the example with the training data to build a Decision Forest model.

Use the Decision Forest model to classify the test data and get results.

Tune the model to get better results.

Page 6: Random forest using apache mahout

Problem Definition

Benchmark a machine learning model for page ranking on the Yahoo! Learning to Rank data.

Train Data : 34815 Records

Test Data : 130166 Records

Data Description : {R} | {q_id} | {List: feature_id -> feature_value}

where

R = {0, 1, 2, 3, 4}

q_id = query id (number)

feature_id = number

feature_value = 0 to 1
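A record in this shape has to be flattened into one dense row before it can go into a .csv file. The sketch below is only an illustration. It assumes the raw line looks like `2 qid:7 1:0.5 3:0.25` (label, query id, then sparse `feature_id:feature_value` pairs), that feature ids start at 1, and that the label goes in the last column. None of these details are confirmed by the slides, and `ltr_line_to_csv` is a hypothetical helper.

```python
def ltr_line_to_csv(line, num_features):
    """Convert one sparse record into a dense CSV row: q_id, features..., label."""
    parts = line.split()
    label = parts[0]
    # Assumed token shape "qid:<number>"
    q_id = parts[1].split(":", 1)[1]
    # Features missing from the sparse list default to 0
    values = ["0"] * num_features
    for token in parts[2:]:
        feature_id, feature_value = token.split(":", 1)
        values[int(feature_id) - 1] = feature_value  # assumed 1-based ids
    return ",".join([q_id] + values + [label])


print(ltr_line_to_csv("2 qid:7 1:0.5 3:0.25", 4))
```

Putting the label last is a choice made here so that the row lines up with a descriptor that ends in a label column.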

Page 7: Random forest using apache mahout

Working Environment

Ubuntu

Hadoop 1.2.1

Mahout 0.9

Page 8: Random forest using apache mahout

Prepare Dataset

Take the data from the input text file.

Make a .csv file.

Make a directory in HDFS and upload train.csv and test.csv to it.

Data Loading (load data to HDFS)

#hadoop fs -mkdir final_data

#hadoop fs -put train.csv final_data

#hadoop fs -put test.csv final_data

#hadoop fs -ls final_data (check with the ls command)

Page 9: Random forest using apache mahout

Using Mahout

Make metadata:

#hadoop jar mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe -p final_data/train.csv -f final_data/train.info1 -d 702 N L

This creates the metadata file train.info1 in the final_data folder.
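The descriptor `702 N L` passed with `-d` can be read as "702 numerical attributes followed by the label". The sketch below expands such a descriptor into one type per column. It assumes the usual Mahout convention that a number repeats the type token after it (N for numerical, L for label), and `expand_descriptor` is a hypothetical helper, not part of Mahout.

```python
def expand_descriptor(tokens):
    """Expand tokens such as ['702', 'N', 'L'] into one type per column."""
    expanded = []
    repeat = 1
    for token in tokens:
        if token.isdigit():
            # A number applies to the next type token
            repeat = int(token)
        else:
            expanded.extend([token] * repeat)
            repeat = 1
    return expanded


columns = expand_descriptor("702 N L".split())
print(len(columns), columns[0], columns[-1])  # 703 N L
```

Under this reading, each row of train.csv would need 703 comma-separated values, with the label in the last position.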

Page 10: Random forest using apache mahout

Create Model

Make model:

#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 -sl 5 -p -t 100 -o final-forest

Page 11: Random forest using apache mahout

Test Model

Test model:

#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i final_data/test.csv -ds final_data/train.info1 -m final-forest -a -mr -o final-pred

Page 12: Random forest using apache mahout

Results

Summary results : Confusion Matrix and statistics
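To make the summary concrete, here is a minimal sketch of how a confusion matrix and overall accuracy are derived from actual and predicted labels. The label lists are made up for illustration and are not results from this experiment. The function names are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of records on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Hypothetical labels drawn from the five relevance classes 0..4
actual = [0, 1, 2, 2, 4]
predicted = [0, 1, 2, 3, 4]
matrix = confusion_matrix(actual, predicted, labels=[0, 1, 2, 3, 4])
for row in matrix:
    print(row)
print(accuracy(matrix))
```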

Page 13: Random forest using apache mahout

Tuning

Change the parameters -t and -sl and check the results.

--nbtrees (-t) nbtrees Number of trees to grow

--selection (-sl) m Number of variables to select randomly at each tree-node.
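One way to organise tuning is to generate a BuildForest command for each combination of the two parameters. The sketch below only builds command strings from the flags already shown in this deck. The grid values and the output directory names are hypothetical.

```python
from itertools import product


def build_forest_command(trees, selection, output_dir):
    """Assemble a BuildForest command line for one (-t, -sl) setting."""
    return (
        "hadoop jar mahout-examples-0.9-job.jar "
        "org.apache.mahout.classifier.df.mapreduce.BuildForest "
        "-Dmapred.max.split.size=1874231 "
        "-d final_data/train.csv -ds final_data/train.info1 "
        f"-sl {selection} -p -t {trees} -o {output_dir}"
    )


# Hypothetical grid: two tree counts and two selection sizes
for trees, selection in product([100, 600], [5, 26]):
    print(build_forest_command(trees, selection, f"forest-t{trees}-sl{selection}"))
```

Each printed line can be run separately, and the resulting models compared with TestForest.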

Page 14: Random forest using apache mahout

Results

#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 -sl 700 -p -t 600 -o final-forest2

#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i final_data/test.csv -ds final_data/train.info1 -m final-forest2 -a -mr -o final-pred2

Page 15: Random forest using apache mahout

RF Split Selection

Typically we select about sqrt(K) predictors at each node, where K is the total number of predictors available.

If we have 500 columns of predictors, we will select only about 23 (sqrt(500) is roughly 22.4, rounded up).

We split our node with the best variable among the 23, not the best variable among the 500.
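The selection rule can be sketched as follows: take the square root of the predictor count, round up, and draw that many predictor indices at random without replacement. The helper `candidate_predictors` is hypothetical and is not how Mahout implements its `-sl` option.

```python
import math
import random


def candidate_predictors(num_predictors, rng):
    """Pick about sqrt(K) distinct predictor indices to consider at a node."""
    m = math.ceil(math.sqrt(num_predictors))
    return rng.sample(range(num_predictors), m)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
chosen = candidate_predictors(500, rng)
print(len(chosen))  # 23
```

A fresh subset is drawn at every node, which is what makes the trees differ from one another.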

Page 16: Random forest using apache mahout

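Using Gini Index

If a dataset T contains examples from n classes, its Gini index is defined as:

$$\mathrm{Gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where p_j is the relative frequency of class j in T.

If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively (N = N1 + N2), the Gini index of the split is defined as:

$$\mathrm{Gini}_{\mathrm{split}}(T) = \frac{N_1}{N}\,\mathrm{Gini}(T_1) + \frac{N_2}{N}\,\mathrm{Gini}(T_2)$$

The attribute value that provides the smallest Gini_split(T) is chosen to split the node.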
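As a worked illustration, the sketch below computes the Gini index of a label set (one minus the sum of squared class frequencies) and the Gini index of a two-way split (the size-weighted average of the two subsets). The function names and the sample labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a set of class labels: 1 - sum of squared frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini index of a split into two subsets."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical split: T1 has three records of class 0 and one of class 1,
# T2 has two records of class 1
print(gini_split([0, 0, 0, 1], [1, 1]))
```

For this split, Gini(T1) = 1 - (0.75^2 + 0.25^2) = 0.375 and Gini(T2) = 0, so the split value is (4/6) * 0.375 = 0.25.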

Page 17: Random forest using apache mahout

Example

The example below shows the construction of a single tree using the dataset.

Only two of the original four attributes are chosen for this tree construction.

Page 18: Random forest using apache mahout
Page 19: Random forest using apache mahout

The table below gives the Gini index value for the HOME_TYPE attribute at all possible splits.

The split HOME_TYPE <= 10 has the lowest value.

Gini SPLIT                     Value

Gini SPLIT(HOME_TYPE<=6)       0.4000

Gini SPLIT(HOME_TYPE<=10)      0.2671

Gini SPLIT(HOME_TYPE<=15)      0.4671

Gini SPLIT(HOME_TYPE<=30)      0.3000

Gini SPLIT(HOME_TYPE<=31)      0.4800
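Choosing the split from these values is just a matter of taking the minimum. A short sketch using the tabulated numbers (the variable names are hypothetical):

```python
# Gini split values for HOME_TYPE, keyed by split threshold
gini_by_threshold = {6: 0.4000, 10: 0.2671, 15: 0.4671, 30: 0.3000, 31: 0.4800}

# The threshold with the smallest Gini split value is used to split the node
best_threshold = min(gini_by_threshold, key=gini_by_threshold.get)
print(f"HOME_TYPE <= {best_threshold}")  # HOME_TYPE <= 10
```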
