Page 1:

Big data analytics with R and Hadoop
Chapter 5 Learning Data Analytics with R and Hadoop

Data Mining Lab (데이터마이닝연구실)
2015.04.23

Kim Ji-yeon (김지연)

Page 2:

Content

• Understanding the data analytics project life cycle
• Understanding data analytics problems
  – Exploring web pages categorization
  – Computing the frequency of stock market change
  – Predicting the sale price of blue book for bulldozers

Page 3:

Understanding the data analytics project life cycle

1. Identifying the problem
2. Designing data requirement
3. Preprocessing data
4. Performing analytics over data
5. Visualizing data

Page 4:

Understanding the data analytics project life cycle

1. Identifying the problem
• Business analytics trends change as data analytics is performed over web datasets to grow the business.
• A data analytics application needs to be scalable to collect insights from its datasets.
• If we want to know how to increase the business:
  – identify the important pages of our website by categorizing them
  – based on these popular pages, their types, their traffic sources, and their content, we will be able to decide the roadmap to improve the business by improving web traffic (content)

Page 5:

Understanding the data analytics project life cycle

2. Designing data requirement
• To perform the data analytics, we need datasets from related domains.
• Social media analytics (problem specification)
  – use Facebook or Twitter as the data source
  – to identify the user characteristics, we need user profile information, likes, and posts as data attributes

Page 6:

Understanding the data analytics project life cycle

3. Preprocessing data
• data cleansing
• data aggregation
• data augmentation
• data sorting
• data formatting

• Big Data
  – the datasets need to be formatted and uploaded to HDFS, where they are used by various nodes with Mappers and Reducers in Hadoop clusters
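
The five preprocessing operations listed above can be sketched in base R on a small in-memory table. This is only an illustration under assumptions: the `visits` data frame, its column names, and the output file name are invented here and do not come from the slides.

```r
# Toy page-visit records (all values invented for illustration)
visits <- data.frame(
  date   = c("2015-04-01", "2015-04-01", "2015-04-02", NA, "2015-04-02"),
  page   = c("/home", "/about", "/home", "/home", "/about"),
  visits = c(10, 3, 7, 5, NA),
  stringsAsFactors = FALSE
)

# Data cleansing: drop records that contain missing values
clean <- visits[complete.cases(visits), ]

# Data aggregation: total visits per page
agg <- aggregate(visits ~ page, data = clean, FUN = sum)

# Data augmentation: add each page's share of all visits
agg$share <- agg$visits / sum(agg$visits)

# Data sorting: most visited page first
agg <- agg[order(-agg$visits), ]

# Data formatting: headerless CSV, a layout that is easy to split in a Mapper
out_file <- file.path(tempdir(), "visits_clean.csv")
write.table(agg, out_file, sep = ",", row.names = FALSE,
            col.names = FALSE, quote = FALSE)
```

A file produced this way could then be uploaded to HDFS for the MapReduce stage.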

Page 7:

Understanding the data analytics project life cycle

4. Performing analytics over data
• various machine learning (custom algorithmic concepts)
  – regression
  – classification
  – clustering
  – model-based recommendation
• Big Data
  – the same algorithms can be run on Hadoop clusters by translating their data analytics logic into a MapReduce job

Page 8:

Understanding the data analytics project life cycle

5. Visualizing data
• ggplot2
• rCharts

Page 9:

Understanding data analytics problems - Exploring web pages categorization

1. Identifying the problem
• To identify the category of a web page of a website based on the visit count of the pages
• To identify the importance of web pages designed for websites; based on this, the content, design, or visits of the less popular pages can be improved or increased

Page 10:

Understanding data analytics problems - Exploring web pages categorization

2. Designing data requirement
• Use Google Analytics dataset
  – date: This is the date of the day when the web page was visited
  – source: This is the referral to the web page
  – pageTitle: This is the title of the web page
  – pagePath: This is the URL of the web page

Page 11:

Understanding data analytics problems - Exploring web pages categorization

2. Designing data requirement
• the code for the extraction process from Google Analytics

Page 12:

Understanding data analytics problems - Exploring web pages categorization

2. Designing data requirement
• the code for the extraction process from Google Analytics

Page 13:

Understanding data analytics problems - Exploring web pages categorization

3. Preprocessing data

Page 14:

Understanding data analytics problems - Exploring web pages categorization

4. Performing analytics over data
• Initialize by setting the Hadoop variables & loading the RHadoop library
• Upload the datasets to HDFS

Page 15:

Understanding data analytics problems - Exploring web pages categorization

4. Performing analytics over data
• MapReduce 1

Page 16:

Understanding data analytics problems - Exploring web pages categorization

4. Performing analytics over data
• MapReduce 1

Page 17:

Understanding data analytics problems - Exploring web pages categorization

4. Performing analytics over data
• MapReduce 2

Page 18:

Understanding data analytics problems - Exploring web pages categorization

4. Performing analytics over data
• MapReduce 2
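
The code for the two MapReduce jobs is not part of this transcript. As a rough local sketch of the idea, the following base R snippet assumes that the first pass sums visits per page and the second pass assigns one of three popularity categories. The two-pass split, the toy data, and the cut points (thirds of the largest total) are all assumptions made for illustration, not the presenter's implementation.

```r
# Toy input: one row per (page, day); all values are invented
pageviews <- data.frame(
  pagePath = c("/home", "/home", "/pricing", "/blog", "/blog", "/contact"),
  visits   = c(120, 90, 100, 15, 10, 2),
  stringsAsFactors = FALSE
)

# Pass 1 (map: emit pagePath -> visits; reduce: sum per key)
totals <- tapply(pageviews$visits, pageviews$pagePath, sum)

# Pass 2 (label each page total with a category)
# Hypothetical cut points: thirds of the largest total
max_total <- max(totals)
categorize <- function(total) {
  if (total > 2 / 3 * max_total) {
    "high"
  } else if (total > 1 / 3 * max_total) {
    "medium"
  } else {
    "low"
  }
}
categories <- vapply(totals, categorize, character(1))

result <- data.frame(
  pagePath = names(totals),
  total    = as.numeric(totals),
  category = unname(categories),
  stringsAsFactors = FALSE
)
print(result)
```

On a cluster the same two steps would run as separate jobs, with the output of the first stored in HDFS as the input of the second.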

Page 19:

Understanding data analytics problems - Exploring web pages categorization

5. Visualizing data
• the web page categorization output using the three categories
• if we have more information, such as sources, we can represent the web pages as nodes of a graph, colored by popularity, with directed edges when users follow the links

Page 20:

Understanding data analytics problems - Computing the frequency of stock market change

1. Identifying the problem
• Calculate the frequency of past changes for one particular symbol of the stock market (comparable to a Fourier transformation)
• With this information, the investor can get more insights on changes for different time periods
• Goal: to calculate the frequencies of percentage change
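
To make the goal concrete, here is a minimal local sketch of a frequency count of percentage changes. It assumes daily open and close prices and rounds each change to a whole percent before counting; the price values and the rounding choice are illustrative assumptions.

```r
# Toy daily prices (invented for illustration)
prices <- data.frame(
  open  = c(100, 102, 101, 100, 50),
  close = c(101, 100.98, 102.01, 101, 50.5)
)

# Percentage change from open to close, rounded to a whole percent
pct_change <- round((prices$close - prices$open) / prices$open * 100)

# Frequency of each percentage change
freq <- table(pct_change)
print(freq)
```

Rounding controls the bin width: finer rounding gives more distinct keys, each with a lower count.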

Page 21:

Understanding data analytics problems - Computing the frequency of stock market change

2. Designing data requirement
• Use Yahoo! Finance as the input dataset
  – From month
  – From day
  – From year
  – To month
  – To day
  – To year
  – Symbol

Page 22:

Understanding data analytics problems - Computing the frequency of stock market change

3. Preprocessing data
• To perform the analytics over the extracted dataset

    stock_BP <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=BP")
    write.csv(stock_BP, "table.csv", row.names=FALSE)

• Uploading table.csv to HDFS

    bin/hadoop dfs -put /usr/jyk/table.csv /input/

Page 23:

Understanding data analytics problems - Computing the frequency of stock market change

4. Performing analytics over data
• Mapper: stock_mapper.R

    #!/usr/bin/env Rscript
    # Mapper: emit "<change>\t1" for every data row read from stdin
    options(warn=-1)
    input <- file("stdin", "r")
    while (length(currentLine <- readLines(input, n=1, warn=FALSE)) > 0) {
      fields <- unlist(strsplit(currentLine, ","))
      open <- as.double(fields[2])   # Open price
      close <- as.double(fields[5])  # Close price
      if (is.na(open) || is.na(close)) next  # skip the CSV header row
      change <- (close - open)
      write(paste(change, 1, sep="\t"), stdout())
    }
    close(input)

Page 24:

Understanding data analytics problems - Computing the frequency of stock market change

4. Performing analytics over data
• Reducer: stock_reducer.R

    #!/usr/bin/env Rscript
    # Reducer: sum the counts of consecutive identical keys read from stdin
    current.key <- NA
    current.val <- 0.0
    conn <- file("stdin", "r")
    while (length(next.line <- readLines(conn, n=1)) > 0) {
      split.line <- strsplit(next.line, "\t")
      key <- split.line[[1]][1]
      val <- as.numeric(split.line[[1]][2])
      if (is.na(current.key)) {
        # first record
        current.key <- key
        current.val <- val
      } else {
        if (current.key == key) {
          current.val <- current.val + val
        } else {
          # key changed: emit the finished total and start a new one
          write(paste(current.key, current.val, sep="\t"), stdout())
          current.key <- key
          current.val <- val
        }
      }
    }
    # emit the total for the last key
    write(paste(current.key, current.val, sep="\t"), stdout())
    close(conn)
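
The mapper and reducer logic can be checked in a single R session before submitting a job. The sketch below imitates the map, shuffle/sort, and reduce steps in memory; the CSV lines are invented, and `map_line` is a hypothetical helper written for this illustration rather than part of the original scripts.

```r
# Toy CSV lines in the column order Date,Open,High,Low,Close (values invented)
csv_lines <- c(
  "Date,Open,High,Low,Close",
  "2015-04-20,10,12,9,11",
  "2015-04-21,11,13,10,12",
  "2015-04-22,12,12,9,10",
  "2015-04-23,20,22,19,21"
)

# Map step: one "<change>\t1" record per data row
map_line <- function(line) {
  fields <- unlist(strsplit(line, ","))
  change <- suppressWarnings(as.double(fields[5]) - as.double(fields[2]))
  if (is.na(change)) return(NULL)  # header or malformed row
  paste(change, 1, sep = "\t")
}
mapped <- unlist(lapply(csv_lines, map_line))

# Shuffle/sort step: Hadoop hands records to the reducer grouped by key
shuffled <- sort(mapped)

# Reduce step: sum the counts for each key
parts   <- strsplit(shuffled, "\t")
keys    <- vapply(parts, `[`, character(1), 1)
counts  <- as.numeric(vapply(parts, `[`, character(1), 2))
reduced <- tapply(counts, keys, sum)
print(reduced)
```

The streaming reducer relies on that sort step: it only compares each key with the previous one, so identical keys must arrive next to each other.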

Page 25:

Understanding data analytics problems - Computing the frequency of stock market change

4. Performing analytics over data
• MapReduce

    /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-streaming-2.5.0-mr1-cdh5.3.3.jar \
      -input input/table.csv \
      -output outputs \
      -file /home/jyk/Documents/stock_mapper.R \
      -mapper /home/jyk/Documents/stock_mapper.R \
      -file /home/jyk/Documents/stock_reducer.R \
      -reducer /home/jyk/Documents/stock_reducer.R

Page 26:

Understanding data analytics problems - Computing the frequency of stock market change

4. Performing analytics over data

Page 27:

Understanding data analytics problems - Computing the frequency of stock market change

4. Performing analytics over data

Page 28:

Understanding data analytics problems - Computing the frequency of stock market change

5. Visualizing data

    library(ggplot2)
    myStockData <- read.delim("stock_output.txt", header=FALSE, sep="", dec=".")
    ggplot(myStockData, aes(x=V1, y=V2)) + geom_smooth() + geom_point()

Page 29:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

1. Identifying the problem
• How large datasets can be resampled and fitted with the random forest model using R and Hadoop
• To predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration

Page 30:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

2. Designing data requirement
• Use Kaggle competition
  http://www.kaggle.com/c/bluebook-for-bulldozers

File name                      Description format (size)
Train                          This is a training set that contains data for 2011.
Valid                          This is a validation set that contains data from January 1, 2012 to April 30, 2012.
Data dictionary                This is the metadata of the training dataset variables.
Machine_Appendix               This contains the correct year of manufacturing for a given machine along with the make, model, and product class details.
Test                           This is the test dataset.
random_forest_benchmark_test   This is the benchmark solution provided by the host.

Page 31:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

3. Preprocessing data
• Loading the Train.csv & Machine_Appendix.csv datasets

Page 32:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

3. Preprocessing data
• Add a few features & merge

Page 33:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

4. Performing analytics over data
• Random sampling
  – N data points in our initial training set
  – A set of M different models for an ensemble classifier
  – Each of the M models will be fitted with K data points
• Poisson sampling
  – KM < N: we are not using the full amount of data available to us
  – KM = N: we can exactly partition our dataset to produce totally independent samples
  – KM > N: we must resample some of our data with replacement
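
The three cases above can be illustrated with a small base R sketch. `draw_samples` is a hypothetical helper written for this illustration; it cuts disjoint blocks when KM ≤ N and falls back to sampling with replacement when KM > N.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical helper: M samples of K row indices out of N data points
draw_samples <- function(N, M, K) {
  if (K * M <= N) {
    # Enough data: shuffle once and cut M disjoint blocks of K points
    idx <- sample(N)[seq_len(K * M)]
    split(idx, rep(seq_len(M), each = K))
  } else {
    # Not enough data for disjoint samples: resample with replacement
    lapply(seq_len(M), function(m) sample(N, K, replace = TRUE))
  }
}

partial <- draw_samples(N = 12, M = 2, K = 3)  # KM < N: 6 of 12 points used
exact   <- draw_samples(N = 12, M = 3, K = 4)  # KM = N: exact partition
over    <- draw_samples(N = 12, M = 5, K = 4)  # KM > N: points repeat
```

In the KM > N case some points necessarily appear more than once across the samples, so the samples are no longer fully independent.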

Page 34:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

4. Performing analytics over data
• Poisson sampling
  – the generation of independent samples by using N training input points
  – three parameters: N, M, and K, where K is fixed
  – define T = K/N to eliminate the need for the value of N in advance
  – K/N: average fraction of input data in each model (10%)
      T = frac.per.model = 0.1
  – number of models
      M = num.models = 50
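
A minimal local sketch of Poisson sampling with the two parameters above: for each model, every input row receives an independent Poisson(T) count and is repeated that many times. The toy `train` data frame and the helper name `poisson_subsample` are assumptions for illustration; on a cluster this logic would sit in the Mapper.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

frac.per.model <- 0.1  # T = K/N, value from the slide
num.models     <- 50   # M, value from the slide

# Toy stand-in for the training set (hypothetical columns)
n <- 1000
train <- data.frame(id = seq_len(n), x = rnorm(n))

# For model i, draw a Poisson(T) count per row and repeat each row
# that many times; no row needs to know the total N
poisson_subsample <- function(i, input) {
  draws   <- rpois(n = nrow(input), lambda = frac.per.model)
  indices <- rep(seq_len(nrow(input)), draws)
  input[indices, , drop = FALSE]
}

samples <- lapply(seq_len(num.models), poisson_subsample, input = train)

# Each sample holds about T * N = 100 rows on average
sample_sizes <- vapply(samples, nrow, integer(1))
print(summary(sample_sizes))
```

Because each row is handled independently, the data can be split across Mappers in any way without changing the sampling scheme.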

Page 35:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

4. Performing analytics over data
• Fitting random forests
  – Normal fitting
  – Overfitting
  – Underfitting

Page 36:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

4. Performing analytics over data
• Mapper

Page 37:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

4. Performing analytics over data
• Reducer

Page 38:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

4. Performing analytics over data
• MapReduce

Page 39:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

4. Performing analytics over data
• Each of the 50 samples produced a random forest with 10 trees, so the final random forest is a collection of 500 trees, fitted in a distributed fashion over a Hadoop cluster.

Page 40:

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

5. Visualizing data

Page 41:

Thank you