large data analysis using rhipe/rhadoop - 统计之都 · pdf filelarge data analysis using...
TRANSCRIPT
Large Data Analysis Using Rhipe/RHadoop
赵扬 (Kevin)赵扬 (Kevin)
Behavioral Insights and Science Team
11/2/2013
Agenda
1. What is Rhipe/RHadoop?
Introduce the basic structure of R+Hadoop platform
2. Why do we use Rhipe/RHadoop?
Advantages/Disadvantages of using Rhipe
3. How to use Rhipe/RHadoop?
2
3. How to use Rhipe/RHadoop?
Share 2 specific user cases about data analysis using Rhipe
4. How to learn R+Hadoop more efficiently
Learn Rhipe step by step, easy and fast
What is Rhipe/RHadoop?
• Rhipe/RHadoop is just a R package which provides an API(应用程序接口) to use Hadoop.
• RHadoop and Rhipe(“hree-pay”)
� RHadoop is comprised of three packages and developed by Revolution Analytics.
1. RHDFS
2. RHBASE
3. rmr
� RHIPE is developed by my classmate Saptarshi Guha in Purdue University:
1. Same idea of Rhadoop, different API design
2. More Integrated with plots(package: Trelliscope )
3. I have used Rhipe to do project for 2 years when I pursuing my Phd degree, it works perfect!
4
What is Rhipe/RHadoop?
Who is the prefect user of Rhipe/RHadoop?
1. Familiar with R
2. Know some basic statistical knowledge(mean, max, min)2. Know some basic statistical knowledge(mean, max, min)
3. Get general idea about Hadoop MapReduce framework
5
Idea of Hadoop
What is Hadoop? Hadoop is used for parallel computing? (HDFS+MapReduce)
1. Divide(map) and Recombine(reduce)
2. Every mapper/Reducer has key-value pairs
Concrete user case: Parallel compute average student weights in a school by gender
6
Why do we use Rhipe/RHadoop?
Rhipe Hadoop(java) Pig/scoobi/cascading
Snowfall /multicore( R parallelpacakges)
User Friendly
Advantages of Rhipe
8
Handle Large data set
Apply statistical algorithm
Large Data visualization
Why do we use Rhipe/RHadoop?
Drawbacks of Rhipe
1. Need to write R code in map reduce format(learning curve)
2. Not as mature and stable as
9
2. Not as mature and stable as Hadoop(java), pig.
3. Need more formal R packages for statistical algorithm computation in large data set.
How to use Rhipe/RHadoop?
• Main page is here: http://www.datadr.org/index.html
• Rhipe installation
1. Install Hadoop on clusters
2. Install R as a shared library. This must be either installed on each of the nodes, or packaged as a zip to be passed to the nodes for each job.
3. Install Google Protocol Buffers
4. Set Environment Variables like HADOOP Path and R Path
5. Install Rhipe package
11
Word Count example using Rhipe
1. example=list()
2. # map step
3. example$map = expression({
4. words = unlist(strsplit(unlist(map.values), “ “))
5. lapply(words,function(r){rhcollect(r,1)})
6. })
7. #reduce step
8. example$reduce = expression(
9. pre = { total = 0 },
10. reduce = { total = total + sum(unlist(reduce.values )) },
11. post = { rhcollect(reduce.key,total) }
12. )
13. #set arguments and run
14. example$ifolder = input_path
15. example$ofolder = output_path
16. example$inout = c("text" ,"sequence")
17. example$jobname = "wordCount”
18. example$zips = zips
19. mr = do.call("rhmr",example)
20. ex = rhex(mr)
12
More complex case to use Rhipe
• Background: We monitored DNS transactions from J-root server in US.
• Data set: – 3 days period(Do analysis on 6 months finally)
– 109,341,181 transactions.
• Goal: – Monitor data status– Monitor data status
– Find out Hacker attacks if any
13
More complex case to use Rhipe
Steps to do data analysis in Rhipe for very large data set(3 steps):
1. Read raw text files into HDFS and create R data base using Rhipe
2. Grab key-value pairs from data base and do map-reduce jobs to get summary resultssummary results
3. Data visualization
14
How to learn R+Hadoop more efficiently
1. start with Ryan Hafen’smanual(easy to interpret):
http://ml.stat.purdue.edu/rhafen/rhipe/
18
How to learn R+Hadoop more efficiently
2. Go through examples from Rhipe manual written by SaptarshiGuha
http://www.datadr.or
19
http://www.datadr.org/doc/index.html