mapreduce and hadoop file system - lib.hfu.edu.tw · what is hadoop? by distributing the data,...
TRANSCRIPT
-
Hadoop+R
-
Outline
Hadoop
R
Hadoop+R
Mongodb
Fluentd
Drill
Demo
2
-
Hadoop
3
-
What is Hadoop? Hadoop is a software platform that lets one easily write and run applications that process vast amount of data.
Hadoop can reliably store and process petabytes. PB
It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
4
-
What is Hadoop? By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This make it extremely rapid. Hadoop
Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures. Hadoop
5
-
Apache Hadoop
http://hadoop.apache.org/
6
http://hadoop.apache.org/http://hadoop.apache.org/http://hadoop.apache.org/http://hadoop.apache.org/http://hadoop.apache.org/http://hadoop.apache.org/ -
Hadoop
MapReduce
HDFS
7
-
MapReduce
8
-
MapReduce
(Software Framework)
(Clusters)
(Big Data)
(Map)(Reduce)
JAVAR
9
-
MapReduce
Map
(Master Node) InputInput key/value(Slave Nodes)
Key/Value (Intermediate) key/value
(Kin, Vin) list(Kinter, Vinter)
10
-
MapReduce
Reduce
key Value key/value
Key Value Key/Value
(Kinter, list(Vinter)) list(Kout, Vout)
11
-
HDFS
12
-
Hadoop Distributed File System
(HDFS)
Write-once-read-many
BlockBlock(Replica)DataNode NameNodeHDFS (File
System Namespace) DataNode(Blocks)
13
-
14
Hadoop 2.X
http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/
http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/ -
YARN
Yet Another Resource Negotiator
YARN is a more general purpose
framework of which classic
MapReduce is one application.
YARN MapReduce
15
-
16
Hadoop Ecosystem
http://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-cloudera
http://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-cloudera -
HBase
HBase:
HDFS
Join
Family Columns
17
-
Hive
Developed at Facebook
Relational database built on Hadoop
Maintains list of table schemas
SQL-like query language (HiveQL)
Can call Hadoop Streaming scripts from
HiveQL
18
-
Sqoop
Hadoop MySQL,SQL Server Hadoop HDFS HDFS
19
-
Fluentd
20
Fluentd + MongoDB
-
MongoDB
RDBMS (Big Data) NoSQL
Mongodb10genNoSQLMongo
21
-
Drill
Apache Drill HadoopSQL
22
-
R
23
-
R Ross Ihaka Robert Gentleman (1966)
AT & T S
R WindowsUnixLinux Apple MacOS
http://www.r-project.org
24
http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/ -
R (Function)5000
R (Interpreted Language)
R (Object Oriented Language)
25
-
R
http://www.kdnuggets.com/polls/2014/languages-analytics-data-mining-data-science.html 26
-
IDE
R Studio
http://www.rstudio.com/
R
http://www.r-project.org/
27
http://www.rstudio.com/http://www.rstudio.com/http://www.rstudio.com/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/ -
R
Big Data
28
-
Hadoop+R
29
-
Hadoop+R
R
R Memory ()
Hadoop R
R Hadoop
30
-
RHadoop
Revolution Analytics
MapReduceHDFS HBase
rmr2
rhdfs
rhbase
31
-
RHadoop rhdfs
R HDFS
rmr2
MapReduce
rhbase
HBase
32
-
RHadoop
33
2
-
D e m o
34
-
Database
RHive package R studio
Sqoop
35
-
Sqoop
sqoop import --connect
"jdbc:sqlserver://$SQL_SRV:$PORT;database=mitopac;username=$U
;password=$P" --hive-import -m 1 --table reader --warehouse-dir
$DEST_DIR --map-column-hive reader11=String --hive-overwrite
36
-
D e m o
37
-
Log
-
Fluentd+Mongodb(192.168.244.131)
1.# login root 2.# mongo httpd 3.> db.accesslog.count() 4.> db.accesslog.find() 5.quit()
Drill (192.168.244.131:8047, select * from mongo.httpd. accesslog limit 10)
1. bin/sqlline -u jdbc:drill:zk=local (./start)
2. use mongo.httpd;
3. show tables;
4. select * from accesslog; select host from accesslog;
5. !quit
39