mapreduce and hadoop file system - lib.hfu.edu.tw · what is hadoop? by distributing the data,...

39
Hadoop+R應用於資料分析

Upload: others

Post on 30-Aug-2019

3 views

Category:

Documents


0 download

TRANSCRIPT

  • Hadoop+R

  • Outline

    Hadoop

    R

    Hadoop+R

    Mongodb

    Fluentd

    Drill

    Demo

    2

  • Hadoop

    3

  • What is Hadoop? Hadoop is a software platform that lets one easily write and run applications that process vast amount of data.

    Hadoop can reliably store and process petabytes. PB

    It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.

    4

  • What is Hadoop? By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This make it extremely rapid. Hadoop

    Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures. Hadoop

    5

  • Apache Hadoop

    http://hadoop.apache.org/

    6

    http://hadoop.apache.org/http://hadoop.apache.org/http://hadoop.apache.org/http://hadoop.apache.org/http://hadoop.apache.org/http://hadoop.apache.org/
  • Hadoop

    MapReduce

    HDFS

    7

  • MapReduce

    8

  • MapReduce

    (Software Framework)

    (Clusters)

    (Big Data)

    (Map)(Reduce)

    JAVAR

    9

  • MapReduce

    Map

    (Master Node) InputInput key/value(Slave Nodes)

    Key/Value (Intermediate) key/value

    (Kin, Vin) list(Kinter, Vinter)

    10

  • MapReduce

    Reduce

    key Value key/value

    Key Value Key/Value

    (Kinter, list(Vinter)) list(Kout, Vout)

    11

  • HDFS

    12

  • Hadoop Distributed File System

    (HDFS)

    Write-once-read-many

    BlockBlock(Replica)DataNode NameNodeHDFS (File

    System Namespace) DataNode(Blocks)

    13

  • 14

    Hadoop 2.X

    http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/

    http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/
  • YARN

    Yet Another Resource Negotiator

    YARN is a more general purpose

    framework of which classic

    MapReduce is one application.

    YARN MapReduce

    15

  • 16

    Hadoop Ecosystem

    http://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-cloudera

    http://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-clouderahttp://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-cloudera
  • HBase

    HBase:

    HDFS

    Join

    Family Columns

    17

  • Hive

    Developed at Facebook

    Relational database built on Hadoop

    Maintains list of table schemas

    SQL-like query language (HiveQL)

    Can call Hadoop Streaming scripts from

    HiveQL

    18

  • Sqoop

    Hadoop MySQL,SQL Server Hadoop HDFS HDFS

    19

  • Fluentd

    20

    Fluentd + MongoDB

  • MongoDB

    RDBMS (Big Data) NoSQL

    Mongodb10genNoSQLMongo

    21

  • Drill

    Apache Drill HadoopSQL

    22

  • R

    23

  • R Ross Ihaka Robert Gentleman (1966)

    AT & T S

    R WindowsUnixLinux Apple MacOS

    http://www.r-project.org

    24

    http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/
  • R (Function)5000

    R (Interpreted Language)

    R (Object Oriented Language)

    25

  • R

    http://www.kdnuggets.com/polls/2014/languages-analytics-data-mining-data-science.html 26

  • IDE

    R Studio

    http://www.rstudio.com/

    R

    http://www.r-project.org/

    27

    http://www.rstudio.com/http://www.rstudio.com/http://www.rstudio.com/http://www.r-project.org/http://www.r-project.org/http://www.r-project.org/
  • R

    Big Data

    28

  • Hadoop+R

    29

  • Hadoop+R

    R

    R Memory ()

    Hadoop R

    R Hadoop

    30

  • RHadoop

    Revolution Analytics

    MapReduceHDFS HBase

    rmr2

    rhdfs

    rhbase

    31

  • RHadoop rhdfs

    R HDFS

    rmr2

    MapReduce

    rhbase

    HBase

    32

  • RHadoop

    33

    2

  • D e m o

    34

  • Database

    RHive package R studio

    Sqoop

    35

  • Sqoop

    sqoop import --connect

    "jdbc:sqlserver://$SQL_SRV:$PORT;database=mitopac;username=$U

    ;password=$P" --hive-import -m 1 --table reader --warehouse-dir

    $DEST_DIR --map-column-hive reader11=String --hive-overwrite

    36

  • D e m o

    37

  • Log

  • Fluentd+Mongodb(192.168.244.131)

    1.# login root 2.# mongo httpd 3.> db.accesslog.count() 4.> db.accesslog.find() 5.quit()

    Drill (192.168.244.131:8047, select * from mongo.httpd. accesslog limit 10)

    1. bin/sqlline -u jdbc:drill:zk=local (./start)

    2. use mongo.httpd;

    3. show tables;

    4. select * from accesslog; select host from accesslog;

    5. !quit

    39