Social Computing and Big Data Analytics (社群運算與大數據分析)




Page 1: Social Computing and Big Data Analytics 社群運算與大數據分析

Social Computing and Big Data Analytics (社群運算與大數據分析)

1042SCBDA03
MIS MBA (M2226) (8628)
Wed, 8,9 (15:10-17:00) (Q201)

Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem
(大數據基礎：MapReduce 典範、Hadoop 與 Spark 生態系統)

Min-Yuh Day (戴敏育), Assistant Professor
Dept. of Information Management, Tamkang University (淡江大學 資訊管理學系)
http://mail.tku.edu.tw/myday/

2016-03-02

Page 2: Social Computing and Big Data Analytics 社群運算與大數據分析

Syllabus (課程大綱)

Week (週次) | Date (日期) | Subject/Topics (內容)
1 | 2016/02/17 | Course Orientation for Social Computing and Big Data Analytics (社群運算與大數據分析課程介紹)
2 | 2016/02/24 | Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data (資料科學與大數據分析：探索、分析、視覺化與呈現資料)
3 | 2016/03/02 | Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem (大數據基礎：MapReduce 典範、Hadoop 與 Spark 生態系統)

Page 3: Social Computing and Big Data Analytics 社群運算與大數據分析

Syllabus (課程大綱)

Week (週次) | Date (日期) | Subject/Topics (內容)
4 | 2016/03/09 | Big Data Processing Platforms with SMACK: Spark, Mesos, Akka, Cassandra and Kafka (大數據處理平台 SMACK：Spark, Mesos, Akka, Cassandra, Kafka)
5 | 2016/03/16 | Big Data Analytics with Numpy in Python (Python Numpy 大數據分析)
6 | 2016/03/23 | Finance Big Data Analytics with Pandas in Python (Python Pandas 財務大數據分析)
7 | 2016/03/30 | Text Mining Techniques and Natural Language Processing (文字探勘分析技術與自然語言處理)
8 | 2016/04/06 | Off-campus study (教學行政觀摩日)

Page 4: Social Computing and Big Data Analytics 社群運算與大數據分析

Syllabus (課程大綱)

Week (週次) | Date (日期) | Subject/Topics (內容)
9 | 2016/04/13 | Social Media Marketing Analytics (社群媒體行銷分析)
10 | 2016/04/20 | Midterm Project Report (期中報告)
11 | 2016/04/27 | Deep Learning with Theano and Keras in Python (Python Theano 和 Keras 深度學習)
12 | 2016/05/04 | Deep Learning with Google TensorFlow (Google TensorFlow 深度學習)
13 | 2016/05/11 | Sentiment Analysis on Social Media with Deep Learning (深度學習社群媒體情感分析)

Page 5: Social Computing and Big Data Analytics 社群運算與大數據分析

Syllabus (課程大綱)

Week (週次) | Date (日期) | Subject/Topics (內容)
14 | 2016/05/18 | Social Network Analysis (社會網絡分析)
15 | 2016/05/25 | Measurements of Social Network (社會網絡量測)
16 | 2016/06/01 | Tools of Social Network Analysis (社會網絡分析工具)
17 | 2016/06/08 | Final Project Presentation I (期末報告 I)
18 | 2016/06/15 | Final Project Presentation II (期末報告 II)

Page 6: Social Computing and Big Data Analytics 社群運算與大數據分析

2016/03/02
Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem
(大數據基礎：MapReduce 典範、Hadoop 與 Spark 生態系統)

Page 7: Social Computing and Big Data Analytics 社群運算與大數據分析

Architecture of Big Data Analytics

[Figure: four-stage architecture. Raw data flows from the sources into transformation; transformed data flows into the platforms and tools; big data analytics results flow into the applications.]

Big Data Sources: Internal; External; Multiple formats; Multiple locations; Multiple applications
Big Data Transformation: Middleware; Extract, Transform, Load; Data Warehouse; Traditional Format (CSV, Tables)
Big Data Platforms & Tools: Hadoop, MapReduce, Pig, Hive, Jaql, Zookeeper, HBase, Cassandra, Oozie, Avro, Mahout, Others
Big Data Analytics Applications: Queries, Reports, OLAP, Data Mining

Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications
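The Extract, Transform, Load stage in this architecture can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the raw CSV text is made up, the `sales` table name is hypothetical, and an in-memory SQLite database stands in for a data warehouse.

```python
import csv
import io
import sqlite3

# Made-up raw data in a traditional format (CSV); one row has a missing amount.
RAW_CSV = "customer,amount\nalice, 10.5\nbob,\ncarol,7\n"

def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount and convert the rest to floats."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if amount:
            cleaned.append((row["customer"].strip(), float(amount)))
    return cleaned

def load(rows, connection):
    """Load: write the cleaned rows into a warehouse-style table."""
    connection.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    connection.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    connection.commit()

warehouse = sqlite3.connect(":memory:")  # in-memory SQLite stands in for a data warehouse
load(transform(extract(RAW_CSV)), warehouse)
row_count = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(row_count, total)
```

Keeping the three stages as separate functions mirrors the raw-data to transformed-data flow in the figure; a real pipeline would replace each stage with its own tooling.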

Page 8: Social Computing and Big Data Analytics 社群運算與大數據分析

Business Intelligence (BI) Infrastructure

Source: Kenneth C. Laudon & Jane P. Laudon (2014), Management Information Systems: Managing the Digital Firm, Thirteenth Edition, Pearson.

Page 9: Social Computing and Big Data Analytics 社群運算與大數據分析

Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem

Page 10: Social Computing and Big Data Analytics 社群運算與大數據分析

Source: https://www.thalesgroup.com/en/worldwide/big-data/big-data-big-analytics-visual-analytics-what-does-it-all-mean

Page 11: Social Computing and Big Data Analytics 社群運算與大數據分析

MapReduce Paradigm


Page 12: Social Computing and Big Data Analytics 社群運算與大數據分析

MapReduce Paradigm

[Figure: Big Data is split across parallel map tasks (Map0, Map1, Map2, Map3); their results feed the reduce tasks (Reduce0, Reduce1, Reduce2, Reduce3), which produce the Output Data.]
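The map and reduce stages in the figure can be sketched in plain Python with a word count. This is a single-process illustration of the paradigm, not Hadoop code; the input chunks, the function names, and the toy partitioner are all assumptions made for the example.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks, num_reducers):
    """Shuffle: route each key to one reducer partition and group its values."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_chunks:
        for key, value in pairs:
            # Toy deterministic partitioner (an assumption, not Hadoop's own).
            index = sum(key.encode("utf-8")) % num_reducers
            partitions[index][key].append(value)
    return partitions

def reduce_phase(partition):
    """Reduce: sum the grouped values for every key in one partition."""
    return {key: sum(values) for key, values in partition.items()}

def map_reduce(chunks, num_reducers=4):
    mapped = [map_phase(chunk) for chunk in chunks]   # one map task per chunk
    partitions = shuffle(mapped, num_reducers)
    output = {}
    for partition in partitions:                      # one reduce task per partition
        output.update(reduce_phase(partition))
    return output

# Made-up input, pre-split into four chunks to mirror Map0..Map3.
chunks = ["big data big analytics", "data science", "big data platforms", "spark and hadoop"]
word_counts = map_reduce(chunks)
print(sorted(word_counts.items()))
```

Because every occurrence of a key lands in the same partition, each reduce task can work independently, which is what lets a real framework run the reducers in parallel.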

Page 13: Social Computing and Big Data Analytics 社群運算與大數據分析

Hadoop Ecosystem


Page 14: Social Computing and Big Data Analytics 社群運算與大數據分析

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

Source: http://hadoop.apache.org/

Page 15: Social Computing and Big Data Analytics 社群運算與大數據分析

[Figure: the two core layers of Hadoop]

MapReduce: Processing
HDFS: Storage

Source: http://hadoop.apache.org/

Page 16: Social Computing and Big Data Analytics 社群運算與大數據分析

Big Data with Hadoop Architecture

Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf

Page 17: Social Computing and Big Data Analytics 社群運算與大數據分析

Big Data with Hadoop Architecture
Logical Architecture
Processing: MapReduce

Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
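One common way for scripts to take part in MapReduce processing is the Hadoop Streaming utility, where a mapper and a reducer exchange tab-separated text records and the framework sorts records by key between the two. The sketch below only mimics that contract locally; it calls no Hadoop APIs, and the function names and sample lines are assumptions made for illustration.

```python
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum counts for runs of identical keys; input must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_locally(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    records = sorted(mapper(lines), key=lambda record: record.split("\t", 1)[0])
    return list(reducer(records))

result = run_locally(["hadoop spark hadoop", "spark hdfs"])
print(result)
```

The reducer relies on the sort step: it never holds more than one key's records at a time, which is why the framework's sorting between map and reduce matters.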

Page 18: Social Computing and Big Data Analytics 社群運算與大數據分析

Big Data with Hadoop Architecture
Logical Architecture
Storage: HDFS

Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf

Page 19: Social Computing and Big Data Analytics 社群運算與大數據分析

Big Data with Hadoop Architecture
Process Flow

Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf

Page 20: Social Computing and Big Data Analytics 社群運算與大數據分析

Big Data with Hadoop Architecture
Hadoop Cluster

Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf

Page 21: Social Computing and Big Data Analytics 社群運算與大數據分析

Hadoop Ecosystem

Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing

Page 22: Social Computing and Big Data Analytics 社群運算與大數據分析

HDP (Hortonworks Data Platform)
A Complete Enterprise Hadoop Data Platform

Source: http://hortonworks.com/hdp/

Page 23: Social Computing and Big Data Analytics 社群運算與大數據分析

Apache Hadoop
Hortonworks Data Platform

Source: http://hortonworks.com/hdp/

Page 24: Social Computing and Big Data Analytics 社群運算與大數據分析

Hadoop and Data Analytics Tools

Source: http://hortonworks.com/hdp/

Page 25: Social Computing and Big Data Analytics 社群運算與大數據分析

Hadoop 1 Hadoop 2

Source: http://hortonworks.com/hadoop/tez/

Page 26: Social Computing and Big Data Analytics 社群運算與大數據分析

Big Data Solution

Source: http://www.newera-technologies.com/big-data-solution.html

Page 27: Social Computing and Big Data Analytics 社群運算與大數據分析

Traditional ETL Architecture

Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf

Page 28: Social Computing and Big Data Analytics 社群運算與大數據分析

Offload ETL with Hadoop (Big Data Architecture)

Source: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf

Page 29: Social Computing and Big Data Analytics 社群運算與大數據分析

Spark Ecosystem


Page 30: Social Computing and Big Data Analytics 社群運算與大數據分析

Apache Spark is a fast and general engine for large-scale data processing.

Lightning-fast cluster computing

Source: http://spark.apache.org/

Page 31: Social Computing and Big Data Analytics 社群運算與大數據分析

Logistic regression in Hadoop and Spark


Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Source: http://spark.apache.org/
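Logistic regression is a natural benchmark here because it is typically fit iteratively, with every iteration scanning the whole dataset again. The following is a minimal single-machine sketch in plain Python, not Spark's or Hadoop's implementation; the toy data, learning rate, and iteration count are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(points, labels, iterations=200, learning_rate=0.5):
    """Batch gradient descent; every iteration makes one full pass over the data."""
    weights = [0.0] * len(points[0])
    passes_over_data = 0
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(points, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                gradient[j] += error * xi
        passes_over_data += 1
        weights = [w - learning_rate * g / len(points) for w, g in zip(weights, gradient)]
    return weights, passes_over_data

# Made-up toy data: the first feature is a constant bias term.
points = [(1.0, 0.0), (1.0, 1.0), (1.0, 3.0), (1.0, 4.0)]
labels = [0, 0, 1, 1]
weights, passes_over_data = train_logistic_regression(points, labels)
predictions = [round(sigmoid(sum(w * xi for w, xi in zip(weights, x)))) for x in points]
print(predictions, passes_over_data)
```

With 200 iterations the data is read 200 times, so whether each pass comes from memory or from disk dominates the running time of a job like this.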

Page 32: Social Computing and Big Data Analytics 社群運算與大數據分析

Ease of Use

• Write applications quickly in Java, Scala, Python, R.

Source: http://spark.apache.org/

Page 33: Social Computing and Big Data Analytics 社群運算與大數據分析

Word count in Spark's Python API

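    # `spark` is the Spark context object assumed by the original slide; the HDFS path is elided there.
    text_file = spark.textFile("hdfs://...")
    counts = (text_file.flatMap(lambda line: line.split())   # split each line into words
                       .map(lambda word: (word, 1))          # pair each word with a count of 1
                       .reduceByKey(lambda a, b: a + b))     # sum the counts for each word

Source: http://spark.apache.org/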

Page 34: Social Computing and Big Data Analytics 社群運算與大數據分析

Spark and Hadoop

Source: http://spark.apache.org/

Page 35: Social Computing and Big Data Analytics 社群運算與大數據分析

Spark Ecosystem

Source: http://spark.apache.org/

Page 36: Social Computing and Big Data Analytics 社群運算與大數據分析

Spark Ecosystem

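[Figure: Spark at the center, surrounded by its modules and the systems it works with]

Spark modules: Spark Streaming, Spark SQL, MLlib (machine learning), GraphX (graph)
Related systems: Kafka, Flume, H2O, Hive, Titan, HBase, Cassandra, HDFS

Source: Mike Frampton (2015), Mastering Apache Spark, Packt Publishing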

Page 37: Social Computing and Big Data Analytics 社群運算與大數據分析

Hadoop vs. Spark

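[Figure: two rows comparing iterative processing. Labels recoverable from the slide: Input, HDFS read, Iter. 1, HDFS write, HDFS read, Iter. 2, HDFS write in one row; Input, Iter. 1, Iter. 2 in the other.]

Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing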
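The difference between the two rows of labels can be sketched by counting storage accesses. `CountingStore` below is a toy stand-in that only counts reads and writes; it is not an HDFS client, and the `step` function is a made-up placeholder for one iteration of a real job.

```python
class CountingStore:
    """Toy stand-in for distributed storage that counts reads and writes."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self.data)

    def write(self, data):
        self.writes += 1
        self.data = list(data)

def step(values):
    """One iteration of some iterative job; here it just halves every value."""
    return [v / 2 for v in values]

def disk_based(store, iterations):
    """Each iteration reads its input from storage and writes its result back."""
    for _ in range(iterations):
        store.write(step(store.read()))
    return store.data

def in_memory(store, iterations):
    """Read the input once, then keep intermediate results in memory."""
    values = store.read()
    for _ in range(iterations):
        values = step(values)
    return values

disk_store = CountingStore([8.0, 16.0])
memory_store = CountingStore([8.0, 16.0])
disk_result = disk_based(disk_store, iterations=2)
memory_result = in_memory(memory_store, iterations=2)
print(disk_store.reads, disk_store.writes, memory_store.reads, memory_store.writes)
```

Both approaches compute the same result; the disk-based loop pays one read and one write per iteration, while the in-memory loop pays a single read regardless of the iteration count.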

Page 38: Social Computing and Big Data Analytics 社群運算與大數據分析

Steps to Install Hadoop on a Personal Computer (Windows/OS X)

Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5

Page 39: Social Computing and Big Data Analytics 社群運算與大數據分析

Hadoop: Linux-Based Software

[Figure: four machines, each labeled LINUX]

Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5

Page 40: Social Computing and Big Data Analytics 社群運算與大數據分析

Appliance

[Figure: layered stack, listed from top to bottom]
Hadoop
Linux
Virtual Machine (VirtualBox / VMware)
Personal Computer (Windows / OS X)

Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5

Page 41: Social Computing and Big Data Analytics 社群運算與大數據分析

Connection to Hadoop

[Figure: the same layered stack, with a browser on the host accessing Hadoop inside the virtual machine]
Hadoop
Linux
Virtual Machine (VirtualBox / VMware)
Personal Computer (Windows / OS X)
Browser: access from host

Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5

Page 42: Social Computing and Big Data Analytics 社群運算與大數據分析

Steps to Install Hadoop on a Personal Computer (Windows/OS X)

Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5

Step 1. Download and Install VirtualBox

Step 2. Download Appliance

Step 3. Import Appliance

Step 4. Configure Virtual Machine (VM)

Step 5. Start Virtual Machine (VM)

Step 6. Test Connection From Host
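Step 6 can also be checked from a script rather than a browser. The sketch below simply tries to open a TCP connection from the host to the virtual machine. The address `127.0.0.1` and port `8888` are hypothetical placeholders, not values taken from the slides; substitute whatever host and port your VM actually exposes.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical address: replace with the host and port your VM forwards.
    VM_HOST, VM_PORT = "127.0.0.1", 8888
    print(f"{VM_HOST}:{VM_PORT} reachable:", can_connect(VM_HOST, VM_PORT))
```

A successful connection only shows that something is listening on that port; opening the page in a browser, as the slides suggest, is still the way to confirm that it is the Hadoop environment responding.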

Page 43: Social Computing and Big Data Analytics 社群運算與大數據分析

VirtualBox

https://www.virtualbox.org/

Page 44: Social Computing and Big Data Analytics 社群運算與大數據分析

Steps to Install Hadoop on a Personal Computer (Windows/OS X)

Source: https://www.youtube.com/watch?v=rO-V1mxhzcM&list=PLyZEf-TOnZen8E5m5TIpIsdok2fyKDNRa&index=5

Step 1. Download and Install VirtualBox

Step 2. Download Appliance

Step 3. Import Appliance

Step 4. Configure Virtual Machine (VM)

Step 5. Start Virtual Machine (VM)

Step 6. Test Connection From Host

Hortonworks Sandbox

Page 45: Social Computing and Big Data Analytics 社群運算與大數據分析

Hortonworks Sandbox
The easiest way to get started with Enterprise Hadoop

http://hortonworks.com/products/hortonworks-sandbox/#install

Page 46: Social Computing and Big Data Analytics 社群運算與大數據分析

Get started on Hadoop with these tutorials based on the Hortonworks Sandbox

http://hortonworks.com/tutorials/

Page 47: Social Computing and Big Data Analytics 社群運算與大數據分析

Apache Hadoop

http://hadoop.apache.org/

Page 48: Social Computing and Big Data Analytics 社群運算與大數據分析

Apache Hadoop

http://hadoop.apache.org/releases.html#Download

Page 49: Social Computing and Big Data Analytics 社群運算與大數據分析


Apache Hadoop

Source: http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/releasenotes.html

Page 50: Social Computing and Big Data Analytics 社群運算與大數據分析

Apache Hadoop 2.7.2

Source: http://hadoop.apache.org/docs/r2.7.2/

Page 51: Social Computing and Big Data Analytics 社群運算與大數據分析

Hadoop: Setting up a Single Node Cluster

Source: http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html

Page 52: Social Computing and Big Data Analytics 社群運算與大數據分析

Hadoop Cluster Setup

Source: http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/ClusterSetup.html

Page 53: Social Computing and Big Data Analytics 社群運算與大數據分析

Apache Hadoop YARN

Source: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

Page 54: Social Computing and Big Data Analytics 社群運算與大數據分析

Apache Spark

http://spark.apache.org/

Page 55: Social Computing and Big Data Analytics 社群運算與大數據分析

References

• EMC Education Services (2015), Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley

• Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing

• Mike Frampton (2015), Mastering Apache Spark, Packt Publishing

• Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics, http://www.slideshare.net/deepakramanathan/sas-modernization-architectures-big-data-analytics