introduction of big data tools · using on-hand database management tools or traditional data...

32
NSLab, RIIT, Tsinghua Univ Introduction of Big Data Tools 2016.4.6

Upload: others

Post on 10-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Introduction of Big Data Tools

2016.4.6

Page 2: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

What is Big Data?

� Wiki: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

� IDC: Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.

2

Page 3: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

What is Big Data?

3

Page 4: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Four V’s of Big Data

4

Page 5: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Framework of Big Data Systems

User Interaction

Processing

Storage

Data Transfer

Collection

5

Page 6: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Distributed File System

6

Page 7: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Hadoop Distributed File System

Storage Functions� Write� Read� Append� Delete� Modify

Requirements� Large-Scale Data� Parallel Processing� Write Once, Read Many� Streaming I/O � Fault Tolerance

7

Page 8: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Hadoop Distributed File System

Design Features� Large Chunks� Metadata in Namenode (Single Master)� Replication� Rack Awareness� Pipelined Write

8

Page 9: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Hadoop Distributed File System

Benefits � Simple design with single master� Fault tolerance� Custom designed

Limitations� Only viable in a specific environment� Limited security

9

Page 10: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

MapReduceLarge-scale Data Processing� Iterate over a large number of records� Extract something of interest from each� Shuffle and sort intermediate results� Aggregate intermediate results� Generate final output

MapReduce Provides� Automatic parallelization & distribution� Fault-tolerance� Status and monitoring tools� A clean abstraction for programmers

10

Page 11: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

MapReduce

11

Page 12: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

MapReduce

12

Page 13: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

MapReduce

Benefits � Transparent� Fault tolerance� Scalable� Load Balanced

Limitations� Scheduling Control� Coding Complexity� Hard Disk I/O

Data Parallel Computing on General Directed Acyclic Graphs� Dryad (Microsoft)� Tez (Apache)

13

Page 14: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Applications based on MR

Apache Pig� High Level Programming Language

Apache Hive� SQL Operation on HDFS

Apache Flume� Transfer Continuous Log or Event Data

Apache Sqoop� Transfer Data from RDBMS

ETL� Extract� Transform� Load

14

Page 15: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Other Data Warehouse Tools

� Spark SQL/Shark (Based on Spark)� Dremel (Google)� Drill (Apache)� Impala (Cloudera)…

15

Page 16: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

In-Memory Processing

� Speed� Capacity & Cost� Fault Tolerance

100 NS

CPU

Core Core

L1 Cache L1 Cache

L2 Cache L2 Cache

L3 Cache

Main Memory

Disk

0.5 NS

7.0 NS

15.0 NS

SSD: 150K NSHD: 10M NS

16

Page 17: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

In-Memory Processing

How To Share Memory?� DSM (Distributed Shared Memory)� Distributed Key-Value Stores

High Cost for Fault Tolerance!

17

Page 18: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Spark

RDDs (Resilient Distributed Datasets)� Parallel Actions Only (Map, Filter, Join, etc.)� Rebuild by Action Logs

18

Page 19: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Spark

Tradeoff of Spark

19

Page 20: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Stream Processing

Streaming Data� Volume� Velocity

F(X +ΔX)=F(X)op H(ΔX)

20

Page 21: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Stream ProcessingRequirements of Stream Processing� Realtime� Fault Tolerance (Data/System)

• Stream (K-V Tuple)• Spout• Bolt• Topology

21

Page 22: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

NoSQL Database

RDBMS vs NoSQLBig Data� Variety� Sparse

� Key-Value� Column-Oriented� Document-Oriented

22

Page 23: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

NoSQL Database

RDBMS� Atomicity� Consistency� Isolation� Durability

NoSQL� Basically Available� Soft-state� Eventually Consistent

CAP� Consistency� Availability� Partition Tolerances

23

Page 24: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

NoSQL Database

� HBase� Cassandra� MongoDB� Accumulo� Redis…

� Tradeoff between consistency and availability.� Weak with complex SQL operations.

24

Page 25: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

ELK

Elasticsearch� Real-Time Full-Text Search

Logstash� Data Collect, Transform and Transport

Kibana� Analysis and Visualization

25

Page 26: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Graph DatabaseOrganic Growth -> Scale Free

26

Page 27: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Graph Database

Social Recommendation

Graph Query Examples

MATCH (person:Person)-[:IS_FRIEND_OF]->(friend), (friend)-[:LIKES]->(restaurant), (restaurant)-[:LOCATED_IN]->(loc:Location),(restaurant)-[:SERVES]->(type:Cuisine)WHERE person.name = ‘Philip’ AND loc.location=‘New York’ AND type.cuisine=‘Sushi’RETURN restaurant.name

27

Page 28: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Graph Analysis

• Page Rank• Triangle Counting• Connected Components• Shortest Distance• Random Walk• Graph Coarsening• Graph Coloring• Minimum Spanning Forest• Community Detection• Collaborative Filtering• Belief Propagation• Named Entity Recognition…

28

Page 29: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Large-Scale Machine LearningTasks–Classification [Predictive]–Clustering [Descriptive]–Association Rule Discovery [Descriptive]–Sequential Pattern Discovery [Descriptive]–Regression [Predictive]–Deviation Detection [Predictive]…

29

Page 30: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Large-Scale Machine LearningFor Specific Algorithms:� YahooLDA (Latent Dirichlet Allocation)� Caffe (Convolutional Neural Network)� Torch → TensorFlow (Tensor Mathematics)…

General Platform� Mahout� Spark MLlib� DMLC @ CMU� Petuum @ CMU� DMTK @ MSRA…

30

Page 31: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Visualization

•Prefuse•Google Refine•Tableau•R•Processing•D3 (JS)•ColorBrewer…

31

Page 32: Introduction of Big Data Tools · using on-hand database management tools or traditional data processing applications. IDC: Big data technologies describe a new generation of technologies

NSLab, RIIT, Tsinghua Univ

Thank you!

Questions?

32