NSLab, RIIT, Tsinghua Univ
Introduction of Big Data Tools
2016.4.6
What is Big Data?
• Wiki: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
• IDC: Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.
Four V’s of Big Data
Framework of Big Data Systems
User Interaction
Processing
Storage
Data Transfer
Collection
Distributed File System
Hadoop Distributed File System
Storage Functions
• Write
• Read
• Append
• Delete
• Modify
Requirements
• Large-Scale Data
• Parallel Processing
• Write Once, Read Many
• Streaming I/O
• Fault Tolerance
Design Features
• Large Chunks
• Metadata in Namenode (Single Master)
• Replication
• Rack Awareness
• Pipelined Write
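The design features above can be illustrated with a toy model. This is a minimal sketch, not the HDFS API: the `NameNode` class, its methods, the rack and node names, and the 128 MB block size are all assumptions for illustration. It shows a single master that stores only metadata, splits a file into large fixed-size blocks, and picks a three-node write pipeline per block with a rack-aware rule (the writer's node, then two nodes on one other rack).

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size; configurable in a real cluster


@dataclass
class NameNode:
    """Toy single master: holds only metadata, never file contents."""
    racks: dict                                    # rack name -> list of datanode names
    block_map: dict = field(default_factory=dict)  # (path, block index) -> replica pipeline

    def place_replicas(self, writer_node):
        """Pick three datanodes: the writer's node, then two nodes on one other rack."""
        local_rack = next(r for r, nodes in self.racks.items() if writer_node in nodes)
        remote_rack = next(
            r for r in self.racks if r != local_rack and len(self.racks[r]) >= 2
        )
        second, third = self.racks[remote_rack][:2]
        return [writer_node, second, third]

    def add_file(self, path, size_bytes, writer_node):
        """Split a file into fixed-size blocks and record one pipeline per block."""
        n_blocks = max(1, -(-size_bytes // BLOCK_SIZE))  # ceiling division
        for i in range(n_blocks):
            self.block_map[(path, i)] = self.place_replicas(writer_node)
        return n_blocks


# Hypothetical two-rack cluster and file path.
nn = NameNode(racks={"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]})
n_blocks = nn.add_file("/logs/a.log", 300 * 1024 * 1024, writer_node="dn1")
pipeline = nn.block_map[("/logs/a.log", 0)]
```

In a pipelined write, the client would send each block to the first node in `pipeline`, which forwards it to the second, and so on; the sketch records only the order.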
Benefits
• Simple design with single master
• Fault tolerance
• Custom designed
Limitations
• Only viable in a specific environment
• Limited security
MapReduce
Large-scale Data Processing
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
MapReduce Provides
• Automatic parallelization & distribution
• Fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
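The five processing steps listed above can be sketched as a single-process word count. This is only an illustration of the programming model, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical, and a real framework would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: extract (word, 1) pairs from one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle and sort: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def run_job(records):
    """Iterate over records, map each one, shuffle, then reduce per key."""
    intermediate = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate))


counts = run_job(["big data tools", "big data systems"])
```

The programmer writes only `map_phase` and `reduce_phase`; grouping, scheduling, and retries are what the framework provides.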
Benefits
• Transparent
• Fault tolerance
• Scalable
• Load Balanced
Limitations
• Scheduling Control
• Coding Complexity
• Hard Disk I/O
Data Parallel Computing on General Directed Acyclic Graphs
• Dryad (Microsoft)
• Tez (Apache)
Applications based on MR
Apache Pig
• High Level Programming Language
Apache Hive
• SQL Operation on HDFS
Apache Flume
• Transfer Continuous Log or Event Data
Apache Sqoop
• Transfer Data from RDBMS
ETL
• Extract
• Transform
• Load
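The three ETL steps can be sketched end to end with the standard library. This is an assumption-laden illustration, not how Sqoop or Hive work internally: in-memory SQLite stands in for both the source RDBMS and the warehouse, and the table names `orders` and `fact_orders` and their columns are hypothetical.

```python
import sqlite3

# Extract: read rows from a source RDBMS (in-memory SQLite as a stand-in).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, country TEXT)")
src.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1250, "cn"), (2, 990, "us"), (3, 0, "cn")],
)
rows = src.execute(
    "SELECT id, amount_cents, country FROM orders ORDER BY id"
).fetchall()

# Transform: drop empty orders, convert cents to units, normalize country codes.
cleaned = [
    (oid, cents / 100.0, country.upper())
    for oid, cents, country in rows
    if cents > 0
]

# Load: write the cleaned rows into the (hypothetical) warehouse table.
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL, country TEXT)")
dst.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
loaded = dst.execute(
    "SELECT country, SUM(amount) FROM fact_orders GROUP BY country ORDER BY country"
).fetchall()
```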
Other Data Warehouse Tools
• Spark SQL/Shark (Based on Spark)
• Dremel (Google)
• Drill (Apache)
• Impala (Cloudera)
…
In-Memory Processing
• Speed
• Capacity & Cost
• Fault Tolerance
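[Figure: access latency in the memory hierarchy. CPU with two cores, each with its own L1 and L2 cache, and one L3 cache]
• L1 Cache: 0.5 ns
• L2 Cache: 7.0 ns
• L3 Cache: 15.0 ns
• Main Memory: 100 ns
• Disk: SSD 150K ns, HD 10M ns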
How To Share Memory?
• DSM (Distributed Shared Memory)
• Distributed Key-Value Stores
High Cost for Fault Tolerance!
Spark
RDDs (Resilient Distributed Datasets)
• Coarse-grained parallel transformations only (map, filter, join, etc.)
• Lost partitions are rebuilt by replaying the logged transformations (lineage)
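The rebuild-from-log idea can be sketched without Spark. The `ToyRDD` class below is hypothetical and single-process; it is not the Spark API. Each dataset remembers only its parent and the operation that produced it, so a lost in-memory copy can be recomputed instead of being replicated.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, defined by its parent and one operation."""

    def __init__(self, source=None, parent=None, op=None):
        self.source = source  # base data, set only on the root dataset
        self.parent = parent  # lineage pointer to the dataset this one derives from
        self.op = op          # function applied to the parent's records
        self.cache = None     # in-memory copy; may be lost at any time

    def map(self, fn):
        return ToyRDD(parent=self, op=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        """Return the records, recomputing from lineage if the cache is gone."""
        if self.cache is None:
            if self.parent is None:
                rows = self.source
            else:
                rows = self.op(self.parent.collect())
            self.cache = list(rows)
        return self.cache


base = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
first = evens_squared.collect()
evens_squared.cache = None         # simulate losing the in-memory partition
rebuilt = evens_squared.collect()  # replayed from the logged operations
```

Because only whole-dataset operations are allowed, the log stays small: one entry per operation rather than one per record update.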
Tradeoff of Spark
Stream Processing
Streaming Data
• Volume
• Velocity
F(X + ΔX) = F(X) op H(ΔX)
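One way to read the formula: the result over the whole stream, F(X + ΔX), is obtained by combining the previous result F(X) with a summary H(ΔX) of only the new records, so old data never has to be rescanned. The sketch below uses a running mean as the example; choosing (count, sum) as the state and element-wise addition as `op` is an assumption for illustration, and the batches are made-up values.

```python
def h(batch):
    """H(ΔX): summarize only the newly arrived records."""
    return (len(batch), sum(batch))


def op(state, delta):
    """op: merge the previous result F(X) with the summary of the new batch."""
    return (state[0] + delta[0], state[1] + delta[1])


state = (0, 0.0)  # F of the empty stream
for batch in [[3.0, 5.0], [4.0], [8.0, 10.0]]:
    state = op(state, h(batch))  # F(X + ΔX) = F(X) op H(ΔX)

count, total = state
mean = total / count
```

Not every function decomposes this way (an exact median does not, for example), which is part of what makes stream processing harder than batch processing.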
Requirements of Stream Processing
• Realtime
• Fault Tolerance (Data/System)
• Stream (K-V Tuple)
• Spout
• Bolt
• Topology
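The four concepts above can be sketched with Python generators. This is a single-process illustration under stated assumptions, not the API of any real stream processor: the function names and the sample log lines are hypothetical, and a real system would run each spout and bolt as parallel tasks with acknowledgement and replay for fault tolerance.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream; emits (key, value) tuples."""
    lines = ["error disk full", "info ok", "error net down"]
    for i, line in enumerate(lines):
        yield (i, line)


def split_bolt(stream):
    """Bolt: consumes tuples and emits one (word, 1) tuple per word."""
    for _, line in stream:
        for word in line.split():
            yield (word, 1)


def count_bolt(stream):
    """Bolt: keeps running counts and emits the updated count for each tuple."""
    counts = Counter()
    for word, n in stream:
        counts[word] += n
        yield (word, counts[word])


# Topology: spout -> split bolt -> count bolt
topology = count_bolt(split_bolt(sentence_spout()))
latest = dict(topology)  # the last count emitted for each word
```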
NoSQL Database
RDBMS vs NoSQL
Big Data
• Variety
• Sparse

• Key-Value
• Column-Oriented
• Document-Oriented
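To make the three data models concrete, the sketch below stores the same two sparse records in each shape using plain Python dictionaries. It does not use any database client; the field names, the `info:` column prefix, and the sample values are assumptions for illustration, and "column-oriented" is read here in the wide-column sense (row key mapped to whichever columns exist).

```python
import json

# Two sparse records: each has only some of the possible attributes.
users = [
    {"id": "u1", "name": "Ann", "email": "ann@example.com"},
    {"id": "u2", "name": "Bob", "phone": "555-0100"},
]

# Key-value: an opaque value that can be looked up by key only.
key_value = {u["id"]: json.dumps(u, sort_keys=True) for u in users}

# Column-oriented (wide-column style): row key -> only the columns that exist.
column_store = {
    u["id"]: {f"info:{k}": v for k, v in u.items() if k != "id"} for u in users
}

# Document-oriented: nested documents whose fields can be queried.
documents = [
    {"_id": u["id"], "profile": {k: v for k, v in u.items() if k != "id"}}
    for u in users
]

with_phone = [d["_id"] for d in documents if "phone" in d["profile"]]
```

None of the three shapes reserves space for a missing attribute, which is why they suit sparse, varied data better than a fixed relational schema with many NULL columns.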
RDBMS (ACID)
• Atomicity
• Consistency
• Isolation
• Durability
NoSQL (BASE)
• Basically Available
• Soft-state
• Eventually Consistent
CAP
• Consistency
• Availability
• Partition Tolerance
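The BASE properties can be illustrated with two toy replicas that keep accepting writes during a partition and converge afterwards. This is a minimal sketch, not the protocol of any particular database: the `Replica` class, the last-write-wins rule, and the explicit timestamps are all assumptions for illustration.

```python
class Replica:
    """Toy replica: stores (timestamp, value) per key; the newest write wins."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        if key not in self.data or ts > self.data[key][0]:
            self.data[key] = (ts, value)

    def read(self, key):
        return self.data.get(key, (None, None))[1]

    def sync_from(self, other):
        """Anti-entropy: pull the other replica's entries, keeping the newest."""
        for key, (ts, value) in other.data.items():
            self.write(key, value, ts)


a, b = Replica(), Replica()
a.write("cart", ["book"], ts=1)         # accepted by A (basically available)
b.write("cart", ["book", "pen"], ts=2)  # accepted by B during the partition
stale = a.read("cart")                  # soft state: A still serves the old value

a.sync_from(b)                          # the partition heals
b.sync_from(a)
converged = a.read("cart") == b.read("cart")
```

Both replicas stayed available during the partition at the cost of a temporarily stale read, which is the consistency versus availability tradeoff that CAP describes.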
• HBase
• Cassandra
• MongoDB
• Accumulo
• Redis
…

• Tradeoff between consistency and availability.
• Weak with complex SQL operations.
ELK
Elasticsearch
• Real-Time Full-Text Search
Logstash
• Data Collection, Transformation and Transport
Kibana
• Analysis and Visualization
Graph Database
Organic Growth -> Scale Free
Social Recommendation
Graph Query Examples
MATCH (person:Person)-[:IS_FRIEND_OF]->(friend),
      (friend)-[:LIKES]->(restaurant),
      (restaurant)-[:LOCATED_IN]->(loc:Location),
      (restaurant)-[:SERVES]->(type:Cuisine)
WHERE person.name = 'Philip' AND loc.location = 'New York' AND type.cuisine = 'Sushi'
RETURN restaurant.name
Graph Analysis
• Page Rank
• Triangle Counting
• Connected Components
• Shortest Distance
• Random Walk
• Graph Coarsening
• Graph Coloring
• Minimum Spanning Forest
• Community Detection
• Collaborative Filtering
• Belief Propagation
• Named Entity Recognition
…
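As an example of the first item in the list, PageRank can be computed by power iteration on an adjacency list. This is a small single-machine sketch, not a distributed graph engine: the `pagerank` function, the damping factor of 0.85, the fixed 50 iterations, and the four-node graph are assumptions for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration on a dict that maps each node to the nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node gets a base share from the random-jump term.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = graph[v] or nodes  # dangling node: spread its rank evenly
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link graph: node "c" is linked to by every other node.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
top = max(ranks, key=ranks.get)
```

Each iteration touches every edge once, so on large graphs the work per iteration is usually partitioned across machines by vertex.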
Large-Scale Machine Learning
Tasks
– Classification [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
– Regression [Predictive]
– Deviation Detection [Predictive]
…
For Specific Algorithms:
• YahooLDA (Latent Dirichlet Allocation)
• Caffe (Convolutional Neural Network)
• Torch → TensorFlow (Tensor Mathematics)
…
General Platform
• Mahout
• Spark MLlib
• DMLC @ CMU
• Petuum @ CMU
• DMTK @ MSRA
…
Visualization
• Prefuse
• Google Refine
• Tableau
• R
• Processing
• D3 (JS)
• ColorBrewer
…
Thank you!
Questions?