www.helsinki.fi
Algorithms and Systems
on big data management
Lecturer: Jiaheng Lu
Fall 2016
24.2.2017 1
Faculty of Science
We are in the era of big data
• Lots of data is being collected
• Web data, e-commerce
• Bank/Credit Card transactions
• Social Network
• Scientific data
Data sizes
• Byte (B)
• Kilobyte (KB)
• Megabyte (MB)
• Gigabyte (GB)
• Terabyte (TB)
• Petabyte (PB)
• Exabyte (EB)
• Zettabyte (ZB)
• Yottabyte (YB)
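Each unit in this list is a fixed multiple of the one before it. The slide does not say whether it means decimal (1000) or binary (1024) multiples, so the sketch below assumes decimal multiples by default and exposes the base as a parameter. The function name `human_readable` is hypothetical.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes, base=1000):
    """Format a byte count with the largest unit that keeps the value below `base`."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at YB if the value is huge.
        if value < base or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= base

print(human_readable(500))
print(human_readable(10**15))
```

Passing `base=1024` gives the binary interpretation instead.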
How much data?
• Google processes 100 PB a day on 3 million servers
• Facebook has 300 PB of user data + 500 TB/day
• YouTube has 1000 PB of video storage
• SMS messages: 6.1 TB per year
• US credit cards: 1.4B cards, 20B transactions/year
• In 2009, total data was about 1 ZB; in 2020, it is estimated to reach 35 ZB.
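The last bullet implies a compound growth rate, which can be worked out directly. The sketch below uses only the two figures from the slide (about 1 ZB in 2009, an estimated 35 ZB in 2020) and assumes smooth year-over-year growth, which is a simplification. The function name is hypothetical.

```python
def annual_growth_rate(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1

# Figures from the slide: about 1 ZB in 2009, an estimated 35 ZB in 2020.
rate = annual_growth_rate(start=1.0, end=35.0, years=2020 - 2009)
print(f"Implied growth: {rate:.1%} per year")
```

Under these assumptions the total volume grows by roughly 38% each year, which means it doubles a little more often than every two and a half years.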
Types of Data
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Image data, audio data, video data
• Graph Data
• Social Network, Semantic Web (RDF), …
• XML and JSON data
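To make the difference between some of these data types concrete, the sketch below encodes one made-up fact ("alice follows bob since 2016") as a relational row, a JSON document and a graph. The names and values are invented for illustration only.

```python
import json

# Relational: a fixed-schema row in a "follows" table (follower, followee, since).
relational_row = ("alice", "bob", 2016)

# JSON: a self-describing, possibly nested document.
json_document = json.dumps(
    {"user": "alice", "follows": [{"user": "bob", "since": 2016}]}
)

# Graph: nodes plus labelled edges, stored here as an adjacency list.
graph = {"alice": [("follows", "bob")], "bob": []}

# All three encodings answer the same question: whom does alice follow?
from_row = relational_row[1]
from_json = json.loads(json_document)["follows"][0]["user"]
from_graph = graph["alice"][0][1]
print(from_row, from_json, from_graph)
```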
Four V’s
• Volume, Variety, Velocity and Veracity
• Watch two videos about big data
• What is big data
• https://www.youtube.com/watch?v=PlaJsseTgk4
• Explaining big data
• https://www.youtube.com/watch?v=dX7kit8jkjo
Outline
• About the course
• Practical information and requirements
• Course topics
• Our schedule
At the end of the course
• You should be able to tell what these terms stand for: Hadoop, MapReduce, Bigtable, and more…
After this course
• Students are expected to
  • Select a data model that suits the characteristics of their data
  • Understand sketch techniques for handling streaming data, including Count-Min, Count-Sketch and FM Sketch
  • Understand GFS, MapReduce and Bigtable technology
  • Gain hands-on experience with Hadoop MapReduce
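Of the sketch techniques named on this slide, Count-Min is the shortest to illustrate. The code below is a minimal sketch, not the course's reference implementation: the table dimensions and the SHA-256-based hashing are illustrative assumptions, and the class and method names are hypothetical.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter that uses depth * width integer cells."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one cell in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch(width=200, depth=4)
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))  # never less than the true count of 3
```

The estimate can overcount when items collide in every row, but it never undercounts, and memory use stays fixed no matter how long the stream is.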
Workflow of this course
• Exercises 1, 2 and 3 (Exercise 3 is one programming task)
• Study groups 1 and 2
• Final examination
• Dates shown on the timeline: 9 November, 15 November, 23 November, 29 November, 13 December
Topics of this course
• Introduction to big data and data models (2 weeks)
• NoSQL databases (1 week)
• Sketch algorithms (2 weeks)
• GFS, MapReduce and Bigtable (2 weeks)
Introduction to big data engineering
• Five steps in big data engineering
• 1. Acquire data
• 2. Prepare data
• 3. Analyze data
• 4. Report data
• 5. Act
Step 1. Acquire data
• Identify data set
• Retrieve data or buy data
Step 2. Prepare data
• 2.1 Explore data
  • Understand the nature of data
  • Preliminary analysis
• 2.2 Preprocess data
  • Clean, integrate and package
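The two sub-steps of data preparation can be sketched on a tiny in-memory data set. Everything here is assumed for illustration: the records, the field names `id` and `temp_c`, and the cleaning rules (drop duplicates and unparseable values) are not from the lecture.

```python
from statistics import mean

raw = [
    {"id": 1, "temp_c": "21.5"},
    {"id": 2, "temp_c": ""},        # missing value
    {"id": 3, "temp_c": "22.1"},
    {"id": 3, "temp_c": "22.1"},    # duplicate record
    {"id": 4, "temp_c": "n/a"},     # unparseable value
]

def explore(records):
    """2.1 Explore: count rows, empty values and distinct ids."""
    missing = sum(1 for r in records if not r["temp_c"].strip())
    return {"rows": len(records), "missing": missing,
            "distinct_ids": len({r["id"] for r in records})}

def preprocess(records):
    """2.2 Preprocess: drop duplicates and unparseable rows, convert types."""
    cleaned, seen = [], set()
    for r in records:
        if r["id"] in seen:
            continue
        try:
            value = float(r["temp_c"])
        except ValueError:
            continue  # empty or non-numeric reading
        seen.add(r["id"])
        cleaned.append({"id": r["id"], "temp_c": value})
    return cleaned

print(explore(raw))
clean = preprocess(raw)
print(clean, mean(r["temp_c"] for r in clean))
```

Exploring first tells you which cleaning rules are needed; here it reveals one empty value and one repeated id before any record is discarded.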
Step 3. Analyze data
• Select analytical techniques
• Build model
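As one small instance of selecting a technique and building a model, the sketch below fits a straight line by ordinary least squares. The choice of linear regression and the observations are assumptions made for illustration; the lecture does not prescribe a particular technique.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # made-up observations
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```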
Step 4. Communicate results
• Visualization and summary
Step 5. Apply results
• The five steps above form an iterative process
Key technical challenges for big data
• 1. Enable scalability
  • Commodity hardware is cheap
• 2. Handle fault tolerance
  • Be ready, crashes happen
• 3. Optimize performance
• 4. Provide value
What is Hadoop?
• Apache top level project, open-source
implementation of frameworks for reliable, scalable,
distributed computing and data storage.
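Hadoop's best-known computing framework is MapReduce, which this course covers later. Its programming model can be imitated on a single machine in plain Python. The sketch below is not Hadoop's API; the function names and the two input splits are made up, and a real job would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

splits = ["big data is big", "data about data"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because each split is mapped independently and each key is reduced independently, both phases can be spread over many workers without coordination beyond the shuffle.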
Hadoop’s Developers
2005: Doug Cutting and Michael J. Cafarella developed Hadoop
to support distribution for the Nutch search engine project.
The project was funded by Yahoo.
2006: Yahoo gave the project to the Apache Software Foundation.
Google Origins
• 2003: Google File System (GFS) paper
• 2004: MapReduce paper
• 2006: Bigtable paper
Some Hadoop Milestones
• 2008 - Hadoop wins the Terabyte Sort Benchmark (sorted 1 terabyte of data in 209 seconds, compared to the previous record of 297 seconds)
• 2010 - Hadoop's HBase, Hive and Pig subprojects completed, adding more computational power to the Hadoop framework
• 2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha released; Ambari, Cassandra and Mahout added
• 2016 - Hadoop 3.0.0 Alpha-1
Hadoop File System
The Hadoop Distributed File System (HDFS) follows a distributed file system design and runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and is designed to run on low-cost hardware.
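HDFS gets its fault tolerance on cheap hardware largely by splitting files into blocks and storing several replicas of each block on different machines. The toy model below illustrates that idea only. It is not HDFS code: the round-robin placement is a simplification of the real placement policy, the node names are invented, and the replication factor of 3 and block size of 128 units merely mirror commonly used HDFS defaults.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into blocks and assign each block to `replication` distinct nodes."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Round-robin placement; distinct as long as replication <= len(nodes).
        placement[block] = [nodes[(block + i) % len(nodes)]
                            for i in range(replication)]
    return placement

def readable_after_failure(placement, failed):
    """A file stays readable if every block keeps at least one live replica."""
    return all(any(node not in failed for node in replicas)
               for replicas in placement.values())

nodes = ["n1", "n2", "n3", "n4", "n5"]
placement = place_blocks(file_size=1000, block_size=128, nodes=nodes)
print(len(placement), "blocks")
print(readable_after_failure(placement, failed={"n1", "n2"}))
print(readable_after_failure(placement, failed={"n1", "n2", "n3"}))
```

With three replicas per block, any two node failures leave every block readable in this model, while losing all three holders of the same block does not.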
YARN
YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform.
Apache HBase™
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
Apache Mahout™
The Apache Mahout™ project's goal is to build an environment for quickly creating scalable performant machine learning applications.
Apache Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
Apache Hive
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing.
Oozie
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
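Describing a workflow as a DAG means each action lists the actions it depends on, and a scheduler may run an action only after all of its dependencies have finished. The sketch below shows that ordering with Python's standard `graphlib` module. It does not use Oozie or its workflow definition format, and the action names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "count_visits": {"clean_logs"},
    "count_errors": {"clean_logs"},
    "build_report": {"count_visits", "count_errors"},
}

# static_order() yields every action after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The two counting actions have no dependency on each other, so a scheduler is free to run them in parallel. If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is why the graph must be acyclic.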
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Apache Sqoop
Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Apache Storm
Apache™ Storm adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.
Apache Ambari
The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters.
Wrap-up
• We are in the era of big data
• 4Vs of big data: Volume, Variety, Velocity and Veracity
• There are five steps in big data engineering: Acquire data, Prepare data, Analyze data, Report data and Act
• Hadoop is an open-source implementation of frameworks for reliable, scalable, distributed computing and big data storage.