Distributed Data Management Summer Semester 2015 TU Kaiserslautern Prof. Dr.-Ing. Sebastian Michel Databases and Information Systems Group (AG DBIS) http://dbis.informatik.uni-kl.de/


Page 1: Distributed Data Management - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture1.pdf · consistent hashing, PageRank, Bloom filters Distributed Data Management,

Distributed Data Management

Summer Semester 2015

TU Kaiserslautern

Prof. Dr.-Ing. Sebastian Michel

Databases and Information Systems Group (AG DBIS)

http://dbis.informatik.uni-kl.de/


The Big Data Era: Some Numbers

Estimated Size of Data

• Google: 15 000 PB (= 15 Exabytes)

• Facebook: 300 PB

• Ebay: 90 PB

• Spotify: 10 PB

Data Processed per Day

• Google: 100 PB

• Ebay: 100 PB

• NSA: 29 PB

• Facebook: 600 TB

• Twitter: 100 TB

• Spotify: 2.2 TB

Units

• MB = 10^6 Bytes

• GB = 10^9 Bytes

• TB (Terabyte) = 10^12 Bytes

• PB (Petabyte) = 10^15 Bytes

• EB (Exabyte) = 10^18 Bytes

Distributed Data Management, SoSe 2015, S. Michel 2


What Does the Data Look Like?

• Not necessarily what you are used to from database lectures: usually not nicely structured relations (in BCNF or 3NF) with known schema information.

• But:

– Twitter Tweets

– Server Access Logs

– Web Pages

– Web Graph

– Huge CSV files in general (e.g., holding a “relation”)


{"created_at":"Wed Jan 21 15:21:04 +0000 2015","id":557920823764586496,"id_str":"557920823764586496","text":"#TulsaAirport #Oklahoma Jan 21 08:53 Temperature 37\u00b0F clouds Wind NW 7 km\/h Humidity 85% .. http:\/\/t.co\/SnC8ST3gQC","source":"\u003ca href=\"http:\/\/www.woweather.com\/USA\/TulsaIAP.htm\" rel=\"nofollow\"\u003eupdate weather tulsa\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":255167921,"id_str":"255167921","name":"Weather Tulsa","screen_name":"wo_tulsa","location":"Tulsa","url":"http:\/\/itunes.apple.com\/app\/weatheronline\/id299504833?mt=8","description":"Weather Tulsa\n\nhttp:\/\/www.woweather.com\/USA\/Tulsa.htm","protected":false,"verified":false,"followers_count":111,"friends_count":60,"listed_count":5,"favourites_count":0,"statuses_count":33805,"created_at":"Sun Feb 20 20:31:42 +0000 2011","utc_offset":7200,"time_zone":"Athens","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1249942071\/WO-20px-linien_normal.png","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1249942071\/WO-20px-linien_normal.png","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"TulsaAirport","indices":[0,13]},{"text":"Oklahoma","indices":[14,23]}],"trends":[],"urls":[{"url":"http:\/\/t.co\/SnC8ST3gQC","expanded_url":"http:\/\/bit.ly\/188eNcw","display_url":"bit.ly\/188eNcw","indices":[93,115]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1421853664710"}
{"created_at":"Wed Jan 21 15:21:04 +0000 2015","id":557920823877464064,"id_str":"557920823877464064","text":"Anime episode updated: Kyoukai no Kanata: Mini Theater # 6 ( http:\/\/t.co\/kjEPWveEHM ) #MalUpdater","source":"\u003ca href=\"http:\/\/www.malupdater.com\" rel=\"nofollow\"\u003eMal Updater\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1049083842,"id_str":"1049083842","name":"OriginGenesis",


Big Data


The BIG Data Challenge: The 4 Vs

• Volume

– Lots of data

• Velocity

– Changing / growing data

• Variety

– Heterogeneity

• Veracity

– True or not?


Addressed in this lecture

According to Gartner and others.


Showcase: Critical Volume

• Assume you have 10 TB of data on disk

• Now, do some analysis of it

• With a 100 MB/s disk, reading alone takes

– 100,000 seconds

– about 1,667 minutes

– about 27.8 hours
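The read-time figures can be checked with a short calculation. The following Python sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and purely sequential reads; the function name and the `disks` parameter are illustrative and not part of the lecture. The second call shows the idealized effect of spreading the same data over many disks, ignoring any coordination overhead.

```python
TB = 10**12  # bytes, decimal units as on the slide
MB = 10**6   # bytes


def sequential_read_seconds(data_bytes, throughput_bytes_per_s, disks=1):
    """Seconds needed to read data_bytes split evenly over `disks` disks read in parallel."""
    return data_bytes / (throughput_bytes_per_s * disks)


seconds = sequential_read_seconds(10 * TB, 100 * MB)
print(f"{seconds:.0f} s = {seconds / 60:.0f} min = {seconds / 3600:.1f} h")

# Idealized scale-out: the same 10 TB spread over 1000 disks read in parallel.
print(f"{sequential_read_seconds(10 * TB, 100 * MB, disks=1000):.0f} s")
```

Under these assumptions a single disk needs more than a day, while a thousand disks would need under two minutes, which is the motivation for the scale-out approach.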


Need to do something about it

source: http://flickr.com/photos/jurvetson/157722937/

http://www.google.com/about/datacenter


Scale-out

• Many machines (hundreds, thousands)

• As opposed to scale-up, where one very powerful (single) server is used


Data Centers


source: http://www.google.com/about/datacenters/inside/index.html


Hardware Failures

• Lots of machines (commodity hardware)

=> failure is not an exception but very common

• P[machine fails today] = 1/365

• n machines: P[failure of at least 1 machine] = 1-(1-P[machine fails today])^n

– for n=1: 0.0027

– for n=10: 0.02706

– for n=100: 0.239

– for n=1000: 0.9356

– for n=10 000: ~ 1.0
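The formula can be evaluated directly. This is a minimal Python sketch that assumes machines fail independently of each other (an assumption implicit in the formula) and uses the per-day failure probability of 1/365 from the slide; the function name is illustrative.

```python
def p_at_least_one_failure(n, p_single=1 / 365):
    """P[at least one of n machines fails today], assuming independent failures."""
    return 1 - (1 - p_single) ** n


for n in (1, 10, 100, 1000, 10_000):
    print(f"n={n:>6}: {p_at_least_one_failure(n):.4f}")
```

The printed values agree with the list above up to rounding of the last digit.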


source: google.com


Fallacies of Distributed Computing

1. The network is reliable

2. Latency is zero

3. Bandwidth is infinite

4. The network is secure

5. Topology doesn't change

6. There is one administrator

7. Transport cost is zero

8. The network is homogeneous


source: Peter Deutsch and others at Sun


Failure Handling & Recovery

• Hardware failures happen virtually at any time

• Algorithms/infrastructures have to compensate for that

• Replication of data, logging of state, also redundancy in task execution


Cost Numbers (=> Complex Cost Model)

• L1 cache reference 0.5 ns

• L2 cache reference 7 ns

• Main memory reference 100 ns

• Compress 1K bytes with Zippy 10,000 ns

• Send 2K bytes over 1 Gbps network 20,000 ns

• Read 1 MB sequentially from memory 250,000 ns

• Round trip within same datacenter 500,000 ns

• Disk seek 10,000,000 ns

• Read 1 MB sequentially from network 10,000,000 ns

• Read 1 MB sequentially from disk 30,000,000 ns

• Send packet CA->Netherlands->CA 150,000,000 ns


Numbers source: Jeff Dean

1 ns = 10^-6 ms
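To make the three "Read 1 MB sequentially" entries easier to compare, they can be converted into milliseconds per MB and an implied throughput. The Python sketch below only restates the listed numbers in other units; the dictionary and function names are illustrative.

```python
# "Read 1 MB sequentially from ..." latencies in nanoseconds, taken from the list above.
NS_PER_MB = {
    "memory": 250_000,
    "network": 10_000_000,
    "disk": 30_000_000,
}


def throughput_mb_per_s(ns_per_mb):
    """Implied sequential throughput in MB/s (1 s = 10^9 ns)."""
    return 1e9 / ns_per_mb


for medium, ns in NS_PER_MB.items():
    # 1 ns = 10^-6 ms, so dividing by 10^6 converts nanoseconds to milliseconds.
    print(f"{medium:>8}: {ns / 1e6:g} ms per MB, about {throughput_mb_per_s(ns):.0f} MB/s")
```

With these figures, sequential disk reads are two orders of magnitude slower than sequential memory reads, which is one reason a realistic cost model has to distinguish where data resides.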


What you will learn in this Lecture

• Most of the lecture is on processing big data

– MapReduce, NoSQL, Cloud computing

• We will work with state-of-the-art research results and tools

• Middle way between pure systems/tools discussion and learning how to build algorithms on top of them (see Joins over MR, n-grams, etc.)

• But also basic, fundamental techniques, like consistent hashing, PageRank, Bloom filters


Lecture Contents (Tentative)

• MapReduce

– Fundamentals

– Various algorithms on top of it

• NoSQL approaches

– E.g., Key/Value Stores

– And techniques/theory behind them (e.g., CAP theorem, BASE)

• (Distributed) Data Stream Processing

• Cloud Computing and Big Data in general


Prerequisites

• Successful attendance of the information systems lecture or a similar database lecture.

• Knowledge of standard math/CS topics, e.g., probability theory and Java/C++ coding.

• Working with systems/tools requires the willingness to dive into APIs and installation procedures


People

• Lecturer:

– Prof. Sebastian Michel

– smichel (at) cs.uni-kl.de

• Teaching Assistants:

– MSc. Evica Milchevski and MSc. Kiril Panev

– milchevski (at) cs.uni-kl.de, panev (at) cs.uni-kl.de


http://dbis.informatik.uni-kl.de/


Organization & Regulations

Lecture:

Thursday

15:30 – 17:00

Room 42-110 (with at least one exception)

Exercise:

Tuesday (bi-weekly)

15:30 - 17:00

Room 46-210 (again, with one exception)

First session: April 28


Lecture Organization

• Pretty new Lecture

• On topics that are often brand new.

• Later topics are still tentative.

• Please provide feedback. E.g., too slow / too fast? Important topics you want to have covered?


Exercises

• Assignment sheet, every two weeks

• Mixture of:

– Practical: Implementation (e.g., MapReduce)

– Practical: Algorithms on “paper”

– Theory: Where appropriate (show that …)

– Brief Essay: Explain the difference between x and y (short summary)

• Need to successfully participate to be admitted to final exam

• Regulations on next slides


Regulations for Admission to Exam

• Successful participation in exercise sessions

• There will be 6 exercise sheets

• Each comprises 3 mandatory assignments

• No handing in of solutions, instead:

– Tutor asks at beginning of TA session to mark on a sheet the assignments you have solved and can present


Name              Assignment 1    Assignment 2    Assignment 3

John Doe

Britney Clinton

…


Regulations for Admission to Exam (2)

• Each mark is equivalent to one point

• You need to obtain 13 points throughout the semester to get admitted to the exam

• A full point is given if the solution is correct or close to it

• Zero points are given if the solution proves to be largely incorrect

• Zero points for the entire sheet will be given if you marked an assignment as solved but it is obvious you didn’t really do it (-> cheating)


Exam

• Written or oral exam at the end of teaching period in semester (last week or week thereafter)

• Everything mentioned in the lecture or exercises is relevant for the exam, unless explicitly stated otherwise.

• We assume you actively participated in the exercises to be prepared.


Note on Credit Points and Work

• Lecture is worth 4 ECTS points

• Each point is assumed to correspond to 30 hours of work

• 4 x 30h = 120h

• 14 weeks, makes around 9h of work each week


Registration

• If not done already, please register through the KIS system

• Registration closes on May 10, 2015

• Without registration, no marks in TA session possible, hence, no exam qualification.


Note on Amazon Grant

• We are grateful for having obtained a grant from Amazon for using (some of) their web services (AWS)

http://aws.amazon.com

• $100 credit for your AWS account.

• To get an AWS account, register with a credit card.

• Send us an email with your registered email address.

• First come, first served (as the number of vouchers is limited).


Note on Amazon Grant (2)

• Due to restrictions in terms of the number of vouchers and the availability of credit cards, there won’t be any mandatory assignment on AWS.

• Just see it as a possibility to get to know AWS if you want to (and have a credit card).

• Local installations of relevant tools like MapReduce and NoSQL stores are possible anyway.

• Check out this virtual machine with lots of stuff already installed: http://hortonworks.com/hdp/downloads/


Literature (Books)

• Pramodkumar J. Sadalage, Martin Fowler. NoSQL Distilled. Addison Wesley, 2012.

• Eric Redmond, Jim R. Wilson. Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement.

• Stefan Edlich et al. NoSQL: Einstieg in die Welt nichtrelationaler Web 2.0 Datenbanken. Carl Hanser Verlag, 2011. (in German)

• Tom White. Hadoop: The Definitive Guide. O’Reilly, 2012.


Literature (Books) (Cont’d)

Books on standard database topics

• R. Elmasri, S. B. Navathe. Fundamentals of Database Systems. Addison Wesley, 2006.

• R. Ramakrishnan, J. Gehrke. Database Management Systems. McGraw-Hill, 2002.

• H. Garcia-Molina, J. D. Ullman, J. Widom. Database Systems: The Complete Book. Prentice Hall, 2008.


Specific Literature

• Specific literature will be given throughout the lecture.

• Primarily by pointers to original research articles


MAPREDUCE (MR)


MR Motivation: Word Count


The Elwedritsch is a cryptid or mythical creature that supposedly inhabits the Palatinate of Germany. It is described as being a chicken-like creature with antlers. It also has scales instead of feathers. However, it is said that their wings are of little use. That is why they live mainly in underbrush and under vines. Sometimes Elwetritschen are depicted with antlers of a stag and their beaks often appear to be very long. In the second half of the 20th century, artists increasingly portrayed Elwetritschen as female by adding breasts. Elwetritschen supposedly originate from crossbreeding chickens, ducks, and geese with mythical wood creatures such as goblins and elves. Being a fowl, they naturally lay eggs, which as a result of descending from forest spirits, grow during breeding season. Eggs in various sizes are artistically depicted at the Elwetritschenbrunnen in Neustadt an der Weinstraße. Geographical Distribution: The area in which tales of the Elwetritsch are spread expands from the Palatinate Forest in the west of Germany towards the east across the Upper Rhine Plain to the southern parts of the Odenwald. The mythical creature also appears in the north of Baden-Württemberg. In the Main-Tauber-Kreis, where they are known as “Ilwedridsche”, the children are told that at night the creatures sleep in the crowns of the willow trees standing next to the river Tauber. In Neustadt an der Weinstraße, which is said to be the “capital” of the Elwetritsches, there is an Elwetritsche-fountain, created by Gernot Rumpf. Other sources consider Dahn in the southwestern Palatinate, which also has an Elwetritsche-fountain, Erfweiler or other villages as secret capitals of these creatures. The idea is very similar to the "snipe hunt." The Elwetritsch is supposedly very shy, but also very curious. A hunting party consists of a "Fänger" (catcher), equipped with a big potato sack and a lantern, and the "Treiber" (beaters).
The catcher is led into the woods where the Elwetritsch is supposed to live, instructed to wait in a clearing with his sack and lantern, while the beaters go off, supposedly to flush out the Elwetritsch. The light of the lantern is said to be attractive to the curious creature, so it will come to investigate and will then be caught by the catcher. While he waits, everyone heads back to the pub or wherever the party had previously assembled, to wait for the catcher to realize he has been fooled

Imagine this file is several TB or PB in size!


Imagine this file is several TB or PB in size but chunked up and spread across many machines!


MR: Scale-out Architecture

• Many machines (hundreds, thousands)

• Data is spread across machines

• Processing tasks initiated (ideally) where data resides


Screenshot of the HDFS (Hadoop/MR) filesystem UI, showing info on a large file of Twitter tweets/updates, stored in 509 blocks (chunks) over several machines


Map and Reduce: Key Idea

• Spread the task of processing data across machines

• According to map and reduce rules/functions

• No need to deal with node failures, load balancing, etc.; the system takes care of this.

• Map phase: Data is distributed to a number of machines. Output is partitioned (grouped) by a key (e.g., a term)

• Reduce: For each key-group, data is aggregated (reduced)


Map Reduce from High Level

[Figure: DATA is split across several MAP tasks; their Intermediate Results are passed to several REDUCE tasks, each of which produces a Result]


Brief History of MapReduce

• First described in an article in 2004.

– MapReduce paradigm and how it is used at Google (Google File System, etc.)

– Paper by J. Dean and S. Ghemawat in 2004.

• Many MapReduce implementations

• Hadoop is arguably the most prominent one

• We will look at MR in general and Hadoop specifically

Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150


Architectural Issues

• Data lies in a distributed file system

• Block based, big chunks (usually 64MB or 128MB)

• Chunks are replicated and distributed over machines

• If possible, data processing is moved to data hosting machines.
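As a rough illustration of the block-based layout, the Python sketch below computes how many blocks a file occupies and assigns each block's replicas to machines. The round-robin placement and the replication factor of 3 are assumptions made for this example only; real distributed file systems use more elaborate placement policies. All names are hypothetical.

```python
MiB = 2**20
GiB = 2**30


def num_blocks(file_size_bytes, block_size_bytes):
    """Number of blocks needed to store the file (the last block may be partially filled)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def place_replicas(n_blocks, nodes, replication=3):
    """Toy placement: replicas of block b go to `replication` consecutive nodes (round-robin)."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(n_blocks)
    }


print(num_blocks(10 * GiB, 64 * MiB))   # block count with 64 MB blocks
print(num_blocks(10 * GiB, 128 * MiB))  # block count with 128 MB blocks

placement = place_replicas(4, ["n0", "n1", "n2", "n3", "n4"])
for block, replicas in placement.items():
    print(block, replicas)
```

Because every block lives on several machines, a scheduler usually has more than one candidate node on which a map task can read its input locally.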


Functional Programming: Map

Expression: map

Of type: (a -> b) -> [a] -> [b]

Definition:

• map f [] = []

• map f (x:xs) = f x : map f xs

Example (using Hugs98 Haskell):

• map (\x-> x*x) [1,2,3,4]

[1,4,9,16]
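For readers less familiar with Haskell, the recursive definition of map can be transcribed into Python. This is a sketch that mirrors the two defining equations (empty list, and head followed by rest); `my_map` is an illustrative name, and in practice Python's built-in `map` or a list comprehension would be used instead.

```python
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def my_map(f: Callable[[A], B], xs: list[A]) -> list[B]:
    """Recursive map, mirroring: map f [] = [] and map f (x:xs) = f x : map f xs."""
    if not xs:
        return []
    return [f(xs[0])] + my_map(f, xs[1:])


print(my_map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
```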


Map

Observation:

Execution of function f can be done fully in parallel!

Then: Output is aggregated (reduced).
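Since each application of f depends only on its own input element, the calls can be handed to independent workers. The Python sketch below uses a thread pool from the standard library purely to illustrate this independence; in CPython, threads do not speed up CPU-bound pure-Python code, and a MapReduce system would distribute the calls over machines instead.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x: int) -> int:
    """The function f applied to every element; it needs no shared state."""
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map returns results in input order, regardless of completion order.
    squares = list(pool.map(square, [1, 2, 3, 4]))

# Aggregation (reduce) step over the mapped output.
total = sum(squares)
print(squares, total)
```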


Functional Programming: Reduce (aka. fold)

Expression : foldl (note: there is also foldr=right)

Of type : (a -> b -> a) -> a -> [b] -> a

Definition:

• foldl f z [] = z

• foldl f z (x:xs) = foldl f (f z x) xs

Example:

• foldl (+) 0 [1,2,3,4,5]

15
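The same left fold can be written in Python. The sketch below gives an explicit loop version (the name `foldl` is reused only for readability) next to the standard library's `functools.reduce`, which also folds from the left when given an initial value.

```python
from functools import reduce
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def foldl(f: Callable[[A, B], A], z: A, xs: list[B]) -> A:
    """Left fold: combines the accumulator with each element, from left to right."""
    acc = z
    for x in xs:
        acc = f(acc, x)
    return acc


print(foldl(lambda a, b: a + b, 0, [1, 2, 3, 4, 5]))   # 15
print(reduce(lambda a, b: a + b, [1, 2, 3, 4, 5], 0))  # 15

# Left-associativity matters for non-commutative functions: ((0 - 1) - 2) - 3
print(foldl(lambda a, b: a - b, 0, [1, 2, 3]))          # -6
```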


Note on “Functional Programming”

• What was commonly restricted to functional programming languages is becoming more and more “standard”

• Python, Ruby, Scala (Java++), Clojure, C#, C++(11)

• Example, in Ruby:

[1,2,3,4,5].map{|x| x**2 } => [1, 4, 9, 16, 25]

[1,2,3,4,5].inject(0){|x,a| x+a} => 15


Going Distributed: Key Principle

• Many data chunks

• Map function on each of the chunks

• Map process outputs data with keys

=> Partitions based on keys

• Aggregate (fold/reduce) mapped data per key

• E.g., count the number of occurrences of each term in a set of documents.


Map Reduce from High Level

[Figure: DATA is split across several MAP tasks; their Intermediate Results are passed to several REDUCE tasks, each of which produces a Result]


Map and Reduce: Types

• Map (k1,v1) → list(k2,v2)

• Reduce (k2, list(v2)) → list(k3, v3)

• For instance:

– k1= document identifier

– v1= document content

– k2= term

– v2=count


– k3= term

– v3= final count

keys allow grouping data to machines/tasks
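The type signatures can be made concrete with a small single-process simulation. The Python sketch below is not how Hadoop or Google's system is implemented; `run_mapreduce`, `wc_map` and `wc_reduce` are hypothetical names, and grouping by key is done with an in-memory dictionary instead of a distributed shuffle. It instantiates the types with the word-count example (k1 = document identifier, v1 = content, k2 = k3 = term, v2 = count, v3 = final count).

```python
from collections import defaultdict
from typing import Callable, Iterable, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")

# Map: (k1, v1) -> list(k2, v2)          Reduce: (k2, list(v2)) -> list(k3, v3)
Mapper = Callable[[K1, V1], Iterable[tuple[K2, V2]]]
Reducer = Callable[[K2, list[V2]], Iterable[tuple[K3, V3]]]


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every record, group its output by key, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    return [out for k2, values in groups.items() for out in reducer(k2, values)]


def wc_map(doc_id: str, content: str):
    """Emit (term, 1) for every term of the document."""
    for term in content.lower().split():
        yield term, 1


def wc_reduce(term: str, counts: list[int]):
    """Sum all counts collected for one term."""
    yield term, sum(counts)


docs = [("d1", "a rose is a rose"), ("d2", "a daisy")]
result = dict(run_mapreduce(docs, wc_map, wc_reduce))
print(result)
```

In a distributed setting, the dictionary `groups` corresponds to the partitioning of intermediate results by key, so that all values for one key end up at the same reduce task.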


Move Computation to Data

• Data is stored in a distributed file system (for Google: GFS=Google File System)

• Large chunks (blocks)

• Master node of GFS knows locations

• Can (and should!) initiate computation at such nodes


Computation (Workflow)

• A master node controls the computation

– this is where you submit your job (task)

– computes the necessary map and reduce tasks

– selects and activates worker nodes

• Worker nodes

– for map: selected, if possible, close to the data

– for reduce: consume the intermediate results and create the final output


Example: Grep

• Given: file

• Want: all lines that contain certain pattern

• Map(String key, String value):

    if value.contains(pattern):
        emit(value, "")

• This is a map-only task (no reducer, no grouping by key): the output is written directly to the distributed file system.
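The grep mapper above could look as follows in plain Python. This is a local sketch, not Hadoop code: `run_map_only` is a made-up stand-in for the framework, the line number is assumed as the input key, and the pattern is treated as a plain substring, as `contains` suggests.

```python
from typing import Iterator, List, Tuple


def grep_map(key: int, value: str, pattern: str) -> Iterator[Tuple[str, str]]:
    """Map function: key is the line number, value is the line content."""
    if pattern in value:
        yield (value, "")


def run_map_only(lines: List[str], pattern: str) -> List[str]:
    """Apply the mapper to every line; there is no grouping and no reducer."""
    output = []
    for line_number, line in enumerate(lines):
        for out_key, _ in grep_map(line_number, line, pattern):
            output.append(out_key)
    return output


chunk = [
    "One ring to rule them all,",
    "one ring to find them,",
    "and in the darkness bind them.",
]
matches = run_map_only(chunk, "ring")
```

Each chunk can be filtered independently of all others, which is why no reduce phase is needed.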


MapReduce: Example Map + Count

• Data Part 1

– “One ring to rule them all, one ring to find them,”

• Data Part 2

– “One ring to bring them all and in the darkness bind them.”


Map Line to Terms and Counts

Line 1:

{"one"=>["1", "1"], "ring"=>["1", "1"], "to"=>["1", "1"], "rule"=>["1"], "them"=>["1", "1"], "all"=>["1"], "find"=>["1"]}

Line 2:

{"one"=>["1"], "ring"=>["1"], "to"=>["1"], "bring"=>["1"], "them"=>["1", "1"], "all"=>["1"], "and"=>["1"], "in"=>["1"], "the"=>["1"], "darkness"=>["1"], "bind"=>["1"]}


Group by Term

Line 1: {"one"=>["1", "1"], "ring"=>["1", "1"], …}

Line 2: {"one"=>["1"], "ring"=>["1"], …}

Grouped: {"one"=>[["1","1"],["1"]], "ring"=>[["1","1"],["1"]], …}


Sum Up

Grouped: {"one"=>[["1","1"],["1"]], "ring"=>[["1","1"],["1"]], …}

Summed: {"one"=>["3"], "ring"=>["3"], …}
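The three steps (map each line, group by term, sum up) can be replayed locally on the two data parts. This is a sketch under assumptions: the function names are made up, the tokenization (lowercase, letters only) is one possible choice, and integers are used instead of the strings "1" shown on the slides.

```python
import re
from collections import defaultdict
from typing import Dict, List

parts = [
    "One ring to rule them all, one ring to find them,",
    "One ring to bring them all and in the darkness bind them.",
]


def map_line(line: str) -> Dict[str, List[int]]:
    """Map: emit a 1 for every term occurrence in one data part."""
    emitted: Dict[str, List[int]] = defaultdict(list)
    for term in re.findall(r"[a-z]+", line.lower()):
        emitted[term].append(1)
    return dict(emitted)


def group_by_term(map_outputs: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Group: collect, per term, the value lists from all map outputs."""
    grouped: Dict[str, List[List[int]]] = defaultdict(list)
    for output in map_outputs:
        for term, ones in output.items():
            grouped[term].append(ones)
    return dict(grouped)


def sum_up(grouped: Dict[str, List[List[int]]]) -> Dict[str, int]:
    """Reduce: add up all values of each term."""
    return {term: sum(sum(ones) for ones in lists) for term, lists in grouped.items()}


mapped = [map_line(part) for part in parts]
grouped = group_by_term(mapped)
counts = sum_up(grouped)
```

Here `grouped["one"]` is `[[1, 1], [1]]` and `counts["one"]` is 3, matching the values on the slides.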


Example: Wordcount

Map(String key, String value):
    for each word w in value:
        emit(w, 1)

Reduce(String key, Iterator values):
    int result = 0
    for each v in values:
        result += v
    emit(key, result)

Note: what to count also depends on the context, e.g.,

– overall occurrences of a word in the collection

– or the number of documents in which a word occurs

– or the number of sentences in the collection in which a word occurs

– …
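The first two counting contexts from the note differ only in the mapper. The sketch below is an illustration with made-up names (`map_occurrences`, `map_documents`, `run`) and a tiny hypothetical collection; `run` simulates grouping and reducing in memory.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words (letters only)."""
    return re.findall(r"[a-z]+", text.lower())


def map_occurrences(doc_id: str, content: str) -> Iterator[Tuple[str, int]]:
    """One pair per occurrence: yields overall occurrences in the collection."""
    for word in tokenize(content):
        yield (word, 1)


def map_documents(doc_id: str, content: str) -> Iterator[Tuple[str, int]]:
    """One pair per distinct word: yields the number of documents containing it."""
    for word in sorted(set(tokenize(content))):
        yield (word, 1)


def reduce_sum(key: str, values: List[int]) -> Tuple[str, int]:
    """Same reducer for both variants: sum the values of one key."""
    return (key, sum(values))


def run(docs: Dict[str, str],
        map_fn: Callable[[str, str], Iterator[Tuple[str, int]]]) -> Dict[str, int]:
    """Simulate map, group by key, and reduce in memory."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for doc_id, content in docs.items():
        for word, one in map_fn(doc_id, content):
            grouped[word].append(one)
    return dict(reduce_sum(word, values) for word, values in grouped.items())


docs = {
    "d1": "one ring to rule them all, one ring to find them",
    "d2": "one ring to bring them all",
}
```

With this collection, `run(docs, map_occurrences)["one"]` is 3, whereas `run(docs, map_documents)["one"]` is 2.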


Example: Inverted Index

• Given: set of documents

• Want: A -> list of document ids in which A occurs, for each term A

• How can this be computed in MapReduce?

A → D61, D12, D43, D49
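One possible answer, sketched in Python: the mapper emits (term, document id) pairs and the reducer collects the ids of each term. The function names and the three tiny documents are hypothetical; `build_inverted_index` only simulates the grouping that the framework would perform.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_doc(doc_id: str, content: str) -> Iterator[Tuple[str, str]]:
    """Emit (term, doc_id) once for every distinct term of the document."""
    for term in set(content.split()):
        yield (term, doc_id)


def reduce_term(term: str, doc_ids: List[str]) -> Tuple[str, List[str]]:
    """Collect the document ids of one term into a sorted posting list."""
    return (term, sorted(set(doc_ids)))


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Simulate map, group by term, and reduce in memory."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc_id, content in docs.items():
        for term, emitted_id in map_doc(doc_id, content):
            grouped[term].append(emitted_id)
    return dict(reduce_term(term, ids) for term, ids in grouped.items())


# Hypothetical documents: the content is just a list of terms.
docs = {"D12": "A B", "D43": "A C", "D61": "A B C"}
index = build_inverted_index(docs)
```

Sorting the posting lists is a design choice of this sketch; it is not required by the slide.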


Example: Inverted Index

• Why useful?

– Consider Google-style query: A B C

– How to find relevant documents? Parse through all? No.

– Which documents are relevant for the result? Check the pre-computed inverted index:

A → D61, D12, D43, D49

B → D31, D52, D61, D49

C → D43, D61, D98, D31
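Assuming the query A B C is meant conjunctively (every term must occur, which the slide does not state explicitly), the lookup is an intersection of the three posting lists. The function name `conjunctive_query` is made up for this sketch.

```python
from typing import Dict, List, Set

# Posting lists as shown above.
index: Dict[str, List[str]] = {
    "A": ["D61", "D12", "D43", "D49"],
    "B": ["D31", "D52", "D61", "D49"],
    "C": ["D43", "D61", "D98", "D31"],
}


def conjunctive_query(terms: List[str]) -> Set[str]:
    """Return the documents that contain every query term."""
    postings = [set(index.get(term, [])) for term in terms]
    return set.intersection(*postings) if postings else set()
```

For the query A B C only D61 appears in all three lists. Only three posting lists are read; no document has to be parsed at query time.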


Example: Co-occurrences

• Given: text file

• Want: for terms a and b, how often do a and b occur close together, e.g., within the same sentence?

• That is, output = ([a,b], count)

• How can this be computed?


Example: Co-occurrences (Cont’d)

• Solution 1: “pairs” approach

– mapper for string s:

• for all term pairs (a,b) in s: emit({a,b}, 1)

– reducer just aggregates the counts

• Solution 2: “stripes” approach

– mapper for string s:

• for each term a in s: collect all t_i that co-occur with a

• emit (a, {t_1, t_2, …, t_n})

– reducer aggregates
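Both solutions can be compared side by side in a small in-memory sketch. Assumptions made here: a string s is one sentence split on whitespace, a pair is counted once per sentence, the unordered pair {a,b} is represented as a sorted tuple, and the stripe of a is stored as a `Counter` so that the reducer can add stripes element-wise. All function names are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Iterator, List, Tuple


def map_pairs(sentence: str) -> Iterator[Tuple[Tuple[str, str], int]]:
    """Pairs approach: emit ({a, b}, 1) for every pair of distinct terms."""
    terms = sorted(set(sentence.split()))
    for a, b in combinations(terms, 2):
        yield ((a, b), 1)


def map_stripes(sentence: str) -> Iterator[Tuple[str, Counter]]:
    """Stripes approach: emit (a, stripe) with all terms that co-occur with a."""
    terms = set(sentence.split())
    for a in sorted(terms):
        yield (a, Counter(terms - {a}))


def reduce_pairs(sentences: List[str]) -> Dict[Tuple[str, str], int]:
    """Sum the 1s per pair."""
    counts: Counter = Counter()
    for sentence in sentences:
        for pair, one in map_pairs(sentence):
            counts[pair] += one
    return dict(counts)


def reduce_stripes(sentences: List[str]) -> Dict[str, Dict[str, int]]:
    """Add the stripes of each term element-wise."""
    stripes: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        for a, stripe in map_stripes(sentence):
            stripes[a].update(stripe)
    return {a: dict(stripe) for a, stripe in stripes.items()}


sentences = ["a b c", "a b"]
pair_counts = reduce_pairs(sentences)
stripe_counts = reduce_stripes(sentences)
```

The pairs mapper emits many small records, while the stripes mapper emits fewer but larger ones; both lead to the same counts, e.g., `pair_counts[("a", "b")]` equals `stripe_counts["a"]["b"]`.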


Code: WordCount in Hadoop (Excerpt)

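public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reusable output objects: the constant count 1 and the current word.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line into tokens and emit (word, 1) for each token.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}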


Code: WordCount in Hadoop (Excerpt)

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all counts received for this word and emit (word, total).
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Source: http://wiki.apache.org/hadoop/WordCount


Additional Combiner

• The map phase might output large amounts of data that could already be reduced locally

• This matters because network bandwidth is often the limiting factor

• Works for functions like max: max(1,2,6,2,1,9) = max(max(1,2,6), max(2,1,9))

• Add a combiner to be run on the map output

• Usually the same code as the reducer

• Not a replacement for the reducer (as it sees only local information!)
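The effect of a combiner can be illustrated with word count. This is an in-memory sketch with made-up function names; each input line stands for the input of one map task, and the length of the pair lists stands for the data that would be sent over the network.

```python
from collections import Counter
from typing import Dict, List, Tuple


def map_words(line: str) -> List[Tuple[str, int]]:
    """Map: emit (word, 1) for every word of one input line."""
    return [(word, 1) for word in line.split()]


def combine(pairs: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Combiner: pre-aggregate the output of ONE map task (local information only)."""
    local: Counter = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def reduce_all(all_pairs: List[Tuple[str, int]]) -> Dict[str, int]:
    """Reducer: final aggregation over the output of ALL map tasks."""
    total: Counter = Counter()
    for word, count in all_pairs:
        total[word] += count
    return dict(total)


lines = ["one ring one ring", "one ring"]
without_combiner = [pair for line in lines for pair in map_words(line)]
with_combiner = [pair for line in lines for pair in combine(map_words(line))]
```

In this example six pairs are sent without the combiner and four with it, while the reducer output is the same in both cases. The reducer is still needed, because each combiner only sees the words of its own map task.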


Combiner Caveats

• Note that some aggregates can’t be done locally:

– e.g., output if sum(value) > threshold. Why? The combiner can’t decide whether the threshold is crossed because it sees only local information.

• Note: this application still makes a good case for a combiner, but the combiner should just sum up the local values and not “prune” based on the threshold. So it differs from the final reducer.

– if the aggregation function is not associative “((x*y)*z=x*(y*z))” and commutative “(x*y=y*x)”

– also problematic: average (but this can be fixed: the reducer then also needs to know the number of items)
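The average case can be checked with a few numbers. In this sketch (function names and values are made up) the naive combiner forwards only a local average, whereas the fixed combiner forwards the local sum together with the number of items.

```python
from typing import List, Tuple


def combine_avg_naive(values: List[float]) -> float:
    """Naive combiner: local average (the number of items is lost)."""
    return sum(values) / len(values)


def combine_sum_count(values: List[float]) -> Tuple[float, int]:
    """Fixed combiner: forward the local sum together with the item count."""
    return (sum(values), len(values))


def reduce_avg(partials: List[Tuple[float, int]]) -> float:
    """Reducer: divide the total sum by the total number of items."""
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


# Values for one key, as seen by two different map tasks.
part_1, part_2 = [1.0, 2.0, 6.0], [9.0]

true_average = sum(part_1 + part_2) / len(part_1 + part_2)
average_of_averages = (combine_avg_naive(part_1) + combine_avg_naive(part_2)) / 2
fixed_average = reduce_avg([combine_sum_count(part_1), combine_sum_count(part_2)])
```

The true average is 18 / 4 = 4.5, but averaging the two local averages gives (3 + 9) / 2 = 6.0. With (sum, count) pairs the reducer obtains 4.5 again.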
