TRANSCRIPT
4/17/13
Distributed Data Management Summer Semester 2013
TU Kaiserslautern
Dr.-Ing. Sebastian Michel
[email protected]
Distributed Data Management, SoSe 2013, S. Michel 1
MOTIVATION AND OVERVIEW Lecture 1
Distributed Data Management
• What does “distributed” mean?
• And why would we want/need to do things in a distributed way?
Reason: Federated Data
• Data is per se hosted at different sites
• Autonomy of sites
• Maintained by different organizations
• Mashups over such independent sources
• Linked Open Data (LOD)
Reason: Sensor Data
• Data originates at different sensors
• Spread across the world
• Health data from mobile devices
Continuous queries!
Reason: Network Monitoring

IP           Bytes in kB
192.168.1.7  31 kB
192.168.1.3  23 kB
192.168.1.4  12 kB

IP           Bytes in kB
192.168.1.8  81 kB
192.168.1.3  33 kB
192.168.1.1  12 kB

IP           Bytes in kB
192.168.1.4  53 kB
192.168.1.3  21 kB
192.168.1.1  9 kB

IP           Bytes in kB
192.168.1.1  29 kB
192.168.1.4  28 kB
192.168.1.5  12 kB

E.g., find clients that cause high network traffic.
Reason: Individuals as Providers/Consumers
• Don’t want a single operator with global knowledge -> better decentralized?
• Distributed search engines
• Data on mobile phones
• Peer-to-Peer (P2P) systems
• Distributed social networks
• Leveraging idle resources
Example: SETI@Home
• Distributed Computing
• Donate idle time of your personal computer
• Analyze extraterrestrial radio signals when the screensaver is running
Example: P2P Systems: Napster
[Figure: clients publish file statistics to the central server; file downloads run directly between peers]
• Central server (index)
• Client software sends information about users’ contents to the server.
• Users send queries to the server.
• Server responds with IPs of users that store matching files.
→ Peer-to-Peer file sharing!
• Developed in 1998.
• First P2P file-sharing system
Pirate-to-Pirate?
Example: Self-Organization & Message Flooding
[Figure: a query is flooded through the overlay; the TTL starts at 3 and is decremented at each hop until it reaches 0]
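The flooding shown in the figure is a TTL-bounded broadcast; a small sketch on a hypothetical toy graph (the topology is made up for illustration):

```python
from collections import deque

def flood(graph, start, ttl):
    """TTL-bounded flooding: forward the query to all neighbors,
    decrementing the TTL at each hop; returns the set of reached nodes."""
    reached = {start}
    frontier = deque([(start, ttl)])
    while frontier:
        node, t = frontier.popleft()
        if t == 0:
            continue  # TTL exhausted, stop forwarding from here
        for neighbor in graph[node]:
            if neighbor not in reached:
                reached.add(neighbor)
                frontier.append((neighbor, t - 1))
    return reached

g = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}
print(sorted(flood(g, 1, 2)))  # [1, 2, 3, 4] — node 5 is 3 hops away
```

The TTL bounds the message overhead but also means distant peers may never see the query, which is the classic weakness of unstructured flooding.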
Example: Structured Overlay Networks
• Logarithmic cost with routing tables (not shown here)
• Self-organizing
• Will see later twice: NoSQL key-value stores and P2P systems
[Figure: ring of peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 storing keys k10, k24, k30, k38, k54]
Reason: Size
4/17/13
3
Showcase Scenario
• Assume you have 10 TB of data on disk
• Now, do some analysis of it
• With a 100 MB/s disk, reading alone takes
  – 100,000 seconds
  – 1,666 minutes
  – ~28 hours
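The back-of-the-envelope numbers above can be reproduced directly (a quick sketch; decimal units, 1 TB = 10^6 MB, as the slide assumes):

```python
# Sequential read time for 10 TB at 100 MB/s (decimal units)
data_mb = 10 * 1000 * 1000   # 10 TB expressed in MB
throughput_mb_s = 100        # disk read throughput in MB/s

seconds = data_mb / throughput_mb_s
minutes = seconds / 60
hours = minutes / 60
print(seconds, minutes, hours)  # 100000.0 s, ~1666.7 min, ~27.8 h
```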
Huge Amounts of Data
• Google:
  – Billions of websites (around 50 billion, Spring 2013)
  – TBs of data
• Twitter:
  – Hundreds of millions of tweets per day
• CERN’s LHC:
  – 25 petabytes of data per year
Huge Amounts of Data (2)
• Megaupload
  – 28 PB of data
• AT&T (US telecommunications provider)
  – 30 PB of data through its networks each day
• Facebook
  – 100 PB Hadoop cluster
http://en.wikipedia.org/wiki/Petabyte
Need to do something about it
http://flickr.com/photos/jurvetson/157722937/
http://www.google.com/about/datacenter
Scale-Out vs. Scale-Up
• Scale-Out (many servers -> distributed)
• As opposed to Scale-Up
Scale-Out
• Common technique is scale-out
  – Many machines
  – Amazon’s EC2 cloud, around 400,000 machines
• Commodity machines (many, but not individually super fast)
• Failures happen virtually at any time.
• Electricity is an issue (particularly for cooling)
http://huanliu.wordpress.com/2012/03/13/amazon-data-center-size/
Hardware Failures
• Lots of machines (commodity hardware) → failure is not the exception but very common
• P[machine fails today] = 1/365
• n machines: P[failure of at least 1 machine] = 1 - (1 - P[machine fails today])^n
  – for n=1: 0.0027
  – for n=10: 0.02706
  – for n=100: 0.239
  – for n=1000: 0.9356
  – for n=10,000: ~1.0
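The listed values follow directly from the formula; a quick check (assuming independent failures at p = 1/365 per machine and day):

```python
# P[at least one of n machines fails today] = 1 - (1 - p)^n
p = 1 / 365
for n in (1, 10, 100, 1000, 10000):
    p_any_failure = 1 - (1 - p) ** n
    print(f"n={n:>5}: {p_any_failure:.4f}")
```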
Failure Handling & Recovery
• Hardware failures happen virtually at any time
• Algorithms/infrastructures have to compensate for that
• Replication of data, logging of state, also redundancy in task execution
Cost Numbers (=> Complex Cost Model)
• L1 cache reference: 0.5 ns
• L2 cache reference: 7 ns
• Main memory reference: 100 ns
• Compress 1K bytes with Zippy: 10,000 ns
• Send 2K bytes over 1 Gbps network: 20,000 ns
• Read 1 MB sequentially from memory: 250,000 ns
• Round trip within same datacenter: 500,000 ns
• Disk seek: 10,000,000 ns
• Read 1 MB sequentially from network: 10,000,000 ns
• Read 1 MB sequentially from disk: 30,000,000 ns
• Send packet CA -> Netherlands -> CA: 150,000,000 ns
Numbers source: Jeff Dean
1 ns = 10^-6 ms
Map Reduce
• “Novel” computing paradigm introduced by Google in 2004.
• Have many machines in a data center.
• Don’t want to care about implementation details like data placement, failure handling, cost models.
• Abstract computation to two basic functions:
• Think “functional programming” with map and fold (reduce), but
  – Distributed and
  – Large scale
Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150
Map Reduce: Example Map + Count
• Line 1
  – “One ring to rule them all, one ring to find them,”
• Line 2
  – “One ring to bring them all and in the darkness bind them.”
Map Line to Terms and Counts
{"one"=>["1", "1"], "ring"=>["1", "1"], "to"=>["1", "1"], "rule"=>["1"], "them"=>["1", "1"], "all"=>["1"], "find"=>["1"]}
{"one"=>["1"], "ring"=>["1"], "to"=>["1"], "bring"=>["1"], "them"=>["1", "1"], "all"=>["1"], "and"=>["1"], "in"=>["1"], "the"=>["1"], "darkness"=>["1"], "bind"=>["1"]}
Line 1
Line 2
Group by Term
{"one"=>["1", "1"], "ring"=>["1", "1"], …}
{"one"=>["1"], "ring"=>["1"], …}
{"one"=>[["1", "1"], ["1"]], "ring"=>[["1", "1"], ["1"]], …}
Sum Up
{"one"=>[["1", "1"], ["1"]], "ring"=>[["1", "1"], ["1"]], …}
{"one"=>["3"], "ring"=>["3"], …}
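The whole map → group-by-term → sum pipeline can be imitated in a few lines of Python (a single-process toy; a real MapReduce framework runs the same three phases distributed over many machines):

```python
from collections import defaultdict

lines = [
    "one ring to rule them all one ring to find them",
    "one ring to bring them all and in the darkness bind them",
]

# Map phase: emit one (term, 1) pair per term occurrence
mapped = [(term, 1) for line in lines for term in line.split()]

# Shuffle / group-by-term phase: collect all counts per term
groups = defaultdict(list)
for term, count in mapped:
    groups[term].append(count)

# Reduce phase: sum up the counts per term
counts = {term: sum(values) for term, values in groups.items()}
print(counts["one"], counts["ring"])  # 3 3
```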
Application: Computing PageRank
• Link analysis model proposed by Brin & Page
• Compute authority scores
• In terms of:
  – incoming links (weights) from other pages
• “Random surfer model”
S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf. 1998.
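At its core, PageRank is a simple fixed-point iteration; a minimal power-iteration sketch on a made-up three-page graph (damping factor 0.85 and uniform teleportation are the usual choices; real web graphs need the MapReduce-style machinery above):

```python
# Toy link graph: page -> pages it links to (made up for illustration)
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.85  # damping factor of the random surfer model

# Start with a uniform distribution over pages
pr = {page: 1 / len(links) for page in links}

for _ in range(50):  # power iteration until (approximate) convergence
    new = {page: (1 - d) / len(links) for page in links}
    for page, outs in links.items():
        for target in outs:
            new[target] += d * pr[page] / len(outs)
    pr = new

print({page: round(score, 3) for page, score in pr.items()})
```

Since every page distributes its full score along its out-links, the scores remain a probability distribution (they sum to 1).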
New Requirements
• Map Reduce is one prominent example showing that novel businesses have new requirements.
• Going away from traditional RDBMS.
• Addressing huge data volumes, processed in multiple, distributed (widespread) data centers.
New Requirements (Cont’d)
• Massive amounts of unstructured (text) data
• Often processed in batches (with MapReduce).
• Huge graphs like Facebook’s friendship graph
• Often enough to store (key, value) pairs
• No need for RDBMS overhead
• Often wanted: open source, or at least not bound to a particular commercial product (vendor).
Wish List
• Data should always be consistent
• The provided service should always respond quickly to requests
• Data can be (is) distributed across many machines (partitions)
• Even if some machines fail, the system should be up and running
4/17/13
6
CAP Theorem (Brewer's Theorem)
• A system cannot provide all 3 properties at the same time:
  – Consistency
  – Availability
  – Partition Tolerance
[Figure: CAP triangle — pick two of Consistency (C), Availability (A), Partition Tolerance (P): C+A, C+P, or A+P]
http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
With Huge Data Sets ….
• Partition tolerance is strictly required
• That leaves trading off consistency and availability
Best effort: BASE
• Basically Available
• Soft State
• Eventual Consistency
see http://www.allthingsdistributed.com/2007/12/eventually_consistent.html
W. Vogels. Eventually Consistent. ACM Queue vol. 6, no. 6, December 2008.
The NoSQL “Movement”
• No one-size-fits-all
• Not only SQL (not necessarily “no” SQL at all)
• Term for a group of non-traditional DBMS (not relational, often no SQL), for different purposes:
  – key-value stores
  – graph databases
  – document stores
Example: Key-Value Stores
• Like Apache Cassandra, Amazon’s Dynamo, Riak
• Handling of (K, V) pairs
• Consistent hashing of values to nodes based on their keys
• Simple CRUD operations (create, read, update, delete) (no SQL, or at least not full SQL)
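Consistent hashing as used by such stores can be sketched as follows (a minimal ring without virtual nodes or replication; MD5 and the node names are just illustrative choices):

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    # Map a string onto the ring (MD5 used here purely for illustration)
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hashing ring: a key is served by the first
    node whose position is >= the key's position (wrapping around)."""
    def __init__(self, nodes):
        self.positions = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        pos = ring_hash(key)
        i = bisect.bisect(self.positions, (pos, ""))
        return self.positions[i % len(self.positions)][1]

ring = Ring(["node-1", "node-2", "node-3"])
print(ring.node_for("user:42"))
```

The point of the ring structure is that adding or removing a node only remaps the keys on the arc adjacent to it, which is why these systems rebalance cheaply.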
Criticisms
• Some DB folks say “Map Reduce is a major step backward”.
• And NoSQL is too basic and will end up re-inventing DB standards (once they need them).
• We will ask in a few weeks: What do you think?
Cloud Computing
• On-demand hardware
  – rent your computing machinery
  – virtualization
• Google App Engine, Amazon AWS, Microsoft Azure
  – Infrastructure as a Service (IaaS)
  – Platform as a Service (PaaS)
  – Software as a Service (SaaS)
Cloud Computing (Cont’d)
• Promises “no” startup cost for your own business in terms of hardware you need to buy
• Scalability: just rent more machines when you need them
• And return them when there is no demand
• Prominent showcase: Animoto, in Amazon’s EC2. From 50 to 3,500 machines in a few days.
• But also problematic:
  – fully dependent on a vendor’s hardware/service
  – sensitive data (all your data) is with the vendor, maybe (likely) stored in a different country
Dynamic Big Data
• Scalable, continuous processing of massive data streams
• Twitter’s Storm, Yahoo! (now Apache) S4
http://storm-project.net/
Last but not least: Fallacies of Distributed Computing
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
source: Peter Deutsch and others at Sun
LECTURE: CONTENT & REGULATIONS
What you will learn in this Lecture
• Most of the lecture is on processing big data
  – Map Reduce, NoSQL, Cloud Computing
• Will operate on state-of-the-art research results and tools
• Middle way between pure systems/tools discussion and learning how to build algorithms on top of them (see joins over MR, n-grams, etc.)
• But also basic (important) techniques, like consistent hashing, PageRank, Bloom filters
• Very relevant stuff. Think “CV” ;)
• We will critically discuss techniques (philosophies).
Prerequisites
• Successfully attended information systems or database lectures.
• Practical exercises require solid Java skills
• Working with systems/tools requires the will to dive into APIs and installation procedures
• VL 1 (18. April): Motivation, Regulations, Big Data
• VL 2 (25. April): Map Reduce 1
• VL 3 (02. Mai): Map Reduce 2
• No Lecture (09. Mai) (Himmelfahrt, Ascension)
• VL 4 (16. Mai): NoSQL 1
• VL 5 (23. Mai): NoSQL 2
• No Lecture (30. Mai) (Fronleichnam, Corpus Christi)
• VL 7 (06. June): Cloud Computing
• VL 8 (13. June): Stream Processing
• VL 9 (20. June): Distributed RDBMS 1
• VL 10 (27. June): Distributed RDBMS 2
• VL 11 (04. July): Peer-to-Peer Systems
• VL 12 (11. July): Open Topic 1
• VL 13 (18. July): Last Lecture / Oral Exams
Schedule of Lectures (Topics Tentative)
Lecturer and TA
• Lecturer: Sebastian Michel (Uni Saarland)
  – smichel (at) mmci.uni-saarland.de
  – Building E 1.7, Room 309 (Uni Saarland)
  – Phone: 0681 302 70803
  – or better, catch me after the lecture!
• TA: Johannes Schildgen
  – schildgen (at) cs.uni-kl.de
  – Room: 36/340
Organization & Regulations
• Lecture:
  – Thursday, 11:45 - 13:15
  – Room 48-379
• Exercise:
  – Tuesday (bi-weekly), 15:30 - 17:00
  – Room 52-203
  – First session: May 7th.
Lecture Organization
• New lecture (almost all slides are new).
• On topics that are often brand new.
• Later topics are still tentative.
• Please provide feedback. E.g., too slow / too fast? Important topics you want to be addressed?
Exercises
• Assignment sheet, every two weeks
• Sheet + TA session by Johannes Schildgen
• Mixture of:
  – Practical: Implementation (e.g., Map Reduce)
  – Practical: Algorithms on “paper”
  – Theory: Where appropriate (show that …)
  – Brief essay: Explain the difference of x and y (short summary)
• Active participation wanted! :)
Exam
• Oral exam at the end of the semester / early in the semester break.
• Around 20 min
• Topics covered will be announced a few (1-2) weeks before the exams
• We assume you actively participated in the exercises.
Registration
• Please register by email to
  – Sebastian Michel and Johannes Schildgen
  – Use subject prefix: [ddm13]
  – With content:
    • Your name
    • Matriculation number
• In particular to receive announcements/news
BIG DATA
source: Dilbert by Scott Adams (cropped)
(The Big Data Challenge)
What is Big Data?
• Massive amounts of data from a variety of sources
  – Web search logs
  – social networks and blogs
  – RFID and other sensor data
  – sales data
  – scientific data
& it is a big buzzword!
What is Big Data? (Cont’d)
• Big data is often associated with NoSQL and MapReduce tools to process it.
• Processed in and across gigantic data centers
• The term “Big Data” denotes not only size, but also the things we want to / can do with it (benefits)
Traditional Handling
• Data warehousing, e.g., at Walmart, eBay, etc. Also super big and constantly growing.
• But you know your data, know what you are looking for
• Schema is “small” enough to allow human input (admin)
• It is “just” YOUR data
“Simple” Case: Shopping Patterns
• Famous story:
  – statistician at target.com (large retailer in the US)
  – task: figure out that a woman is pregnant, even if she doesn’t want them to know
  – even more: roughly which week/month
  – Why? To sell products!
Read more: e.g., http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=all&_r=0
“Simple” Case: Use of Search Logs
• Swine flu epidemic of 2009
• Google tracks the epidemic by following searches for flu-related topics.
source: Google
What is different now?
• Large amounts of heterogeneous data
• Take all the PBs together, not only your own (→ from TB to PB and EB)
• Manual input by humans hardly scales
• Who could anyway understand complex data and schemas (if there is one)?
• It is now beyond asking SQL queries.
Data Science: What it takes
• many fields touched
  – math, statistics
  – data engineering
  – pattern recognition and learning
  – natural language processing
  – visualization
  – uncertainty modeling
  – data warehousing
  – high performance computing
The BIG Data Challenge: The 4 Vs
• Volume
  – Lots of data
• Velocity
  – Changing / growing data
• Variety
  – Heterogeneity
• Veracity
  – True or not?
Addressed in this lecture
According to Gartner and others.
Example: Trend Mining in Twitter
• Mine trends in text streams (Twitter, RSS feeds, etc.)
• No human input. Massive amounts of noisy, unstructured text data.
• Want to find trends like:
#benedictXVI #re9rement
#schavan #gu^enberg
#armstrong #doping
#cyprus #bankruptcy
Sliding Window Model and Objective
• Data is valid for a certain time
• Now: detect change in co-occurrence, thus an emerging trend!
[Figure: sliding windows over evolving time; tag A and tag B begin to co-occur]
Prediction Model and Trend Ranking
[Plot: correlation over time steps 1-10 together with an exponential-smoothing prediction; the gap between the curves is the prediction error]
§ Intensity of trend as prediction error
§ Exponential smoothing forecast
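The forecasting step can be sketched with plain exponential smoothing (the smoothing factor α and the correlation series are made up for illustration; the lecture's actual model may differ):

```python
def exp_smooth_forecast(series, alpha=0.5):
    """One-step-ahead exponential smoothing: the next forecast is
    alpha * current_observation + (1 - alpha) * previous_forecast."""
    forecast = series[0]
    errors = []
    for x in series[1:]:
        errors.append(abs(x - forecast))  # prediction error = trend intensity
        forecast = alpha * x + (1 - alpha) * forecast
    return forecast, errors

# Co-occurrence correlation of a tag pair over time (made-up values)
corr = [0.1, 0.1, 0.12, 0.5, 0.9]
f, errs = exp_smooth_forecast(corr)
print(errs)  # growing errors at the end signal an emerging trend
```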
Data Sources are Heterogeneous
• super fast, not controlled (noisy) text, little structure
• super fast, structured
• static, structured, administered
… so is the Data
• Music
• Publications
• Health Data
• KB of Entire Wikipedia
Why is Big Data Interesting?
• Novel insights about customers
  – Beyond pure shopping cart analyses and purchase history
  – Beyond running separate surveys/polls
• Social media involvement
• Demographic data
• (Purchase) trend prediction in social media (=> investment)
• Why? Money
Need to be Careful
• Not only are facts often wrong
• Also statistics can reveal wrong clues.
• With enough data you can “tell” anything
Recap of Today’s Lecture
• Teaser for content addressed in coming lectures:
  – Hot topics (Map Reduce, NoSQL, Cloud Computing, Big Data)
  – and fundamental techniques
• Lecture regulations
• Short excerpt on “Big Data”
Next few lectures are on
Map Reduce
Summary: Papers/Books/Articles
• Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150
• W. Vogels. Eventually Consistent. ACM Queue vol. 6, no. 6, December 2008.
• Nancy Lynch and Seth Gilbert. “Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services”. ACM SIGACT News, Volume 33, Issue 2 (2002), pp. 51-59.
• In general, for NoSQL references: http://nosql-database.org/
• Hadoop (Map Reduce): Tom White. Hadoop: The Definitive Guide. 3rd edition.
• http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=all&_r=0