IN5030 Protocols and routing in the internet

Big Data
The Beckman report on database research.
MapReduce: Simplified data processing on large clusters.

By Priyanka Srinivas Krishna.
28/02/2020



Part 1 - The Beckman report
❏ Every few years a group of database researchers meets to discuss the state of database research, its impact on practice, and important new directions.
❏ This report summarizes the discussion and conclusions of the meeting.
❏ The meeting participants quickly converged on big data as a defining challenge.
❏ Big data arose from the confluence of three major trends:
    ❏ 1) It has become much cheaper to generate a wide variety of data.
    ❏ 2) It has become much cheaper to process large amounts of data.

cont..
    ❏ 3) Data management has become more democratised.
❏ The process of generating, processing, and consuming data is no longer just for database professionals.
❏ Decision makers, domain scientists, application users, journalists, and everyday consumers now routinely do it.
❏ Due to these trends, an unprecedented volume of data needs to be captured, stored, queried, processed, and turned into knowledge.

Content of the Report
● Characteristics of Big Data
● Research Challenges
● Community Challenges
● Conclusion

The 5 Vs: Characteristics of Big Data

Research challenges
1. Scalable big/fast data infrastructure.
2. Coping with diversity in data management.
3. End-to-end processing of data.
4. Cloud services.
5. Role of people in the data life cycle.

Challenge 1: Scalable big data infrastructure

Scalable big data infrastructure
❏ Parallel and distributed processing
    ○ Large-scale distributed file systems and higher-level languages are seeing rapid adoption for processing less structured data, even in traditional enterprises.
❏ Query processing and optimization
    ○ For processing big data, powerful cost-aware query optimizers and set-oriented query execution engines are needed.
❏ New hardware
    ○ In addition to clusters of general-purpose multicore processors, more specialized processors should be considered.

Scalable big data infrastructure cont..
❏ Cost-efficient storage
    ❏ The database research community must learn how best to leverage emerging memory and storage technologies.
❏ High-speed data streams
    ❏ For data that arrives at ever-higher speeds, new scalable techniques for ingesting and processing streams of data will be needed.
❏ Late-bound schemas
    ❏ For data that is persisted but processed just once, it makes little sense to pay the substantial price of storing and indexing it. Instead, it should be stored as a binary file and given structure only when it is read.
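The late-bound schema idea can be sketched roughly as follows: records are appended as raw bytes with no parsing or indexing, and a schema is applied only if and when the data is read. The JSON-lines layout, the field names and the helpers `append_raw` and `read_with_schema` are assumptions made for illustration; the report does not prescribe a format.

```python
import json
import os
import tempfile

def append_raw(path, raw_bytes):
    """Write path: store the record as-is, with no parsing or indexing."""
    with open(path, "ab") as f:
        f.write(raw_bytes + b"\n")

def read_with_schema(path, schema):
    """Read path: apply a schema (field name -> type) only at read time."""
    with open(path, "rb") as f:
        for line in f:
            record = json.loads(line)
            yield {name: cast(record[name]) for name, cast in schema.items()}

# Hypothetical sensor readings, stored raw.
path = os.path.join(tempfile.mkdtemp(), "readings.bin")
append_raw(path, b'{"sensor": "a1", "temp": "21.5", "unit": "C"}')
append_raw(path, b'{"sensor": "b7", "temp": "19.0", "unit": "C"}')

# The schema is chosen by the reader, not fixed when the data was written.
rows = list(read_with_schema(path, {"sensor": str, "temp": float}))
```

The write path stays cheap because nothing is interpreted; a different reader could apply a different schema to the same file.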

Challenge 2: Diversity in data management
❏ No one-size-fits-all.
❏ Cross-platform integration
    ❏ Integration of platforms
    ❏ Hiding heterogeneity
    ❏ Optimization of performance
❏ Programming models
    ❏ Diversity in programming abstractions and reusability
    ❏ Need for more than one language
    ❏ Focus on domain-specific languages
❏ Data processing workflows
    ❏ Platforms that can span both "raw" and "cooked" data.

Challenge 3: End-to-end processing of data
❏ Data-to-knowledge pipeline
    ❏ The steps of the raw-data-to-knowledge pipeline will be largely unchanged: data acquisition; selection, assessment, cleaning, and transformation; extraction and integration; etc.
    ❏ What changes is the greater diversity of data and users.
❏ Tool diversity
    ❏ Multiple tools are needed to solve each step of the raw-data-to-knowledge pipeline.
❏ Tool customizability
    ❏ Tools should be able to exploit domain knowledge, such as dictionaries, knowledge bases, and rules.

Cont..
❏ Hand-crafted rules are needed along with machine learning.
❏ Capturing and managing appropriate meta-information
    ❏ E.g. Facebook automatically identifies faces in an image so users can optionally tag them.
❏ Knowledge base
    ❏ The more knowledge a tool has about a target domain, the better it can analyze the domain.

Challenge 4: Cloud Services


Some of the critical challenges to realise the vision of data PaaS in the cloud
❏ Elasticity
    ❏ Whether the same cloud storage service can support both transactions and analytics.
❏ System administration
    ❏ All administrative tasks must be automated.
    ❏ Resource control parameters must also be set automatically and be highly responsive to changes in load.
❏ Multitenancy
    ❏ The implementation challenge is to ensure performance isolation between tenants, so that a burst of demand from one tenant does not cause a violation of other tenants’ SLAs.

Cont..
❏ Data sharing
    ❏ How to find high-quality data in the cloud.
    ❏ How to share data at fine-grained levels, how to distribute costs when sharing computing and data, and how to price data.
    ❏ How to protect data if the current cloud provider fails, and how to preserve data for the long term when users who need it have no personal or financial connection to those who provide it.
❏ Hybrid cloud
    ❏ Cyber-physical systems involve data streaming from multiple sensors and mobile devices, and must cope with intermittent connectivity and limited battery life, which pose difficult challenges for real-time and perhaps mission-critical data management in the cloud.

Cont..
❏ Hybrid cloud
    ❏ There is a need for interoperation of database services between the cloud and on-premise servers.
    ❏ Users may run applications in their private cloud during normal operation, but tap into a public cloud at peak times or in response to unanticipated workload surges.

Challenge 5: Roles of humans in the data life cycle
❏ Data producers
    ❏ Develop algorithms and incentives that guide people to produce and share the most useful data, while maintaining the desired level of data privacy.
❏ Data curators
    ❏ Obtain high-quality datasets from often-imperfect human curators.

Cont..
❏ Data curators cont..
    ❏ For these people-centric challenges, data provenance and explanation will be crucial, as will privacy and security.
    ❏ We need to build platforms that allow people to curate data easily and extend relevant applications to incorporate such curation.

Cont..
❏ Data consumers
    ❏ People want to use messier data in complex ways, raising many challenges.
    ❏ In the enterprise, data consumers usually know how to ask SQL queries over a structured database.
    ❏ Today’s data consumers may not know how to formulate a query at all, for example, a journalist who wants to “find the average temperature of all cities with population over 100,000 in Florida” over a structured dataset.
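For comparison, this is roughly what the journalist's question looks like once it has been formulated as SQL, which is the step the slide says many data consumers cannot take. The `cities` table, its column names and every sample value are made up for illustration; only the question itself comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (name TEXT, state TEXT, population INTEGER, avg_temp_c REAL)"
)
# Made-up sample rows, not real measurements.
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?)",
    [
        ("City A", "Florida", 450000, 25.0),
        ("City B", "Florida", 390000, 23.0),
        ("City C", "Florida", 5000, 22.0),    # filtered out: population too small
        ("City D", "Georgia", 500000, 17.0),  # filtered out: wrong state
    ],
)

# "Average temperature of all cities with population over 100,000 in Florida"
(avg_temp,) = conn.execute(
    "SELECT AVG(avg_temp_c) FROM cities WHERE state = ? AND population > ?",
    ("Florida", 100000),
).fetchone()
```

Even this small query assumes the user knows the table layout, the column names and the filter syntax, which is the gap that new query interfaces are meant to close.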

Cont.. Data consumers
❏ Enabling people to get such answers themselves requires new query interfaces.
❏ We need multimodal interfaces that combine visualization, querying, and navigation.
❏ When the query to ask is not clear, people need other ways to browse, explore, visualize, and mine the data, to make data consumption easier.

Cont..
❏ Online communities
    ❏ People want to create, share, and manage data with other community members.
    ❏ They may want to collaboratively build community-specific knowledge bases, wikis, and tools to process data.
    ❏ Our challenge is to build tools to help communities produce usable data as well as to exploit, share, and mine it.

Community Challenges
❏ The database field faces many community issues.
❏ Database education
    ❏ The database technology taught in standard database courses today is increasingly disconnected from reality. It is rooted in the 1980s.
    ❏ The database curriculum needs to be rethought.
❏ Data science
    ❏ Data scientists need skills not only in data management, but also in business intelligence, computer systems, mathematics, statistics, machine learning, and optimization.

Community challenges cont..
❏ Research culture
    ❏ Finally, there is much concern over the increased emphasis on citation counts instead of research impact.
    ❏ To pursue the big data agenda effectively, the field needs to return to a state where fewer publications per researcher per time unit is the norm, and where large systems projects, end-to-end tool sets, and data sharing are more highly valued.

Conclusion
❏ It is an exciting time for database research. In the past, database research has been restricted by the rigors of the enterprise and relational database systems.
❏ There are exciting new research challenges related to processing big data; handling data diversity; and exploiting new hardware, software, and cloud-based platforms.
❏ It is also time to rethink approaches to education, involvement with data consumers, and our value system and its impact on how we evaluate research.

Part 2 - MapReduce: Simplified data processing on large clusters
❏ What happens in one internet minute.
❏ Data is growing faster.
❏ When it comes to dealing with a massive amount of data from social media, or any other relevant source, big data analysis is the most favourable option.

cont..
❏ Technologies like Hadoop, YARN, NoSQL, Hive, Spark, etc., are widely used to extract useful insights hidden inside the data.
❏ Here we look at how MapReduce, the core of Hadoop, works.

What is MapReduce
❏ MapReduce is a programming model and an associated implementation for processing and generating big data sets.
❏ It uses parallel rather than serial processing.
❏ It is distributed on a commodity cluster.
❏ In a MapReduce program, Map() and Reduce() are two functions.
    ❏ The Map function performs actions like filtering, grouping and sorting.
    ❏ Reduce is for aggregation.

Programming model
❏ Input & Output: each a set of key/value pairs
❏ Programmer specifies two functions:
❏ map (in_key, in_value) -> list(out_key, intermediate_value)
    ❏ Processes input key/value pair
    ❏ Produces set of intermediate pairs
❏ reduce (out_key, list(intermediate_value)) -> list(out_value)
    ❏ Combines all intermediate values for a particular key
    ❏ Produces a set of merged output values (usually just one)

Example: count word occurrences
❏ Key/value pairs
❏ MapReduce refers to two separate and distinct tasks:
    ❏ The map job
        ❏ Input: input format
        ❏ Output: intermediate format
    ❏ The reduce job
        ❏ Input: intermediate format
        ❏ Output: output format
❏ Intermediate value iterator
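The word-count example can be sketched in a few lines that follow the map/reduce signatures of the programming model. This is a single-process illustration, not Google's or Hadoop's API: `run_sequential` is a hypothetical stand-in for the library, and the two sample documents are made up.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """map(in_key, in_value) -> list of (out_key, intermediate_value)."""
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    """reduce(out_key, iterator of intermediate values) -> list of out_value."""
    return [sum(counts)]

def run_sequential(inputs, map_fn, reduce_fn):
    """Hypothetical single-process stand-in for the MapReduce library."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, inter_value in map_fn(in_key, in_value):
            groups[out_key].append(inter_value)   # group values by key
    # The reduce function receives its values through an iterator.
    return {key: reduce_fn(key, iter(values)) for key, values in sorted(groups.items())}

docs = [("doc1", "the quick fox"), ("doc2", "the lazy dog the end")]
counts = run_sequential(docs, map_fn, reduce_fn)
```

The user writes only `map_fn` and `reduce_fn`; grouping the intermediate values by key is the library's job.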


Implementation overview
❏ Many different implementations are possible.
❏ The right choice depends on the environment.
❏ Here we describe an implementation targeted to the computing environment in wide use at Google: large clusters of PCs connected via switched networks.
❏ In our environment:
    ❏ (1) Machines are typically dual-processor x86 machines running Linux, with 2-4 GB of memory per machine.

Cont..
❏ (2) Commodity networking hardware is used.
❏ (3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
❏ (4) Storage is provided by inexpensive IDE disks attached directly to individual machines.
❏ (5) Users submit jobs to a scheduling system.

Execution overview
❏ Map and reduce invocations are distributed across multiple PCs as follows:
    ❏ Partition the input key/value pairs into M chunks and run map() tasks in parallel.
    ❏ After the map() tasks are complete, merge all emitted values for each emitted intermediate key.
    ❏ Then partition the space of intermediate keys into R pieces (R is chosen by the user), and run reduce() in parallel.
❏ If map() or reduce() fails, a fault tolerance technique is used: the failed task is re-executed.
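The steps above can be sketched in one process as follows. The round-robin input split, the byte-sum partition function (a deterministic stand-in for a hash of the key modulo R) and all function names are assumptions for illustration; a real system runs the map and reduce tasks on different machines.

```python
from collections import defaultdict

def split_input(records, m):
    """Partition the input key/value pairs into M chunks (round-robin here)."""
    return [records[i::m] for i in range(m)]

def partition(key, r):
    """Assign an intermediate key to one of R reduce tasks."""
    return sum(key.encode("utf-8")) % r

def run_job(records, map_fn, reduce_fn, m, r):
    # Map phase: every chunk is an independent map task.
    buckets = [defaultdict(list) for _ in range(r)]
    for chunk in split_input(records, m):
        for in_key, in_value in chunk:
            for out_key, inter_value in map_fn(in_key, in_value):
                buckets[partition(out_key, r)][out_key].append(inter_value)
    # Reduce phase: every bucket is an independent reduce task and
    # produces one output partition.
    return [
        {key: reduce_fn(key, values) for key, values in sorted(bucket.items())}
        for bucket in buckets
    ]

def word_map(doc_name, text):
    return [(word, 1) for word in text.split()]

def word_reduce(word, counts):
    return sum(counts)

docs = [("d1", "a b a"), ("d2", "b c"), ("d3", "a c c")]
outputs = run_job(docs, word_map, word_reduce, m=2, r=3)
merged = {key: value for part in outputs for key, value in part.items()}
```

Because every occurrence of a key goes to the same bucket, each reduce task sees all values for its keys and the R outputs never overlap.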

Execution overview cont..
❏ The reduce step merges all intermediate values associated with the same intermediate key.

Parallel execution

Fault tolerance
❏ Worker failure
    ❏ The master pings every worker periodically.
    ❏ If there is no response from a worker within a certain amount of time, the master marks it as failed.
    ❏ Any map tasks completed by the worker are reset back to their initial idle state.
❏ Master failure
    ❏ It is easy to make the master write periodic checkpoints of the master data structures.
    ❏ If the master dies, a new copy can be started from the last checkpointed state.
    ❏ However, since there is only a single master, its failure is unlikely.
    ❏ The implementation therefore aborts the MapReduce computation if the master fails.
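A minimal sketch of the worker-failure handling, assuming a master that records the time of each ping reply. The class, method names and timeout value are hypothetical. Resetting in-progress tasks as well, and leaving completed reduce tasks alone, follows the MapReduce paper (map output lives on the failed worker's local disk, reduce output in the global file system) rather than the slide.

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"

class Master:
    """Toy master that tracks worker liveness and task state."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_reply = {}  # worker id -> time of last ping reply
        self.tasks = {}       # task id -> dict(kind, state, worker)

    def add_task(self, task_id, kind, worker, state):
        self.tasks[task_id] = {"kind": kind, "state": state, "worker": worker}

    def ping_reply(self, worker, now):
        self.last_reply[worker] = now

    def check_workers(self, now):
        """Mark silent workers as failed and make their tasks schedulable again."""
        failed = [w for w, t in self.last_reply.items() if now - t > self.timeout_s]
        for worker in failed:
            del self.last_reply[worker]
            for task in self.tasks.values():
                if task["worker"] != worker:
                    continue
                # Completed map tasks are redone; completed reduce tasks are kept.
                if task["kind"] == "map" or task["state"] == IN_PROGRESS:
                    task["state"], task["worker"] = IDLE, None
        return failed

master = Master(timeout_s=10)
master.ping_reply("w1", now=0)
master.ping_reply("w2", now=0)
master.add_task("map-1", "map", "w1", COMPLETED)
master.add_task("reduce-1", "reduce", "w1", COMPLETED)
master.add_task("map-2", "map", "w2", IN_PROGRESS)
master.ping_reply("w2", now=15)   # w2 keeps answering, w1 goes silent
failed = master.check_workers(now=20)
```

Passing `now` explicitly instead of reading the clock keeps the sketch deterministic; a real master would use wall-clock time and actual network pings.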

Performance
❏ Measure the performance of MapReduce on two computations running on a large cluster of machines.
❏ MR_GrepScan
    ❏ Searches through approximately one terabyte of data looking for a particular pattern.
❏ MR_Sort
    ❏ Sorts approximately one terabyte of data.

Performance cont..

MR_GrepScan
❏ Scans 10 billion 100-byte records, searching for a rare 3-character pattern (it occurs in 92,337 records).
❏ The input is split into approximately 64 MB pieces (M = 15000), and the entire output is placed in one file (R = 1).
❏ Startup overhead is significant for short jobs.
❏ Figure: data transfer rate over time.
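The numbers on this slide are consistent with each other, as a quick calculation shows. Reading "64 MB" as 64 MiB is an assumption made here.

```python
RECORDS = 10**10            # 10 billion records
RECORD_BYTES = 100
SPLIT_BYTES = 64 * 2**20    # 64 MB, read here as 64 MiB (an assumption)

total_bytes = RECORDS * RECORD_BYTES           # 10**12 bytes, about one terabyte
map_tasks = -(-total_bytes // SPLIT_BYTES)     # ceiling division
```

`map_tasks` comes out just under 15,000, which matches the slide's rounded M = 15000.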

MR_Sort
❏ Backup tasks improve completion time reasonably.
❏ The system handles machine failures relatively quickly.

Disadvantages
❏ It can be difficult to use for iterative computations, such as statistical inference for machine learning.
❏ Data parallelism is key
    ❏ It must be possible to break the problem up into data chunks.
❏ MapReduce is closed-source (to Google) C++
    ❏ Hadoop is an open-source, Java-based rewrite.

Conclusion
❏ MapReduce has proven to be a useful abstraction.
❏ It greatly simplifies large-scale computations.
❏ Fun to use: focus on the problem, let the library deal with messy details.
❏ No big need for parallelization knowledge (it relieves the user from dealing with low-level parallelization details).

References
❏ J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004. (Paper and slides)
❏ The Beckman Report on Database Research (2016)
❏ Dan Weld, U. Washington (tutorial & slides)
❏ Ruoming Jin, Ge Yang, and Gagan Agrawal. Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance (pdf, 2004)
❏ HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads [R4]

Thank you!