IN5030 Protocols and routing in the internet
Big Data
The Beckman report on database research.
MapReduce: Simplified data processing on large clusters.
By Priyanka Srinivas Krishna. 28/02/2020
Part 1 - The Beckman report
❏ Every few years a group of database researchers meets to discuss the state of database research, its impact on practice, and important new directions.
❏ This report summarizes the discussion and conclusions of the meeting.
❏ The meeting participants quickly converged on big data as a defining challenge.
❏ Big data arose from the confluence of three major trends:
❏ 1) It has become much cheaper to generate a wide variety of data.
❏ 2) It has become much cheaper to process large amounts of data.
❏ 3) Data management has become more democratized.
○ The process of generating, processing, and consuming data is no longer just for database professionals.
○ Decision makers, domain scientists, application users, journalists, and everyday consumers now routinely do it.
❏ Due to these trends, an unprecedented volume of data needs to be captured, stored, queried, processed, and turned into knowledge.
Content of the Report
● Characteristics of Big Data
● Research Challenges
● Community Challenges
● Conclusion
The 5 Vs: Characteristics of Big Data
Research challenges
1. Scalable big/fast data infrastructure.
2. Coping with diversity in data management.
3. End-to-end processing of data.
4. Cloud services.
5. Role of people in the data life cycle.
Challenge 1: Scalable big data infrastructure
Scalable big data infrastructure
❏ Parallel and distributed processing
○ Large-scale distributed file systems and higher-level languages are seeing rapid adoption for processing less structured data, even in traditional enterprises.
❏ Query processing and optimization
○ Processing big data needs powerful, cost-aware query optimizers and set-oriented query execution engines.
❏ New hardware
○ In addition to clusters of general-purpose multicore processors, more specialized processors should be considered.
Scalable big data infrastructure cont.
❏ Cost-efficient storage
○ The database research community must learn how best to leverage emerging memory and storage technologies.
❏ High-speed data streams
○ For data that arrives at ever-higher speeds, new scalable techniques for ingesting and processing streams of data will be needed.
❏ Late-bound schemas
○ For data that is persisted but processed just once, it makes little sense to pay the substantial price of storing and indexing it. Instead, it should be stored as a binary file.
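The late-bound schema idea can be sketched as follows: records are appended to a file as raw bytes, with no parsing or indexing, and a schema is applied only if and when the data is read. The record layout (newline-delimited JSON), the field names, and the helper names `ingest` and `read_with_schema` are assumptions made for illustration; the report does not prescribe any of them.

```python
import io
import json
from typing import BinaryIO, Iterator


def ingest(raw_records: list[bytes], sink: BinaryIO) -> None:
    """Append raw records as-is: no parsing, validation, or indexing at write time."""
    for record in raw_records:
        sink.write(record + b"\n")


def read_with_schema(source: BinaryIO, schema: dict[str, type]) -> Iterator[dict]:
    """Bind a schema only at read time; skip records that do not fit it."""
    for line in source:
        try:
            raw = json.loads(line)
            yield {name: convert(raw[name]) for name, convert in schema.items()}
        except (ValueError, KeyError, TypeError):
            continue  # malformed or incomplete record: the cost is paid only here


store = io.BytesIO()  # stands in for a binary file on disk (assumption)
ingest([b'{"sensor": "s1", "temp": "21.5"}', b"not json", b'{"sensor": "s2"}'], store)
store.seek(0)
rows = list(read_with_schema(store, {"sensor": str, "temp": float}))
print(rows)
```

Writing stays cheap because nothing is interpreted on ingest; a different reader could apply a different schema to the same bytes later.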
Challenge 2: Diversity in data management
❏ No one-size-fits-all
❏ Cross-platform integration
○ Integration of platforms
○ Hiding heterogeneity
○ Optimization of performance
❏ Programming models
○ Diversity in programming abstractions and reusability
○ More than one language is needed
○ Focus on domain-specific languages
❏ Data processing workflows
○ Platforms that can span both "raw" and "cooked" data
Challenge 3: End-to-end processing of data
❏ Data-to-knowledge pipeline
○ The steps of the raw-data-to-knowledge pipeline will be largely unchanged: data acquisition; selection, assessment, cleaning, and transformation; extraction and integration; etc.
○ What changes is the greater diversity of data and users.
❏ Tool diversity
○ Multiple tools are needed to solve each step of the raw-data-to-knowledge pipeline.
❏ Tool customizability
○ Tools should be able to exploit domain knowledge, such as dictionaries, knowledge bases, and rules.
❏ Handcrafted rules are needed alongside machine learning.
❏ Appropriate meta-information must be captured and managed.
❏ E.g., Facebook automatically identifies faces in an image so users can optionally tag them.
❏ Knowledge bases
○ The more knowledge tools have about a target domain, the better they can analyze it.
Challenge 4: Cloud services
Some of the critical challenges in realizing the vision of data PaaS in the cloud
❏ Elasticity
○ Whether the same cloud storage service can support both transactions and analytics.
❏ System administration
○ All administrative tasks must be automated.
○ Resource control parameters must also be set automatically and be highly responsive to changes in load.
❏ Multitenancy
○ The implementation challenge is to ensure performance isolation between tenants, so that a burst of demand from one tenant does not cause a violation of other tenants' SLAs.
❏ Data sharing
○ How to find high-quality data in the cloud.
○ How to share data at fine-grained levels, how to distribute costs when sharing computing and data, and how to price data.
○ How to protect data if the current cloud provider fails, and how to preserve data for the long term when users who need it have no personal or financial connection to those who provide it.
❏ Hybrid cloud
○ Cyber-physical systems involve data streaming from multiple sensors and mobile devices, and must cope with intermittent connectivity and limited battery life, which pose difficult challenges for real-time and perhaps mission-critical data management in the cloud.
❏ Hybrid cloud cont.
○ Database services need to interoperate between the cloud and on-premise servers.
○ Users may run applications in their private cloud during normal operation, but tap into a public cloud at peak times or in response to unanticipated workload surges.
Challenge 5: Roles of humans in the data life cycle
❏ Data producers
○ Develop algorithms and incentives that guide people to produce and share the most useful data, while maintaining the desired level of data privacy.
❏ Data curators
○ Obtain high-quality datasets from often-imperfect human curators.
❏ Data curators cont.
○ For these people-centric challenges, data provenance and explanation will be crucial, as will privacy and security.
○ We need to build platforms that allow people to curate data easily and extend relevant applications to incorporate such curation.
❏ Data consumers
○ People want to use messier data in complex ways, raising many challenges.
○ In the enterprise, data consumers usually know how to pose SQL queries over a structured database.
○ Today's data consumers may not know how to formulate a query at all. An example is a journalist who wants to "find the average temperature of all cities with population over 100,000 in Florida" over a structured dataset.
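For comparison, this is roughly what the journalist's question looks like once someone has formulated it as SQL. The table layout (`cities` with `name`, `state`, `population`, `avg_temp_c`) and the sample rows are invented for illustration; only the question itself comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (name TEXT, state TEXT, population INTEGER, avg_temp_c REAL)"
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?)",
    [
        ("City A", "Florida", 250_000, 24.0),  # made-up rows
        ("City B", "Florida", 120_000, 22.0),
        ("City C", "Florida", 40_000, 30.0),   # filtered out: population too small
        ("City D", "Georgia", 500_000, 10.0),  # filtered out: wrong state
    ],
)

query = """
    SELECT AVG(avg_temp_c)
    FROM cities
    WHERE state = 'Florida' AND population > 100000
"""
(average_temp,) = conn.execute(query).fetchone()
print(average_temp)
```

The query is short, but writing it requires knowing the schema, the filter syntax, and the aggregate function, which is exactly the knowledge a non-expert data consumer may lack.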
❏ Data consumers cont.
○ Enabling people to get such answers themselves requires new query interfaces.
○ We need multimodal interfaces that combine visualization, querying, and navigation.
○ When the query to ask is not clear, people need other ways to browse, explore, visualize, and mine the data, to make data consumption easier.
❏ Online communities
○ People want to create, share, and manage data with other community members.
○ They may want to collaboratively build community-specific knowledge bases, wikis, and tools to process data.
○ Our challenge is to build tools to help communities produce usable data as well as to exploit, share, and mine it.
Community challenges
❏ The database field faces many community issues.
❏ Database education
○ The database technology taught in standard database courses today is increasingly disconnected from reality. It is rooted in the 1980s.
○ The database curriculum needs to be rethought.
❏ Data science
○ Data scientists need skills not only in data management, but also in business intelligence, computer systems, mathematics, statistics, machine learning, and optimization.
Community challenges cont.
❏ Research culture
○ Finally, there is much concern over the increased emphasis on citation counts instead of research impact.
○ To pursue the big data agenda effectively, the field needs to return to a state where fewer publications per researcher per time unit is the norm, and where large systems projects, end-to-end tool sets, and data sharing are more highly valued.
Conclusion
❏ It is an exciting time for database research. In the past, database research has been restricted by the rigors of the enterprise and relational database systems.
❏ There are exciting new research challenges: processing big data; handling data diversity; and exploiting new hardware, software, and cloud-based platforms.
❏ It is also time to rethink approaches to education, involvement with data consumers, and our value system and its impact on how we evaluate research.
Part 2 - MapReduce: Simplified data processing on large clusters
❏ [Figure: what happens in one internet minute]
❏ Data is growing ever faster.
❏ For massive amounts of data from social media or any other relevant source, big data analysis is the preferred approach.
❏ Technologies such as Hadoop, YARN, NoSQL stores, Hive, and Spark are widely used to extract useful insights hidden inside the data.
❏ This part looks at how MapReduce, the core of Hadoop, works.
What is MapReduce
❏ MapReduce is a programming model and an associated implementation for processing and generating big data sets.
❏ It is a parallel processing technique rather than a serial one.
❏ The work is distributed over a cluster of commodity machines.
❏ A MapReduce program consists of two functions, Map() and Reduce().
❏ The Map function performs actions like filtering, grouping, and sorting.
❏ The Reduce function performs aggregation.
Programming model
❏ Input and output: each a set of key/value pairs
❏ The programmer specifies two functions:
❏ map(in_key, in_value) -> list(out_key, intermediate_value)
○ Processes an input key/value pair
○ Produces a set of intermediate pairs
❏ reduce(out_key, list(intermediate_value)) -> list(out_value)
○ Combines all intermediate values for a particular key
○ Produces a set of merged output values (usually just one)
Example: count word occurrences
❏ Data is passed as key/value pairs.
❏ MapReduce refers to two separate and distinct tasks:
○ The map job: input in the input format, output in the intermediate format
○ The reduce job: input in the intermediate format (supplied through an intermediate value iterator), output in the output format
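The word-count example can be sketched in a few lines of Python. This is a single-process illustration of the programming model, not Google's C++ library or Hadoop's API: the names `map_fn`, `reduce_fn`, and `run_job`, and the two sample documents, are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Iterator


def map_fn(doc_name: str, contents: str) -> Iterator[tuple[str, int]]:
    """Map: emit the intermediate pair (word, 1) for every word in one document."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: Iterator[int]) -> int:
    """Reduce: sum all counts emitted for one word (values arrive via an iterator)."""
    return sum(counts)


def run_job(documents: dict[str, str]) -> dict[str, int]:
    """Single-process stand-in for the library: map, group by key, then reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            grouped[key].append(value)
    return {key: reduce_fn(key, iter(values)) for key, values in sorted(grouped.items())}


counts = run_job({"a.txt": "the quick fox", "b.txt": "the lazy dog the end"})
print(counts)
```

The user writes only `map_fn` and `reduce_fn`; everything inside `run_job` is what a real MapReduce library would do across many machines.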
Implementation overview
❏ Many different implementations are possible.
❏ The right choice depends on the environment.
❏ The implementation described here targets the computing environment in wide use at Google: large clusters of PCs connected via switched networks.
❏ In that environment:
❏ (1) Machines are typically dual-processor x86 machines running Linux, with 2-4 GB of memory per machine.
❏ (2) Commodity networking hardware is used.
❏ (3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
❏ (4) Storage is provided by inexpensive IDE disks attached directly to individual machines.
❏ (5) Users submit jobs to a scheduling system.
Execution overview
❏ Map and reduce invocations are distributed across multiple PCs as follows:
❏ The input key/value pairs are partitioned into M chunks, and the map() tasks run in parallel.
❏ After the map() tasks complete, all emitted values are merged for each emitted intermediate key.
❏ The space of intermediate keys is then partitioned into R pieces (R is set by the user), and the reduce() tasks run in parallel.
❏ If a map() or reduce() task fails, a fault tolerance technique is used.
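The flow above can be sketched with M input splits and R reduce partitions. The sketch runs in one process and uses word counting as the job; `split_input`, `partition`, and `run` are hypothetical names, and `zlib.crc32` stands in for the hash function (the MapReduce paper's default partitioning is hash(key) mod R, but the concrete hash here is an assumption).

```python
import zlib
from collections import defaultdict


def split_input(records: list[str], m: int) -> list[list[str]]:
    """Partition the input into M roughly equal splits, one per map task."""
    return [records[i::m] for i in range(m)]


def partition(key: str, r: int) -> int:
    """Assign an intermediate key to one of R reduce tasks: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % r


def run(records: list[str], m: int, r: int) -> list[dict[str, int]]:
    # Map phase: each split is independent (processed in parallel on a real cluster).
    buckets: list[dict[str, list[int]]] = [defaultdict(list) for _ in range(r)]
    for split in split_input(records, m):
        for line in split:
            for word in line.split():
                buckets[partition(word, r)][word].append(1)
    # Reduce phase: each of the R partitions is reduced independently.
    return [{key: sum(values) for key, values in bucket.items()} for bucket in buckets]


outputs = run(["a b a", "c a", "b c", "a"], m=2, r=3)
merged = {key: count for part in outputs for key, count in part.items()}
print(merged)
```

Because the partition function depends only on the key, every value for a given key lands in the same reduce partition, so each reducer can work without talking to the others.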
Execution overview
❏ The MapReduce library merges all intermediate values associated with the same intermediate key.
Parallel execution
Fault tolerance
❏ Worker failure
○ The master pings every worker periodically.
○ If a worker does not respond within a certain amount of time, the master marks it as failed.
○ Any map tasks completed by that worker are reset to their initial idle state.
❏ Master failure
○ It is easy to make the master write periodic checkpoints of its data structures.
○ If the master dies, a new copy can be started from the last checkpointed state.
○ However, there is only a single master, so its failure is unlikely.
○ The implementation therefore aborts the MapReduce computation if the master fails.
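The worker-failure handling can be sketched as a small bookkeeping structure on the master. The class and field names (`Master`, `Task`, `record_reply`, `check_workers`) and the timeout value are assumptions for illustration; the reset rule follows the MapReduce paper, where completed map output sits on the failed machine's local disk while completed reduce output is already in the global file system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    task_id: int
    kind: str                      # "map" or "reduce"
    state: str = "idle"            # "idle", "in_progress", or "completed"
    worker: Optional[str] = None


@dataclass
class Master:
    timeout: float
    tasks: list[Task] = field(default_factory=list)
    last_reply: dict[str, float] = field(default_factory=dict)
    failed: set[str] = field(default_factory=set)

    def record_reply(self, worker: str, now: float) -> None:
        """Remember when a worker last answered a ping."""
        self.last_reply[worker] = now

    def check_workers(self, now: float) -> None:
        """Mark silent workers as failed and make their tasks schedulable again."""
        for worker, seen in self.last_reply.items():
            if worker not in self.failed and now - seen > self.timeout:
                self.failed.add(worker)
                self._reset_tasks(worker)

    def _reset_tasks(self, worker: str) -> None:
        for task in self.tasks:
            if task.worker != worker:
                continue
            # In-progress tasks and completed map tasks must be redone;
            # completed reduce tasks keep their output.
            if task.state == "in_progress" or (task.kind == "map" and task.state == "completed"):
                task.state, task.worker = "idle", None


master = Master(timeout=10.0, tasks=[
    Task(0, "map", "completed", "w1"),
    Task(1, "reduce", "completed", "w1"),
    Task(2, "map", "in_progress", "w2"),
])
master.record_reply("w1", now=0.0)
master.record_reply("w2", now=12.0)
master.check_workers(now=15.0)
states = [task.state for task in master.tasks]
print(master.failed, states)
```

Here worker `w1` has been silent for longer than the timeout, so its completed map task goes back to idle while its completed reduce task is left alone.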
Performance
❏ The performance of MapReduce is measured on two computations running on a large cluster of machines.
❏ MR_Grep
○ Searches through approximately one terabyte of data looking for a particular pattern.
❏ MR_Sort
○ Sorts approximately one terabyte of data.
MR_Grep
❏ Scans 10 billion 100-byte records, searching for a rare 3-character pattern (it occurs in 92,337 records).
❏ The input is split into pieces of approximately 64 MB (M = 15000); the entire output is placed in one file (R = 1).
❏ Startup overhead is significant for short jobs.
[Figure: data transfer rate over time]
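The numbers on this slide can be checked with a little arithmetic. The sketch assumes that "64 MB" means 64 × 2^20 bytes, which the slide does not state.

```python
import math

records = 10_000_000_000          # 10 billion records
record_size = 100                 # bytes per record
split_size = 64 * 2**20           # 64 MB, assuming binary megabytes
matching_records = 92_337

total_bytes = records * record_size               # about one terabyte
num_splits = math.ceil(total_bytes / split_size)  # close to the quoted M = 15000
match_fraction = matching_records / records       # how rare the pattern is

print(total_bytes, num_splits, match_fraction)
```

The input is 10^12 bytes, which gives 14,902 splits of 64 MB, consistent with the rounded figure M = 15000; fewer than one record in 100,000 matches the pattern.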
MR_Sort
❏ Backup tasks noticeably reduce completion time.
❏ The system recovers from machine failures relatively quickly.
Disadvantages
❏ It can be difficult to use for iterative computations, such as statistical inference in machine learning.
❏ Data parallelism is key.
○ The problem must be decomposable into chunks of data.
❏ Google's MapReduce is closed-source and written in C++.
○ Hadoop is an open-source, Java-based reimplementation.
Conclusion
❏ MapReduce has proven to be a useful abstraction.
❏ It greatly simplifies large-scale computations.
❏ Fun to use: focus on the problem and let the library deal with the messy details.
❏ Little parallelization knowledge is needed: the library relieves the user of low-level parallelization details.
References
❏ J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004. (Paper and slides)
❏ The Beckman Report (2016)
❏ Dan Weld, U. Washington (tutorial and slides)
❏ Ruoming Jin, Ge Yang, and Gagan Agrawal. Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance (PDF, 2004)
❏ HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads [R4]
Thank you!