Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
CHAPTER 25
Big Data Technologies Based on MapReduce and Hadoop
Introduction
- Phenomenal growth in data generation
  - Social media
  - Sensors
  - Communications networks and satellite imagery
  - User-specific business data
- “Big data” refers to massive amounts of data
  - Exceeds the typical reach of a DBMS
- Big data analytics
Slide 25- 3
25.1 What is Big Data?
- Big data ranges from terabytes (10^12 bytes) or petabytes (10^15 bytes) to exabytes (10^18 bytes)
- Volume
  - Refers to size of data managed by the system
- Velocity
  - Speed of data creation, ingestion, and processing
- Variety
  - Refers to type of data source
  - Structured, unstructured
Slide 25- 4
What is Big Data? (cont’d.)
- Veracity
  - Credibility of the source
  - Suitability of data for the target audience
  - Evaluated through quality testing or credibility analysis
Slide 25- 5
25.2 Introduction to MapReduce and Hadoop
- Core components of Hadoop
  - MapReduce programming paradigm
  - Hadoop Distributed File System (HDFS)
- Hadoop originated from the quest for an open source search engine
  - Developed by Cutting and Cafarella in 2004
  - Cutting joined Yahoo in 2006
  - Yahoo spun off a Hadoop-centered company in 2011
  - Tremendous growth
Slide 25- 6
Introduction to MapReduce and Hadoop (cont’d.)
- MapReduce
  - Fault-tolerant implementation and runtime environment
  - Developed by Dean and Ghemawat at Google in 2004
  - Programming style: map and reduce tasks
    - Automatically parallelized and executed on large clusters of commodity hardware
  - Allows programmers to analyze very large datasets
  - Underlying data model assumed: key-value pair
Slide 25- 7
The MapReduce Programming Model
- Map
  - Generic function that takes a key of type K1 and value of type V1
  - Returns a list of key-value pairs of type K2 and V2
- Reduce
  - Generic function that takes a key of type K2 and a list of values V2 and returns pairs of type (K3, V3)
  - Outputs from the map function must match the input type of the reduce function
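The type contract above can be sketched in Python. This is a minimal single-process illustration, not Hadoop's API: `run_mapreduce`, `max_map`, and `max_reduce` are hypothetical names, the sample readings are made up, and the in-memory grouping step merely stands in for the framework's shuffle.

```python
from collections import defaultdict
from typing import Callable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")

# map:    (K1, V1)       -> list of (K2, V2)
# reduce: (K2, list[V2]) -> list of (K3, V3)
Mapper = Callable[[K1, V1], List[Tuple[K2, V2]]]
Reducer = Callable[[K2, List[V2]], List[Tuple[K3, V3]]]


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory driver: map, group by K2, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # reducers see keys in sorted order
        output.extend(reducer(k2, groups[k2]))
    return output


def max_map(station, reading):
    # (K1, V1) = (station id, reading); emits (K2, V2) of the same shape
    return [(station, reading)]


def max_reduce(station, readings):
    # (K2, list[V2]) -> one (K3, V3) pair holding the maximum reading
    return [(station, max(readings))]


readings = [("s1", 12), ("s2", 7), ("s1", 30), ("s2", 9)]
result = run_mapreduce(readings, max_map, max_reduce)
```

The map output type `(K2, V2)` is exactly what the reduce function consumes, which is the matching requirement stated in the last bullet.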
Slide 25- 8
The MapReduce Programming Model (cont’d.)
Slide 25-9
Figure 25.1 Overview of MapReduce execution (Adapted from T. White, 2012)
The MapReduce Programming Model (cont’d.)
- MapReduce example
  - Make a list of frequencies of words in a document
  - Pseudocode
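The slide's pseudocode is not reproduced in this transcript. As a stand-in, the word-frequency example might look like the following Python sketch. It runs in a single process; the function names `wc_map`, `wc_reduce`, and `word_count`, the tokenizing regular expression, and the sample documents are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def wc_map(doc_id, text):
    """Map: emit (word, 1) for every word occurrence in the document."""
    return [(word, 1) for word in re.findall(r"[a-z0-9']+", text.lower())]


def wc_reduce(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    return (word, sum(counts))


def word_count(documents):
    """Group map output by word (a stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in wc_map(doc_id, text):
            grouped[word].append(one)
    return dict(wc_reduce(word, ones) for word, ones in grouped.items())


counts = word_count({"d1": "the quick fox", "d2": "The lazy dog and the fox"})
```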
Slide 25- 10
The MapReduce Programming Model (cont’d.)
- MapReduce example (cont’d.)
  - Actual MapReduce code
Slide 25- 11
The MapReduce Programming Model (cont’d.)
- Distributed grep
  - Looks for a given pattern in a file
  - Map function emits a line if it matches a supplied pattern
  - Reduce function is an identity function
- Reverse Web-link graph
  - Outputs (target URL, source URL) pairs for each link to a target page found in a source page
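Both examples above can be sketched in a few lines of Python. The sketch below is illustrative only: the function names, the crude `href` regular expression, and the sample inputs are assumptions, and the reduce step for the link graph (collecting all sources of a target into one list) follows the usual description of this example rather than anything stated on the slide.

```python
import re
from collections import defaultdict


def grep_map(offset, line, pattern):
    """Map: emit the line only if it matches the supplied pattern."""
    return [(offset, line)] if re.search(pattern, line) else []


def grep_reduce(offset, lines):
    """Reduce: identity, passes matched lines through unchanged."""
    return [(offset, line) for line in lines]


def links_map(source_url, html):
    """Map: emit (target URL, source URL) for each link found in the page."""
    return [(target, source_url) for target in re.findall(r'href="([^"]+)"', html)]


def links_reduce(target_url, sources):
    """Reduce (assumed): gather every source that links to the target."""
    return (target_url, sorted(set(sources)))


# Distributed grep over three lines of a hypothetical log file.
lines = ["error: disk full", "ok", "error: timeout"]
matches = []
for offset, line in enumerate(lines):
    for key, value in grep_map(offset, line, r"^error"):
        matches.extend(grep_reduce(key, [value]))

# Reverse Web-link graph over two hypothetical pages.
pages = {
    "a.html": '<a href="c.html">c</a>',
    "b.html": '<a href="c.html">c</a> <a href="a.html">a</a>',
}
inbound = defaultdict(list)
for source, html in pages.items():
    for target, src in links_map(source, html):
        inbound[target].append(src)
reverse_graph = dict(links_reduce(t, s) for t, s in inbound.items())
```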
Slide 25- 12
The MapReduce Programming Model (cont’d.)
- Inverted index
  - Builds an inverted index based on all words present in a document repository
  - Map function parses each document
    - Emits a sequence of (word, document_id) pairs
  - Reduce function takes all pairs for a given word and sorts them by document_id
- Job
  - Code for Map and Reduce phases, a set of artifacts, and properties
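The inverted-index bullets above translate into a short Python sketch. The names `index_map`, `index_reduce`, and `build_inverted_index` are hypothetical, the two-document repository is made up, and removing duplicate document ids is an added assumption; the slide only asks for sorting by document_id.

```python
import re
from collections import defaultdict


def index_map(document_id, text):
    """Map: parse the document and emit (word, document_id) pairs."""
    return [(word, document_id) for word in re.findall(r"\w+", text.lower())]


def index_reduce(word, document_ids):
    """Reduce: sort the ids for one word (deduplication is an assumption)."""
    return (word, sorted(set(document_ids)))


def build_inverted_index(repository):
    """Group pairs by word (a stand-in for the shuffle), then reduce."""
    postings = defaultdict(list)
    for document_id, text in repository.items():
        for word, doc in index_map(document_id, text):
            postings[word].append(doc)
    return dict(index_reduce(word, ids) for word, ids in sorted(postings.items()))


index = build_inverted_index({2: "hadoop stores data", 1: "hadoop runs mapreduce"})
```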
Slide 25- 13
The MapReduce Programming Model (cont’d.)
- Hadoop releases
  - 1.x features
    - Continuation of the original code base
    - Additions include security, additional HDFS and MapReduce improvements
  - 2.x features
    - YARN (Yet Another Resource Negotiator)
    - A new MR runtime that runs on top of YARN
    - Improved HDFS that supports federation and increased availability
Slide 25- 14
25.3 Hadoop Distributed File System (HDFS)
- HDFS
  - File system component of Hadoop
  - Designed to run on a cluster of commodity hardware
  - Patterned after UNIX file system
  - Provides high-throughput access to large datasets
  - Stores metadata on NameNode server
  - Stores application data on DataNode servers
    - File content replicated on multiple DataNodes
Slide 25- 15
Hadoop Distributed File System (cont’d.)
- HDFS design assumptions and goals
  - Hardware failure is the norm
  - Batch processing
  - Large datasets
  - Simple coherency model
- HDFS architecture
  - Master-slave
  - Decouples metadata from data operations
  - Replication provides reliability and high availability
  - Network traffic minimized
Slide 25- 16
Hadoop Distributed File System (cont’d.)
- NameNode
  - Maintains image of the file system
    - i-nodes and corresponding block locations
  - Changes maintained in write-ahead commit log called Journal
- Secondary NameNodes
  - Checkpointing role or backup role
- DataNodes
  - Store blocks in the node’s native file system
  - Periodically report state to the NameNode
Slide 25- 17
Hadoop Distributed File System (cont’d.)
- File I/O operations
  - Single-writer, multiple-reader model
  - Files cannot be updated, only appended
  - Write pipeline set up to minimize network utilization
- Block placement
  - Nodes of Hadoop cluster typically spread across many racks
  - Nodes on a rack share a switch
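Rack awareness matters because replicas on one rack share a switch and so share a failure point. The sketch below illustrates a three-replica placement along the lines of HDFS's commonly described default (first replica on the writer's node, second on a node in another rack, third on a different node of that second rack). That policy is not stated on the slide and is an assumption here, as are the function name `place_replicas`, the node and rack names, and the deterministic "first in sorted order" choice, which replaces the random selection a real cluster would use.

```python
def place_replicas(writer, topology):
    """Pick three nodes for a block's replicas.

    topology maps node name -> rack name (hypothetical toy cluster).
    """
    writer_rack = topology[writer]
    # Candidate nodes on any rack other than the writer's.
    remote = sorted(n for n, rack in topology.items() if rack != writer_rack)
    if not remote:
        raise ValueError("need at least two racks")
    second = remote[0]
    # Third replica: a different node on the same rack as the second.
    peers = [n for n in remote if n != second and topology[n] == topology[second]]
    if not peers:
        raise ValueError("the remote rack needs at least two nodes")
    return [writer, second, peers[0]]


topology = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
replicas = place_replicas("n1", topology)
racks_used = {topology[node] for node in replicas}
```

With this layout the block survives the loss of either rack, while only one of the two replica transfers has to cross between racks.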
Slide 25- 18
Hadoop Distributed File System (cont’d.)
- Replica management
  - NameNode tracks number of replicas and block location
    - Based on block reports
  - Replication priority queue contains blocks that need to be replicated
- HDFS scalability
  - Yahoo cluster achieved 14 petabytes, 4000 nodes, 15k clients, and 600 million files
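The replica-management bullets above (block reports feeding a replication priority queue) can be sketched as follows. This is a toy model, not NameNode code: `build_replication_queue`, the target of three replicas, the sample reports, and the "fewest live replicas first" ordering are assumptions chosen for illustration.

```python
import heapq
from collections import defaultdict


def build_replication_queue(block_reports, target_replicas=3):
    """Derive block locations from DataNode reports and queue short blocks.

    block_reports maps DataNode name -> iterable of block ids it holds.
    Returns (locations, queue); queue is a heap of (replica_count, block_id).
    """
    locations = defaultdict(set)
    for datanode, blocks in block_reports.items():
        for block in blocks:
            locations[block].add(datanode)
    queue = [
        (len(nodes), block)
        for block, nodes in locations.items()
        if len(nodes) < target_replicas
    ]
    heapq.heapify(queue)  # smallest replica count is served first
    return locations, queue


reports = {"dn1": ["b1", "b2", "b3"], "dn2": ["b1", "b2"], "dn3": ["b1"]}
locations, queue = build_replication_queue(reports)
order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
```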
Slide 25- 19
The Hadoop Ecosystem
- Related projects with additional functionality
  - Pig and Hive
    - Provide higher-level interfaces for working with the Hadoop framework
  - Oozie
    - Service for scheduling and running workflows of jobs
  - Sqoop
    - Library and runtime environment for efficiently moving data between relational databases and HDFS
Slide 25- 20
The Hadoop Ecosystem (cont’d.)
- Related projects with additional functionality (cont’d.)
  - HBase
    - Column-oriented key-value store that uses HDFS
Slide 25- 21
25.4 MapReduce: Additional Details
- MapReduce runtime environment
  - JobTracker
    - Master process
    - Responsible for managing the life cycle of Jobs and scheduling Tasks on the cluster
  - TaskTracker
    - Slave process
    - Runs on all Worker nodes of the cluster
Slide 25- 22
MapReduce: Additional Details (cont’d.)
- Overall flow of a MapReduce job
  - Job submission
  - Job initialization
  - Task assignment
  - Task execution
  - Job completion
Slide 25- 23
MapReduce: Additional Details (cont’d.)
- Fault tolerance in MapReduce
  - Task failure
    - Runtime exception
    - Java virtual machine crash
    - No timely updates from the task process
  - TaskTracker failure
    - Crash or disconnection from JobTracker
    - Failed Tasks are rescheduled
  - JobTracker failure
    - Not a recoverable failure in Hadoop v1
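Rescheduling a failed Task amounts to retrying it, up to some limit, before the whole Job is declared failed. A minimal sketch of that idea follows; `run_with_retries`, `flaky_task`, and the limit of four attempts are assumptions made for illustration and are not taken from the slide.

```python
def run_with_retries(task, max_attempts=4):
    """Run task(attempt) until it succeeds or the attempt limit is reached.

    Returns (result, failures), where failures lists (attempt, reason).
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), failures
        except Exception as exc:  # any task failure triggers a reschedule
            failures.append((attempt, str(exc)))
    raise RuntimeError(f"task failed {max_attempts} times: {failures}")


def flaky_task(attempt):
    """Hypothetical task that crashes on its first two attempts."""
    if attempt < 3:
        raise RuntimeError("simulated JVM crash")
    return "done"


result, failures = run_with_retries(flaky_task)
```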
Slide 25- 24
MapReduce: Additional Details (cont’d.)
- The shuffle procedure
  - Reducers get all the rows for a given key together
  - Map phase
    - Background thread partitions buffered rows based on the number of Reducers in the job and the Partitioner
    - Rows sorted on key values
    - Comparator or Combiner may be used
  - Copy phase
  - Reduce phase
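The three shuffle phases can be traced on a toy word-count job. The sketch below is a single-process illustration under stated assumptions: the byte-sum `partition` function is a made-up, deterministic stand-in for a hash Partitioner, and `combine`, `map_side_shuffle`, and `reduce_side` are hypothetical helper names.

```python
from collections import defaultdict


def partition(key, num_reducers):
    """Toy Partitioner: deterministic byte sum modulo the number of Reducers."""
    return sum(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Combiner: pre-aggregate counts per key and return them sorted by key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def map_side_shuffle(map_output, num_reducers):
    """Map phase: partition buffered rows, then sort and combine each partition."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        buckets[partition(key, num_reducers)].append((key, value))
    return [combine(bucket) for bucket in buckets]


def reduce_side(partitions_from_all_maps):
    """Reduce phase: merge the copied partitions so each key's rows are together."""
    merged = defaultdict(list)
    for part in partitions_from_all_maps:
        for key, value in part:
            merged[key].append(value)
    return {key: sum(values) for key, values in sorted(merged.items())}


map1 = map_side_shuffle([("a", 1), ("b", 1), ("a", 1)], 2)
map2 = map_side_shuffle([("b", 1), ("c", 1)], 2)
# Copy phase: reducer r fetches partition r from every map task.
reducer_inputs = [[m[r] for m in (map1, map2)] for r in range(2)]
results = [reduce_side(parts) for parts in reducer_inputs]
```

Because the Combiner runs on the map side, the first map task ships a single `("a", 2)` row instead of two `("a", 1)` rows, which is the network saving a Combiner is meant to provide.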
Slide 25- 25
MapReduce: Additional Details (cont’d.)
- Job scheduling
  - JobTracker schedules work on cluster nodes
  - Fair Scheduler
    - Provides fast response time to small jobs in a Hadoop shared cluster
  - Capacity Scheduler
    - Geared to meet needs of large enterprise customers
Slide 25- 26
MapReduce: Additional Details (cont’d.)
- Strategies for equi-joins in MapReduce environment
  - Sort-merge join
  - Map-side hash join
  - Partition join
  - Bucket joins
  - N-way map-side joins
  - Simple N-way joins
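Two of the strategies listed above can be contrasted in a small Python sketch. It is a single-process illustration only: `map_side_hash_join` assumes the smaller input fits in memory, `reduce_side_join` imitates a sort-merge join by tagging rows with their source and sorting on (key, tag) as the shuffle would, and the function names and sample department and employee rows are made up for illustration.

```python
from collections import defaultdict


def map_side_hash_join(small, large):
    """Map-side hash join: build a hash table on the small input, stream the large one."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    for key, value in large:
        for match in table.get(key, []):
            yield (key, match, value)


def reduce_side_join(left, right):
    """Sort-merge style join: tag rows by source, sort by (key, tag), pair per key."""
    tagged = [(k, 0, v) for k, v in left] + [(k, 1, v) for k, v in right]
    tagged.sort(key=lambda row: (row[0], row[1]))  # stand-in for the shuffle sort
    groups = defaultdict(lambda: ([], []))
    for key, tag, value in tagged:
        groups[key][tag].append(value)
    return [(k, l, r) for k, (ls, rs) in groups.items() for l in ls for r in rs]


departments = [(1, "Research"), (2, "Admin")]
employees = [(1, "Smith"), (2, "Wong"), (1, "Zelaya"), (3, "Borg")]
hash_join = list(map_side_hash_join(departments, employees))
merge_join = reduce_side_join(departments, employees)
```

Both functions return the same rows; the hash join avoids the sort entirely, but only works when one input is small enough to hold in memory on every mapper.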
Slide 25- 27
MapReduce: Additional Details (cont’d.)
- Apache Pig
  - Bridges the gap between declarative-style interfaces such as SQL, and rigid style required by MapReduce
  - Designed to solve problems such as ad hoc analyses of Web logs and clickstreams
  - Accommodates user-defined functions
Slide 25- 28
MapReduce: Additional Details (cont’d.)
- Apache Hive
  - Provides a higher-level interface to Hadoop using SQL-like queries
  - Supports processing of aggregate analytical queries typical of data warehouses
  - Developed at Facebook
Slide 25- 29
Hive System Architecture and Components
Slide 25-30
Figure 25.2 Hive system architecture and components
Advantages of the Hadoop/MapReduce Technology
- Disk seek rate a limiting factor when dealing with very large data sets
  - Limited by disk mechanical structure
- Transfer speed is an electronic feature and increasing steadily
- MapReduce processes large datasets in parallel
- MapReduce handles semistructured data and key-value datasets more easily
- Linear scalability
Slide 25- 31
25.5 Hadoop v2 (Alias YARN)
- Reasons for developing Hadoop v2
  - JobTracker became a bottleneck
  - Cluster utilization less than desirable
  - Different types of applications did not fit into the MR model
  - Difficult to keep up with new open source versions of Hadoop
Slide 25- 32
YARN Architecture
- Separates cluster resource management from Jobs management
- ResourceManager and NodeManager together form a platform for hosting any application on YARN
- ApplicationMasters send ResourceRequests to the ResourceManager, which then responds with cluster Container leases
- NodeManagers are responsible for managing Containers on their nodes
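The request and lease exchange described above can be modeled with a few toy classes. The class names below echo the slide's terms, but they are hypothetical Python stand-ins and not YARN's Java API; the memory-only resource model, the greedy node-by-node allocation, and the node capacities are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ResourceRequest:
    """What an ApplicationMaster asks for (toy model: memory only)."""
    app_id: str
    memory_mb: int
    num_containers: int


@dataclass
class Container:
    """A lease on part of one node's resources."""
    app_id: str
    node: str
    memory_mb: int


class ResourceManager:
    """Toy cluster-wide allocator that tracks free memory per node."""

    def __init__(self, node_capacity_mb):
        self.free = dict(node_capacity_mb)

    def allocate(self, request):
        """Grant up to request.num_containers leases, filling nodes in name order."""
        leases = []
        for node in sorted(self.free):
            while (len(leases) < request.num_containers
                   and self.free[node] >= request.memory_mb):
                self.free[node] -= request.memory_mb
                leases.append(Container(request.app_id, node, request.memory_mb))
        return leases


rm = ResourceManager({"node1": 4096, "node2": 2048})
leases = rm.allocate(ResourceRequest("app-1", 2048, 3))
```

Nothing in the allocator knows about map or reduce tasks, which reflects the separation of resource management from job management named in the first bullet.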
Slide 25- 33
Hadoop Version Schematics
Slide 25-34
Figure 25.3 The Hadoop v1 vs. Hadoop v2 schematic
Other Frameworks on YARN
- Apache Tez
  - Extensible framework being developed at Hortonworks for building high-performance applications in YARN
- Apache Giraph
  - Open-source implementation of Google’s Pregel system, a large-scale graph processing system used to calculate PageRank
- Hoya: HBase on YARN
  - More flexibility and improved cluster utilization
Slide 25- 35
25.6 General Discussion
- Hadoop/MapReduce versus parallel RDBMS
  - 2009: performance of two approaches measured
    - Parallel database took longer to tune compared to MR
    - Performance of parallel database 3-6 times faster than MR
  - MR improvements since 2009
  - Hadoop has upfront cost advantage
    - Open source platform
Slide 25- 36
General Discussion (cont’d.)
- MR able to handle semistructured datasets
  - Support for unstructured data on the rise in RDBMSs
- Higher level language support
  - SQL for RDBMSs
  - Hive has incorporated SQL features in HiveQL
- Fault-tolerance: advantage of MR-based systems
Slide 25- 37
General Discussion (cont’d.)
- Big data somewhat dependent on cloud technology
- Cloud model offers flexibility
  - Scaling out and scaling up
  - Distributed software and interchangeable resources
  - Unpredictable computing needs not uncommon in big data projects
  - High availability and durability
Slide 25- 38
General Discussion (cont’d.)
- Data locality issues
  - Network load a concern
  - Self-configurable, locality-based data and virtual machine management framework proposed
    - Enables access of data locally
  - Caching techniques also improve performance
- Resource optimization
  - Challenge: optimize globally across all jobs in the cloud rather than per-job resource optimizations
Slide 25- 39
General Discussion (cont’d.)
- YARN as a data service platform
  - Emerging trend: Hadoop as a data lake
    - Contains significant portion of enterprise data
    - Processing happens in the data lake itself
  - Support for SQL in Hadoop is improving
  - Apache Storm
    - Distributed scalable streaming engine
    - Allows users to process real-time data feeds
  - Storm on YARN and SAS on YARN
Slide 25- 40
General Discussion (cont’d.)
- Challenges faced by big data technologies
  - Heterogeneity of information
  - Privacy and confidentiality
  - Need for visualization and better human interfaces
  - Inconsistent and incomplete information
Slide 25- 41
General Discussion (cont’d.)
- Building data solutions on Hadoop
  - May involve assembling ETL (extract, transform, load) processing, machine learning, graph processing, and/or report creation
    - Programming models and metadata not unified
  - Analytics application developers must try to integrate services into coherent solution
  - Cluster a vast resource of main memory and flash storage
    - In-memory data engines
    - Spark platform from Databricks
Slide 25- 42
25.7 Summary
- Big data technologies at the center of data analytics and machine learning applications
- MapReduce
- Hadoop Distributed File System
- Hadoop v2 or YARN
  - Generic data services platform
- MapReduce/Hadoop versus parallel DBMSs
Slide 25- 43