An introduction to Hadoop for large scale data analysis
![Page 1: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/1.jpg)
Hadoop – Large scale data analysis
Abhijit Sharma
Page 1 | 04/10/2023
![Page 2: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/2.jpg)
Big Data Trends
Unprecedented growth in
◦ Data set size – Facebook 21+ PB data warehouse, 12+ TB/day
◦ Un(semi)-structured data – logs, documents, graphs
◦ Connected data – web, tags, graphs
Relevant to enterprises – logs, social media, machine generated data, breaking of silos
![Page 3: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/3.jpg)
Putting Big Data to work
Data driven Org – decision support, new offerings
◦ Analytics on large data sets (FB Insights – Page, App etc. stats)
◦ Data Mining – Clustering – Google News articles
◦ Search – Google
![Page 4: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/4.jpg)
Problem characteristics and examples
Embarrassingly data parallel problems
◦ Data chunked & distributed across cluster
◦ Parallel processing with data locality – task dispatched where data is
◦ Horizontal/Linear scaling approach using commodity hardware
◦ Write Once, Read Many
◦ Examples
  Distributed logs – grep, # of accesses per URL
  Search – Term Vector generation, Reverse Links
![Page 5: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/5.jpg)
What is Hadoop?
Open source system for large scale batch distributed computing on big data
◦ Map Reduce Programming Paradigm & Framework
◦ Map Reduce Infrastructure
◦ Distributed File System (HDFS)
Endorsed/used extensively by web giants – Google, FB, Yahoo!
![Page 6: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/6.jpg)
Map Reduce - Definition
Map Reduce is a programming model and an implementation for parallel processing of large data sets
Map processes each logical record per input split to generate a set of intermediate key/value pairs
Reduce merges all intermediate values associated with the same intermediate key
![Page 7: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/7.jpg)
Map Reduce - Functional Programming Origins
Map : Apply a function to each list member - Parallelizable
[1, 2, 3].collect { it * it }
Output : [1, 2, 3] -> Map (Square) : [1, 4, 9]
Reduce : Apply a function and an accumulator to each list member
[1, 2, 3].inject(0) { sum, item -> sum + item }
Output : [1, 2, 3] -> Reduce (Sum) : 6
Map & Reduce
[1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
Output : [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum) : 14
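The same three steps can be sketched in Python, assuming the built-in `map` and `functools.reduce` as stand-ins for the `collect` and `inject` calls on this slide. The variable names are illustrative only.

```python
from functools import reduce

numbers = [1, 2, 3]

# Map: apply a function to each list member independently.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the list into one value using an accumulator that starts at 0.
total = reduce(lambda acc, item: acc + item, numbers, 0)

# Map & Reduce: square each member, then sum the squares.
sum_of_squares = reduce(lambda acc, item: acc + item,
                        map(lambda x: x * x, numbers), 0)

print(squares, total, sum_of_squares)
```

Because each call to the mapped function touches only one list member, the map step can be split across workers, while the reduce step needs the mapped values brought together.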
![Page 8: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/8.jpg)
Word Count - Shell
cat * | grep | sort | uniq -c
input | map | shuffle & sort | reduce
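The stage-by-stage correspondence between the shell pipeline and Map Reduce can be sketched in a single Python process. This is only an analogy under stated assumptions: the sample lines are hypothetical, and splitting on whitespace stands in for the tokenising step.

```python
from itertools import groupby

# Hypothetical input lines standing in for `cat *`.
lines = ["the quick fox", "the lazy dog"]

# map: split every line into individual words.
words = [word for line in lines for word in line.split()]

# shuffle & sort: bring identical words next to each other, like `sort`.
words.sort()

# reduce: count each run of identical words, like `uniq -c`.
counts = [(word, sum(1 for _ in run)) for word, run in groupby(words)]

for word, count in counts:
    print(count, word)
```

Sorting before counting is what lets the last stage work on one run of equal keys at a time, which is the role shuffle & sort plays ahead of the reducers.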
![Page 9: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/9.jpg)
Word Count - Map Reduce
![Page 10: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/10.jpg)
Word Count - Pseudo code
mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1) // single count for a word e.g. (“the”, 1) for each occurrence of “the”
reducer (word, Iterator values): // Iterator for list of counts for a word e.g. (“the”, [1,1,..])
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)
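A runnable Python rendering of this pseudo code might look as follows. The `mapper` and `reducer` follow the slide; `run_word_count` is a hypothetical single-process driver that stands in for the framework's grouping of values by key, and the file names and contents are made up for illustration.

```python
from collections import defaultdict

def mapper(filename, file_contents):
    """Emit (word, 1) for every occurrence of a word."""
    for word in file_contents.split():
        yield (word, 1)

def reducer(word, values):
    """Sum the list of counts collected for one word."""
    total = 0
    for value in values:
        total += value
    yield (word, total)

def run_word_count(files):
    """Hypothetical driver: group mapper output by word, then reduce each group."""
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            grouped[word].append(count)
    results = {}
    for word in sorted(grouped):
        for key, total in reducer(word, iter(grouped[word])):
            results[key] = total
    return results

# Hypothetical input files.
files = {"a.txt": "the cat and the hat", "b.txt": "the end"}
word_counts = run_word_count(files)
print(word_counts)
```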
![Page 11: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/11.jpg)
Examples – Map Reduce Defn
Word Count / Distributed logs search for # accesses to various URLs
◦ Map – emits word/URL, 1 for each doc/log split
◦ Reduce – sums up the counts for a specific word/URL
Term Vector generation – term -> [doc-id]
◦ Map – emits term, doc-id for each doc split
◦ Reduce – Identity Reducer – accumulates the (term, [doc-id, doc-id ..])
Reverse Links – source -> target to target -> source
◦ Map – emits (target, source) for each doc split
◦ Reducer – Identity Reducer – accumulates the (target, [source, source ..])
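The term vector and reverse link examples can be sketched with one small in-memory driver and an identity reducer. Everything named here (`map_reduce`, the sample documents, the sample links) is an assumption made for illustration, not part of Hadoop.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy driver: run the mapper, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def identity_reducer(key, values):
    """Pass the accumulated values through unchanged."""
    return list(values)

def term_mapper(record):
    """Emit (term, doc-id) once per distinct term in a document."""
    doc_id, text = record
    for term in sorted(set(text.split())):
        yield (term, doc_id)

def reverse_link_mapper(record):
    """Turn a (source, target) link into (target, source)."""
    source, target = record
    yield (target, source)

# Hypothetical documents and links.
docs = [("d1", "hadoop stores data"), ("d2", "hadoop processes data")]
links = [("a.html", "b.html"), ("c.html", "b.html"), ("a.html", "c.html")]

term_vectors = map_reduce(docs, term_mapper, identity_reducer)
reverse_links = map_reduce(links, reverse_link_mapper, identity_reducer)
print(term_vectors)
print(reverse_links)
```

In both cases the grouping by key does the real work, so the reducer has nothing left to compute.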
![Page 12: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/12.jpg)
Map Reduce – Hadoop Implementation
Hides complexity of distributed computing
◦ Automatic parallelization of job
◦ Automatic data chunking & distribution (via HDFS)
◦ Data locality – MR task dispatched where data is
◦ Fault tolerant to server, storage, N/W failures
◦ Network and disk transfer optimization
◦ Load balancing
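The data locality idea can be illustrated with a toy greedy assignment. This is a sketch under loud assumptions and not Hadoop's scheduler: `assign_tasks`, the block names, node names, and slot counts are all hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that already holds the block.

    block_locations: block id -> list of nodes holding a replica (hypothetical).
    free_slots: node -> number of free task slots (hypothetical).
    Returns block id -> (node, is_local), or None when no slot is free.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignments = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node = local[0]
        else:
            # No replica holder is free: fall back to any node with a free slot.
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                assignments[block] = None
                continue
            node = candidates[0]
        slots[node] -= 1
        assignments[block] = (node, node in holders)
    return assignments

block_locations = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"]}
free_slots = {"n1": 2, "n2": 1, "n3": 0}
assignments = assign_tasks(block_locations, free_slots)
print(assignments)
```

Tasks for `b1` and `b2` run where their data already is; `b3` falls back to a remote node, so its block would have to travel over the network.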
![Page 13: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/13.jpg)
Hadoop Map Reduce Architecture
![Page 14: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/14.jpg)
HDFS Characteristics
Very large files – block size 64 MB/128 MB
Data access pattern – Write once read many
Writes are large, create & append only
Reads are large & streaming
Commodity hardware
Tolerant to failure – server, storage, network
Highly available through transparent replication
Throughput is more important than latency
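A short worked calculation shows what the block sizes mean in practice. The 1 GB file size and the replication factor of 3 are assumptions for this example; the slide itself states only the block sizes and that replication is transparent.

```python
MB = 1024 * 1024

def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed for a file; the last block may be partly filled."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

file_size = 1024 * MB  # hypothetical 1 GB file

blocks_64 = block_count(file_size, 64 * MB)
blocks_128 = block_count(file_size, 128 * MB)

replication = 3  # assumed replication factor for this example
raw_storage_mb = file_size * replication // MB

print(blocks_64, blocks_128, raw_storage_mb)
```

Larger blocks mean fewer blocks to track per file, which suits large streaming reads where throughput matters more than latency.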
![Page 15: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/15.jpg)
HDFS Architecture
![Page 16: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/16.jpg)
Thanks
![Page 17: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/17.jpg)
Backup Slides
![Page 18: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/18.jpg)
Map & Reduce Functions
![Page 19: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/19.jpg)
Job Configuration
![Page 20: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/20.jpg)
Hadoop Map Reduce Components
Job Tracker tracks MR jobs – runs on master node
Task Tracker
◦ Runs on data nodes and tracks Mapper, Reducer tasks assigned to the node
◦ Heartbeats to Job Tracker
◦ Maintains and picks up tasks from a queue
![Page 21: An introduction to Hadoop for large scale data analysis](https://reader034.vdocuments.site/reader034/viewer/2022051818/54be2d774a79598c1e8b4572/html5/thumbnails/21.jpg)
HDFS
Name Node
◦ Manages the file system namespace and regulates access to files by clients – stores meta data
◦ Mapping of blocks to Data Nodes and replicas
◦ Manage replication
◦ Executes file system namespace operations like opening, closing, and renaming files and directories.
Data Node
◦ One per node, which manages local storage attached to the node
◦ Internally, a file is split into one or more blocks and these blocks are stored in a set of Data Nodes
◦ Responsible for serving read and write requests from the file system’s clients. The Data Nodes also perform block creation, deletion, and replication upon instruction from the Name Node.
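The split of responsibilities can be sketched with a toy Name Node that keeps only metadata, that is, which blocks make up a file and which Data Nodes hold each replica. `ToyNameNode`, the round-robin placement, and the replication factor of 2 are assumptions made for illustration; real placement policies differ.

```python
class ToyNameNode:
    """Holds metadata only: the blocks of each file and where replicas live."""

    def __init__(self, data_nodes, replication=2):
        self.data_nodes = list(data_nodes)
        self.replication = replication  # assumed value for this sketch
        self.block_map = {}  # file name -> list of (block id, [data node names])

    def add_file(self, name, size, block_size):
        """Split a file into blocks and record a placement for each replica."""
        n_blocks = (size + block_size - 1) // block_size
        n_nodes = len(self.data_nodes)
        placements = []
        for i in range(n_blocks):
            # Round-robin placement: a simplification chosen for this sketch.
            replicas = [self.data_nodes[(i + r) % n_nodes]
                        for r in range(self.replication)]
            placements.append((f"{name}#blk{i}", replicas))
        self.block_map[name] = placements
        return placements

    def locate(self, name):
        """Tell a client which Data Nodes to contact for each block."""
        return self.block_map[name]

name_node = ToyNameNode(["dn1", "dn2", "dn3"])
name_node.add_file("logs.txt", size=300, block_size=128)
locations = name_node.locate("logs.txt")
print(locations)
```

A client would ask the Name Node for these locations and then read or write the block contents directly against the listed Data Nodes.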