An Introduction to Hadoop for Large Scale Data Analysis

Hadoop – Large scale data analysis

Abhijit Sharma

| 04/10/2023

Unprecedented growth in
◦ Data set size – Facebook: 21+ PB data warehouse, 12+ TB/day
◦ Unstructured and semi-structured data – logs, documents, graphs
◦ Connected data – web, tags, graphs

Relevant to enterprises – logs, social media, machine-generated data, breaking of silos


Big Data Trends


Putting Big Data to work

Data driven Org – decision support, new offerings
◦ Analytics on large data sets (FB Insights – Page, App etc. stats)
◦ Data Mining – Clustering – Google News articles
◦ Search – Google

Embarrassingly data parallel problems
◦ Data chunked & distributed across cluster
◦ Parallel processing with data locality – task dispatched where data is
◦ Horizontal/Linear scaling approach using commodity hardware
◦ Write Once, Read Many
◦ Examples
  Distributed logs – grep, # of accesses per URL
  Search – Term Vector generation, Reverse Links


Problem characteristics and examples
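The "chunk the data, process each chunk independently" idea behind the distributed grep example can be sketched on a single machine. This is a minimal illustration, not how Hadoop itself runs: threads stand in for cluster nodes, and the log lines, function names, and chunk size are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(lines, chunk_size):
    """Split a list of lines into fixed-size chunks (a stand-in for input splits)."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def grep_chunk(lines, pattern):
    """Process one chunk independently: keep the lines containing the pattern."""
    return [line for line in lines if pattern in line]


def parallel_grep(lines, pattern, chunk_size=2, workers=4):
    """Run grep_chunk on every chunk in parallel and join the results in chunk order."""
    chunks = chunk(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: grep_chunk(c, pattern), chunks)
        return [line for part in results for line in part]


# Hypothetical log lines, made up for illustration.
LOG = [
    "GET /index.html 200",
    "GET /about.html 404",
    "POST /login 200",
    "GET /index.html 200",
    "GET /missing 404",
]

if __name__ == "__main__":
    print(parallel_grep(LOG, "404"))
```

Because no chunk needs anything from another chunk, adding workers (or, on a cluster, nodes) is the only change needed to process more data, which is what makes the problem "embarrassingly" parallel.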

Open source system for large scale batch distributed computing on big data
◦ Map Reduce Programming Paradigm & Framework
◦ Map Reduce Infrastructure
◦ Distributed File System (HDFS)

Endorsed/used extensively by web giants – Google, FB, Yahoo!


What is Hadoop?

Map Reduce is a programming model and an implementation for parallel processing of large data sets

Map processes each logical record per input split to generate a set of intermediate key/value pairs

Reduce merges all intermediate values associated with the same intermediate key


Map Reduce - Definition
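The definition above can be read as three phases: map each record of each input split, group the intermediate pairs by key, and reduce each group. The following is a rough in-memory sketch of that flow, not the Hadoop API; `run_map_reduce`, the log format, and the sample splits are assumptions chosen for illustration.

```python
from collections import defaultdict


def run_map_reduce(splits, mapper, reducer):
    """Run a map phase per split, group by key, then reduce each key."""
    # Map phase: each record of each input split yields (key, value) pairs.
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(mapper(record))

    # Shuffle phase: collect all values that share the same intermediate key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: merge the values of each key into one result.
    return {key: reducer(key, values) for key, values in grouped.items()}


def url_mapper(log_line):
    """Emit (url, 1) for one log record of the assumed form 'METHOD URL STATUS'."""
    _method, url, _status = log_line.split()
    return [(url, 1)]


def count_reducer(_url, counts):
    """Sum the counts emitted for one URL."""
    return sum(counts)


# Two hypothetical input splits of an access log.
SPLITS = [
    ["GET /index.html 200", "GET /about.html 200"],
    ["GET /index.html 200", "POST /login 302"],
]

if __name__ == "__main__":
    print(run_map_reduce(SPLITS, url_mapper, count_reducer))
```

Only `url_mapper` and `count_reducer` are specific to the problem; the grouping step in the middle is the part a framework would normally provide.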

Map : Apply a function to each list member - Parallelizable
[1, 2, 3].collect { it * it }
Output : [1, 2, 3] -> Map (Square) : [1, 4, 9]

Reduce : Apply a function and an accumulator to each list member
[1, 2, 3].inject(0) { sum, item -> sum + item }
Output : [1, 2, 3] -> Reduce (Sum) : 6

Map & Reduce
[1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
Output : [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum) : 14


Map Reduce - Functional Programming Origins
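For readers more familiar with Python, the same three expressions can be written with the built-in `map` and `functools.reduce`. This is only a restatement of the slide's example in another language; the variable names are chosen here.

```python
from functools import reduce

numbers = [1, 2, 3]

# Map: apply a function to each list member (each call is independent).
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the list into one value with an accumulator that starts at 0.
total = reduce(lambda acc, item: acc + item, numbers, 0)

# Map & Reduce: square each member, then sum the squares.
sum_of_squares = reduce(lambda acc, item: acc + item, map(lambda x: x * x, numbers), 0)
```

The map step can be split across workers because no call depends on another; the reduce step carries an accumulator, so it combines results rather than producing them independently.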


Word Count - Map Reduce

mapper (filename, file-contents):
    for each word in file-contents:
        emit (word, 1)    // single count for a word, e.g. ("the", 1) for each occurrence of "the"

reducer (word, values):    // values is an iterator over the list of counts for a word, e.g. ("the", [1, 1, ..])
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)


Word Count - Pseudo code
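The pseudo code can be turned into a small runnable sketch. It keeps the mapper and reducer shapes from the slide and adds a plain in-memory grouping step where Hadoop would shuffle data between nodes. Splitting on whitespace and lowercasing words are assumptions made here, and the sample documents are invented.

```python
from collections import defaultdict


def mapper(filename, file_contents):
    """Emit (word, 1) for each occurrence of a word in the file contents."""
    for word in file_contents.split():
        # Lowercasing is an assumption so that "The" and "the" count together.
        yield (word.lower(), 1)


def reducer(word, values):
    """Sum the list of counts collected for one word."""
    return (word, sum(values))


def word_count(files):
    """Run mapper over every file, group the counts by word, then reduce."""
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())


# Hypothetical documents, made up for illustration.
FILES = {
    "a.txt": "the quick brown fox",
    "b.txt": "The lazy dog and the fox",
}

if __name__ == "__main__":
    print(word_count(FILES))
```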

Word Count / Distributed logs search for # accesses to various URLs
◦ Map – emits (word/URL, 1) for each doc/log split
◦ Reduce – sums up the counts for a specific word/URL

Term Vector generation – term -> [doc-id]
◦ Map – emits (term, doc-id) for each doc split
◦ Reduce – Identity Reducer – accumulates the (term, [doc-id, doc-id ..])

Reverse Links – source -> target to target -> source
◦ Map – emits (target, source) for each doc split
◦ Reduce – Identity Reducer – accumulates the (target, [source, source ..])


Examples – Map Reduce Defn
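The term vector and reverse link examples share one shape: the map step emits pairs, and the grouping by key does the rest, so the reducer can pass the grouped list through unchanged. A minimal sketch under assumed inputs follows; the documents, link graph, and function names are invented for illustration.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group emitted values by key; an identity reducer returns each list unchanged."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def term_vector_map(doc_id, text):
    """Emit (term, doc_id) once per distinct term in the document."""
    return [(term, doc_id) for term in sorted(set(text.split()))]


def reverse_links_map(source, targets):
    """Emit (target, source) for every outgoing link of a page."""
    return [(target, source) for target in targets]


# Hypothetical inputs, made up for illustration.
DOCS = {"d1": "hadoop map reduce", "d2": "hadoop hdfs"}
LINKS = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}

term_vectors = shuffle(
    pair for doc_id, text in DOCS.items() for pair in term_vector_map(doc_id, text)
)
reverse_links = shuffle(
    pair for source, targets in LINKS.items() for pair in reverse_links_map(source, targets)
)
```

With these inputs, `term_vectors["hadoop"]` lists both documents and `reverse_links["c.html"]` lists both pages that link to it.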

Hides complexity of distributed computing
◦ Automatic parallelization of job
◦ Automatic data chunking & distribution (via HDFS)
◦ Data locality – MR task dispatched where data is
◦ Fault tolerant to server, storage, N/W failures
◦ Network and disk transfer optimization
◦ Load balancing


Map Reduce – Hadoop Implementation
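To make "task dispatched where data is" concrete, here is a deliberately simple greedy sketch of locality-aware assignment. It is not Hadoop's scheduler: the function name, the slot model, and the cluster state are all assumptions for illustration.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that stores the block.

    block_locations: block id -> list of nodes holding a replica.
    free_slots: node -> number of tasks the node can still accept.
    Returns block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is unchanged
    assignment = {}
    for block, holders in block_locations.items():
        # Prefer a replica holder that still has a free slot (data-local task).
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Otherwise fall back to the node with the most free slots,
            # which means the block has to be read over the network.
            node, is_local = max(slots, key=slots.get), False
            if slots[node] == 0:
                raise RuntimeError("no free task slots left")
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# Hypothetical cluster state, made up for illustration.
BLOCKS = {"blk_1": ["node_a", "node_b"], "blk_2": ["node_a"], "blk_3": ["node_a"]}
SLOTS = {"node_a": 2, "node_b": 1, "node_c": 1}

if __name__ == "__main__":
    print(assign_tasks(BLOCKS, SLOTS))
```

In this example the third block ends up on a node without a local copy because the greedy pass already used both slots on `node_a`; a smarter policy could have placed `blk_1` on `node_b` instead.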


Hadoop Map Reduce Architecture

Very large files – block size 64 MB/128 MB
Data access pattern – Write once read many
Writes are large, create & append only
Reads are large & streaming
Commodity hardware
Tolerant to failure – server, storage, network
Highly available through transparent replication
Throughput is more important than latency


HDFS Characteristics
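The block sizes above translate into simple arithmetic: a file occupies as many blocks as it takes to cover its size, and replication multiplies the raw storage used. A small worked sketch follows; the 1 GB file and the replication factor of 3 are assumptions for the example, since the slide states neither.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file; the last block may be partly filled."""
    return -(-file_size_bytes // block_size_bytes)  # integer ceiling division


def raw_storage_bytes(file_size_bytes, replication):
    """Bytes stored across the cluster when every block is kept `replication` times."""
    return file_size_bytes * replication


one_gb = 1024 * MB
blocks_64 = block_count(one_gb, 64 * MB)    # 16 blocks
blocks_128 = block_count(one_gb, 128 * MB)  # 8 blocks

# A replication factor of 3 is an assumption here; the slide does not state one.
stored = raw_storage_bytes(one_gb, 3)
```

Larger blocks mean fewer block entries to track per file, which fits the emphasis on very large files and streaming reads.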


HDFS Architecture

Thanks


Backup Slides


Map & Reduce Functions


Job Configuration

Job Tracker tracks MR jobs – runs on master node

Task Tracker
◦ Runs on data nodes and tracks Mapper, Reducer tasks assigned to the node
◦ Heartbeats to Job Tracker
◦ Maintains and picks up tasks from a queue


Hadoop Map Reduce Components
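The interplay of heartbeats and task queues can be imitated with a toy single-process simulation. This is only a sketch of the roles described above, not Hadoop's implementation: the class and method names are invented, tasks "run" instantly, and task names are assumed to be unique.

```python
from collections import deque


class TaskTracker:
    """Toy task tracker: holds a queue of assigned tasks and reports via heartbeats."""

    def __init__(self, node):
        self.node = node
        self.queue = deque()
        self.completed = []

    def run_next(self):
        """Pick up the next task from the queue and mark it as completed."""
        if self.queue:
            self.completed.append(self.queue.popleft())

    def heartbeat(self):
        """Status report sent to the job tracker."""
        return {"node": self.node, "queued": len(self.queue), "completed": list(self.completed)}


class JobTracker:
    """Toy job tracker: hands out pending tasks in response to heartbeats."""

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.done = set()

    def on_heartbeat(self, tracker):
        status = tracker.heartbeat()
        self.done.update(status["completed"])
        # Give an idle tracker one more task, if any are left.
        if status["queued"] == 0 and self.pending:
            tracker.queue.append(self.pending.popleft())


def run_job(tasks, nodes):
    """Alternate heartbeats and task execution until every task is reported done.

    Assumes task names are unique, since completion is tracked in a set.
    """
    job_tracker = JobTracker(tasks)
    trackers = [TaskTracker(node) for node in nodes]
    while len(job_tracker.done) < len(tasks):
        for tracker in trackers:
            job_tracker.on_heartbeat(tracker)
            tracker.run_next()
    return {t.node: t.completed for t in trackers}


if __name__ == "__main__":
    print(run_job(["map_0", "map_1", "map_2", "reduce_0"], ["node_a", "node_b"]))
```

The job tracker learns about finished work only through the next heartbeat, which is why the loop needs one extra round after the last task has run.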

Name Node
◦ Manages the file system namespace and regulates access to files by clients – stores meta data
◦ Mapping of blocks to Data Nodes and replicas
◦ Manages replication
◦ Executes file system namespace operations like opening, closing, and renaming files and directories

Data Node
◦ One per node, which manages local storage attached to the node
◦ Internally, a file is split into one or more blocks and these blocks are stored in a set of Data Nodes
◦ Responsible for serving read and write requests from the file system's clients. The Data Nodes also perform block creation, deletion, and replication upon instruction from the Name Node.


HDFS
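The division of labour between the Name Node (metadata only) and the Data Nodes (block storage) can be sketched with plain dictionaries. This is a toy model under stated assumptions: round-robin replica placement is a simplification rather than the actual HDFS policy, the replication factor must not exceed the number of nodes, and every name below is invented for illustration.

```python
class NameNode:
    """Toy name node: stores only metadata, never file contents."""

    def __init__(self, node_names, replication):
        self.node_names = sorted(node_names)
        self.replication = replication
        self.block_map = {}  # file name -> list of (block id, [data node names])

    def allocate(self, name, block_count):
        """Choose data nodes for every block of a new file and record the mapping."""
        entries = []
        for index in range(block_count):
            # Round-robin placement is a simplification, not the actual HDFS policy.
            targets = [self.node_names[(index + r) % len(self.node_names)]
                       for r in range(self.replication)]
            entries.append((f"{name}#{index}", targets))
        self.block_map[name] = entries
        return entries

    def locate(self, name):
        """Return the block ids of a file and the data nodes holding each one."""
        return self.block_map[name]


def write_file(name_node, data_nodes, name, data, block_size):
    """Client write: split into blocks, ask the name node where they go, store replicas."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for (block_id, targets), block in zip(name_node.allocate(name, len(blocks)), blocks):
        for node in targets:
            data_nodes[node][block_id] = block


def read_file(name_node, data_nodes, name, failed=()):
    """Client read: fetch each block from the first replica holder that is not failed."""
    parts = []
    for block_id, targets in name_node.locate(name):
        node = next(n for n in targets if n not in failed)
        parts.append(data_nodes[node][block_id])
    return b"".join(parts)


# Hypothetical three-node cluster; each data node is modelled as a dict of blocks.
DATA_NODES = {"dn1": {}, "dn2": {}, "dn3": {}}
name_node = NameNode(DATA_NODES, replication=2)
write_file(name_node, DATA_NODES, "log.txt", b"abcdefghij", block_size=4)
```

Reading with `failed={"dn2"}` still returns the full file, because every block has a second replica on another node; that is the sense in which replication keeps data available when a node is lost.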
