Big Data, Map-Reduce, Hadoop


Page 1:

Big Data, Map-Reduce, Hadoop

Page 2:

Presentation Overview

What is Big Data?

What is map-reduce?

input/output data types

why is it useful and where is it used?

Execution overview

Features

fault tolerance

ordering guarantee

other perks and bonuses

Hands-on demonstration and follow-along

Map-reduce-merge

Page 3:

What is Big Data?

Large interconnected data.

Typically implies fault-tolerant and load-balanced systems.

Frequently open source.

Many smaller computers.

Clustered.

NoSQL, non-transactional, non-relational.

Read-Oriented.

Hadoop and Solr are leading players.

Page 4:

What is map-reduce?

Map-reduce is a programming model (and an associated implementation) for processing and generating large data sets.

It consists of two steps: map and reduce.

The “map” step takes a key/value pair and produces a list of intermediate key/value pairs.

The “reduce” step takes an intermediate key and the list of all values for that key, and outputs a final list of values.

Page 5:

Types

map: (k1, v1) → list(k2, v2)

reduce: (k2, list(v2)) → list(v2)
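Rendered as Java generics, those two signatures might look like the following sketch. This is for illustration only, not Hadoop's actual interface; java.util.AbstractMap.SimpleEntry serves as a stand-in key/value pair type.

import java.util.AbstractMap.SimpleEntry;
import java.util.List;

interface MapFunction<K1, V1, K2, V2> {
    // map: (k1, v1) -> list(k2, v2)
    List<SimpleEntry<K2, V2>> map(K1 key, V1 value);
}

interface ReduceFunction<K2, V2> {
    // reduce: (k2, list(v2)) -> list(v2)
    List<V2> reduce(K2 key, List<V2> values);
}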

Page 6:

Why is this useful?

Map-reduce jobs are automatically parallelized.

Partial failure of the processing cluster is expected and tolerable.

Redundancy and fault tolerance are built in, so the programmer doesn't have to worry about them.

It scales very well.

Many jobs are naturally expressible in the map/reduce paradigm.

Page 7:

What are some uses?

Word count

map: <word, 1>. reduce: <word, #>

Grep

map: <file, line>. reduce: identity

Inverted index

map: <word, docID>. reduce: <word, list(docID)> (sketched in code after this list)

Distributed sort (special case)

map: <key, record>. reduce: identity

Users: Google, Yahoo!, Amazon, Facebook, etc.
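To make one of these concrete, here is a toy, single-machine sketch of the inverted-index pattern above. The grouping that a real framework performs between the two phases is simulated in main; all names and inputs are illustrative.

import java.util.*;

public class InvertedIndex {
    // map: for each word in a document, emit a (word, docID) pair
    static List<Map.Entry<String, String>> map(String docID, String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\W+"))
            if (!word.isEmpty()) pairs.add(Map.entry(word, docID));
        return pairs;
    }
    // reduce: collapse a word's docIDs into one sorted, deduplicated posting list
    static List<String> reduce(String word, List<String> docIDs) {
        return new ArrayList<>(new TreeSet<>(docIDs));
    }
    public static void main(String[] args) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        emitted.addAll(map("doc1", "comets go"));
        emitted.addAll(map("doc2", "go comets go"));
        // Simulate the framework's group-by-key step between map and reduce.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> p : emitted)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        // prints: comets -> [doc1, doc2]  then  go -> [doc1, doc2]
        grouped.forEach((w, ids) -> System.out.println(w + " -> " + reduce(w, ids)));
    }
}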

Page 8:

Presentation Overview

What is map-reduce?

input/output data types

why is it useful and where is it used?

Execution overview

Features

fault tolerance

ordering guarantee

other perks and bonuses

Hands-on demonstration and follow-along

Map-reduce-merge

Page 9:

Execution overview: map

The user begins a map-reduce job. One of the machines becomes the master.

The input is partitioned into M splits (16-64 MB each) and distributed among the machines. A worker reads its split and begins work. Upon completion, the worker notifies the master.

The master partitions the intermediate keyspace into R pieces with a partitioning function.

Page 10:

Execution overview: reduce

When a reduce worker is notified about a job, it uses RPC to read the intermediate data from the mappers, then sorts it by key.

The reducer processes its job, then writes its output to the final output file for its reduce partition.

When all reducers are finished, the master wakes up the user program.

Page 11:

What are M and R?

M is the number of map pieces. R is the number of reduce pieces.

Ideally, M and R are much larger than the number of workers. This lets one machine perform many different tasks, which improves load balancing and speeds up recovery. (For scale, the original MapReduce paper reports typical runs with M = 200,000 and R = 5,000 on 2,000 worker machines.)

The master makes O(M+R) scheduling decisions and keeps O(M*R) pieces of state in memory.

At least R files end up being written.

Page 12:

Example: counting words

We have UTD's fight song:

C-O-M-E-T-S! Go!

Green, Orange, White!

Comets! Go!

Strong of will, we fight for right!

Let's all show our comet might!

We want to count the number of occurrences of each word.

The next slides show the map and reduce phases.

Page 13:

First stage: map

Go through the input and, for each word, emit a tuple of (<word>, 1).

Output:

<C-O-M-E-T-S!, 1>

<Go!, 1>

<Green,, 1>

<Orange,, 1>

<White!, 1>

<Comets!, 1>

<Go!, 1>

<Strong, 1>

<of, 1>

...

Page 14:

Between map and reduce...

Between the mapper and the reducer, some gears turn within Hadoop, and it groups identical keys and sorts by key before starting the reducer.

Here's the output:

<C-O-M-E-T-S!, [1]>

<Comets!, [1]>

<Go!, [1,1]>

<Green,, [1]>

<Orange,, [1]>

<Strong, [1]>

<White!, [1]>

<of, [1]>

...

Page 15:

Second stage: reducer

The reducer receives the content, one key/value-list pair at a time, and does its own processing.

For wordcount, it sums the values in each list.

Here's the output:

<C-O-M-E-T-S!, 1>

<Go!, 2>

<Green,, 1>

<Orange,, 1>

Then it writes these tuples to the final files in the HDFS.
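For reference, here is roughly what this word count looks like in Hadoop's Java API. This sketch uses the newer org.apache.hadoop.mapreduce classes rather than the older API that shipped with the 0.18 release used later in this presentation, but the logic is the same.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // map: for each word in the input line, emit <word, 1>
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }
    // reduce: sum the 1s in each word's value list and emit <word, total>
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }
}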

Page 16:

How can we improve our wordcount?

Also, any questions?

Page 17:

Presentation Overview

What is map-reduce?

input/output data types

why is it useful and where is it used?

Execution overview

Features

fault tolerance

ordering guarantee

other perks and bonuses

Hands-on demonstration and follow-along

Map-reduce-merge

Page 18:

Fault tolerance

Worker failure is expected. If a worker fails during the map phase, its tasks are reassigned to another worker. If a machine that ran map tasks fails during the reduce phase, those completed map tasks are re-executed as well, because their output lives on the failed machine's local disk.

Master failure is not expected, though checkpointing can be used for recovery.

If a particular record causes the mapper or reducer to reliably crash, the map-reduce system can figure this out, skip the record, and proceed.

Page 19:

Ordering guarantee

The implementation of map-reduce guarantees that within a given partition, the intermediate key/value pairs are processed in increasing key order.

This means that each reduce partition ends up with an output file sorted by key.

Page 20:

Partitioning function

By default, your reduce tasks will be distributed evenly by using a hash(intermediate key) mod R function.

You can specify a custom partitioning function.

Useful for locality reasons, such as if the key is a URL and you want all URLs belonging to a single host to be processed on a single machine.
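In Hadoop, this is done by subclassing Partitioner. A minimal sketch of the URL-by-host idea follows; the class name and the assumption that keys are Text-encoded URLs are illustrative.

import java.net.URI;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HostPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text url, Text value, int numPartitions) {
        String host;
        try {
            host = new URI(url.toString()).getHost();
        } catch (Exception e) {
            host = null;
        }
        if (host == null) host = url.toString(); // fall back to the raw key
        // Same shape as the default hash-mod-R partitioner, but hashing only
        // the host, so every URL from one host lands in the same partition.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}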

Page 21:

Combiner function

After a map phase, the mapper transmits its entire intermediate data file over the network to the reducers.

Often this file contains significant repetition, such as thousands of <the, 1> pairs in a word count.

The user can specify a combiner function. It's just like a reduce function, except it's run by the mapper before the data is handed off to the reducer.
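In a Hadoop driver, the combiner is registered on the Job. A sketch, reusing the classes from the word-count sketch earlier; reusing the sum reducer as the combiner is safe here because addition is associative and commutative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // The combiner runs on each mapper's local output before the shuffle,
        // shrinking the data shipped to the reducers.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}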

Page 22:

Counters

A counter can be associated with any action that a mapper or a reducer does. This is in addition to default counters such as the number of input and output key/value pairs processed.

A user can watch the counters in real time to see the progress of a job.

When the map/reduce job finishes, these counters are provided to the user program.
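In Hadoop, a user-defined counter is typically an enum value incremented through the task context; the framework aggregates the counts across all tasks alongside the built-in counters. A sketch (the enum name and the empty-line check are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<Object, Text, Text, IntWritable> {
    public enum MyCounters { EMPTY_LINES }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Count an event of interest; Hadoop sums this across all map tasks.
        if (value.toString().trim().isEmpty())
            context.getCounter(MyCounters.EMPTY_LINES).increment(1);
        // ... normal map logic would go here ...
    }
}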

Page 23:

Presentation Overview

What is map-reduce?

input/output data types

why is it useful and where is it used?

Execution overview

Features

fault tolerance

ordering guarantee

other perks and bonuses

Hands-on demonstration and follow-along

Map-reduce-merge

Page 24:

What is Hadoop?

Hadoop is the implementation of the map/reduce design that we will use.

Hadoop is released under the Apache License 2.0, so it's open source.

Hadoop uses the Hadoop Distributed File System, HDFS. (In contrast to what we've seen with Lucene.)

Get the release from:

http://hadoop.apache.org/core/

Page 25:

Preparing Hadoop on your system

Configure passwordless public-key SSH on localhost

Configure Hadoop:

look at the two configuration files at http://utdallas.edu/~pmw033000/hadoop/

Format the HDFS:

bin/hadoop namenode -format

Start Hadoop:

cd <hadoop-dir>

bin/start-all.sh (and wait ≈20 seconds)

Page 26:

Example: grep

Standard Unix 'grep' behavior: run it on the command line with the search string as the first argument and the list of files or directories as the subsequent argument(s).

$ grep HelloWorld file1.c file2.c file3.c

file2.c:System.out.println("I say HelloWorld!");

$

Page 27:

Preparing for 'grep' in Hadoop

Hadoop's jobs always operate within the HDFS.

Hadoop will read its input from HDFS, and will write its output to HDFS.

Thus, to prepare:

Download a free electronic book:

http://utdallas.edu/~pmw033000/hadoop/book.txt

Load the file into HDFS:

bin/hadoop fs -copyFromLocal book.txt /book.txt

Page 28:

Using 'grep' within Hadoop

bin/hadoop jar hadoop-0.18.2-examples.jar \
  grep /book.txt /grep-result "search string"

bin/hadoop fs -ls /grep-result

bin/hadoop fs -cat /grep-result/part-00000

A good string to try: "Horace de \S+"

Between job runs: bin/hadoop fs -rmr /grep-result

Page 29:

How 'grep' in Hadoop works

The program runs two map/reduce jobs in sequence. The first job counts how many times a matching string occurred and the second job sorts matching strings by their frequency and stores the output in a single output file.

Each mapper of the first job takes a line as input and matches the user-provided regular expression against the line. It extracts all matching strings and emits (matching string, 1) pairs. Each reducer sums the frequencies of each matching string. The output is sequence files containing the matching string and count. The reduce phase is optimized by running a combiner that sums the frequency of strings from local map output. As a result it reduces the amount of data that needs to be shipped to a reduce task.

The second job takes the output of the first job as input. The mapper is an inverse map (it swaps key and value), while the reducer is an identity reducer. The number of reducers is one, so the output is stored in a single file, sorted by count in descending order. The output file is text, each line of which contains a count and a matching string.
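A sketch of how the first job's mapper could be written. In the real example the pattern comes from the job configuration via a generic regex mapper; it is hard-coded here to keep the sketch self-contained.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepMapper extends Mapper<Object, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Pattern pattern = Pattern.compile("Horace de \\S+");

    public void map(Object key, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (matching string, 1) for every match in the line;
        // the reducer (plus a combiner) sums these into frequencies.
        Matcher m = pattern.matcher(line.toString());
        while (m.find())
            context.write(new Text(m.group()), ONE);
    }
}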

Page 30:

Another example: word count

bin/hadoop jar hadoop-0.18.2-examples.jar \
  wordcount /book.txt /wc-result

bin/hadoop fs -cat /wc-result/part-00000 | sort -n -k 2

You can also try passing a "-r #" option to increase the number of parallel reducers.

Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.

As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network by combining each word into a single record.

Page 31:

Presentation Overview

What is map-reduce?

input/output data types

why is it useful and where is it used?

Execution overview

Features

fault tolerance

ordering guarantee

other perks and bonuses

Hands-on demonstration and follow-along

Page 32:

Does map-reduce satisfy all needs?

Map-reduce is great for homogeneous data, such as grepping a large collection of files or word-counting a huge document.

Joining heterogeneous databases does not work well.

As is, we'd need additional map-reduce steps, such as map-reducing one database and reading from the others on the fly.

We want to support relational algebra.

Page 33:

Solution

The solution to these problems is map-reduce-merge: map-reduce with an additional merge step.

The merge phase makes it easier to process data relationships among heterogeneous data sets.

Types:

map: (k1, v1)α → [(k2, v2)]α

reduce: (k2, [v2])α → (k2, [v3])α (notice that the output [v3] is a list)

merge: ((k2, [v3])α, (k3, [v4])β) → (k4, v5)γ

Here α, β, and γ denote the lineage (source data set) of each key/value pair.

If α = β, then the merge step performs a self-merge (a self-join in relational algebra).

Page 34:

New terms

Partition selector: determines which data partitions produced by reducers should be retrieved for merging.

Processor: user-defined logic for processing data from an individual source.

Merger: user-defined logic for processing data merged from two sources when it satisfies a merge condition.

Configurable iterator: next slide.

Page 35:

Configurable iterators

The map and reduce user-defined functions get one iterator for the values.

The merge function gets two iterators, one for each data source.

The iterators do not have to simply move forward; they can be configured to move however the user's merge logic requires.

Relational join algorithms correspond to specific iteration patterns in the merging step.
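To make this concrete, here is a minimal single-machine sketch of a sort-merge equi-join, the kind of iterator pattern the merge step uses. Each source's reduced, key-sorted output is consumed through its own iterator; the record types, names, and merger logic are illustrative, and keys are assumed unique within each source.

import java.util.*;

public class MergeJoin {
    // merge: walk two key-sorted streams in lock-step, emitting a merged
    // record whenever the keys match (the merge condition).
    static List<String> merge(Iterator<Map.Entry<Integer, String>> alpha,
                              Iterator<Map.Entry<Integer, String>> beta) {
        List<String> out = new ArrayList<>();
        Map.Entry<Integer, String> a = alpha.hasNext() ? alpha.next() : null;
        Map.Entry<Integer, String> b = beta.hasNext() ? beta.next() : null;
        while (a != null && b != null) {
            int cmp = a.getKey().compareTo(b.getKey());
            if (cmp == 0) {
                out.add(a.getValue() + "|" + b.getValue());
                a = alpha.hasNext() ? alpha.next() : null;
                b = beta.hasNext() ? beta.next() : null;
            } else if (cmp < 0) {
                a = alpha.hasNext() ? alpha.next() : null; // advance the lagging side
            } else {
                b = beta.hasNext() ? beta.next() : null;
            }
        }
        return out;
    }
    public static void main(String[] args) {
        TreeMap<Integer, String> emps  = new TreeMap<>(Map.of(1, "Emp:Ann", 2, "Emp:Bob"));
        TreeMap<Integer, String> depts = new TreeMap<>(Map.of(2, "Dept:IT", 3, "Dept:HR"));
        // prints: [Emp:Bob|Dept:IT]
        System.out.println(merge(emps.entrySet().iterator(), depts.entrySet().iterator()));
    }
}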