Big Data, Map-Reduce, Hadoop


Page 1:

Big Data, Map-Reduce, Hadoop

Page 2:

Presentation Overview

What is Big Data?

What is map-reduce?

input/output data types

why is it useful and where is it used?

Execution overview

Features

fault tolerance

ordering guarantee

other perks and bonuses

Hands-on demonstration and follow-along

Map-reduce-merge

Page 3:

What is Big Data?

Large interconnected data.

Typically implies fault-tolerant and load-balanced systems.

Frequently open source.

Many smaller computers.

Clustered.

NoSQL, non-transactional, non-relational.

Read-Oriented.

Hadoop and Solr are leading players.

Page 4:

What is map-reduce?

Map-reduce is a programming model (and an associated implementation) for processing and generating large data sets.

It consists of two steps: map and reduce.

The “map” step takes a key/value pair and produces a list of intermediate key/value pairs.

The “reduce” step takes an intermediate key and the list of all values for that key, and outputs a final list of values.

Page 5:

Types

map: (k1, v1) → list(k2, v2)

reduce: (k2, list(v2)) → list(v2)
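Rendered as Java generics, those two signatures might look like the following sketch. This is for illustration only, not Hadoop's actual interface; java.util.AbstractMap.SimpleEntry serves as a stand-in key/value pair type.

import java.util.AbstractMap.SimpleEntry;
import java.util.List;

interface MapFunction<K1, V1, K2, V2> {
    // map: (k1, v1) -> list(k2, v2)
    List<SimpleEntry<K2, V2>> map(K1 key, V1 value);
}

interface ReduceFunction<K2, V2> {
    // reduce: (k2, list(v2)) -> list(v2)
    List<V2> reduce(K2 key, List<V2> values);
}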

Page 6:

Why is this useful?

Map-reduce jobs are automatically parallelized.

Partial failure of the processing cluster is expected and tolerable.

Redundancy and fault tolerance are built in, so the programmer doesn't have to worry about them.

It scales very well.

Many jobs are naturally expressible in the map/reduce paradigm.

Page 7:

What are some uses?

Word count

map: <word, 1>. reduce: <word, #>

Grep

map: <file, line>. reduce: identity

Inverted index

map: <word, docID>. reduce: <word, list(docID)> (sketched in code after this list)

Distributed sort (special case)

map: <key, record>. reduce: identity

Users: Google, Yahoo!, Amazon, Facebook, etc.
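To make one of these concrete, here is a toy, single-machine sketch of the inverted-index pattern above. The grouping that a real framework performs between the two phases is simulated in main; all names and inputs are illustrative.

import java.util.*;

public class InvertedIndex {
    // map: for each word in a document, emit a (word, docID) pair
    static List<Map.Entry<String, String>> map(String docID, String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\W+"))
            if (!word.isEmpty()) pairs.add(Map.entry(word, docID));
        return pairs;
    }
    // reduce: collapse a word's docIDs into one sorted, deduplicated posting list
    static List<String> reduce(String word, List<String> docIDs) {
        return new ArrayList<>(new TreeSet<>(docIDs));
    }
    public static void main(String[] args) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        emitted.addAll(map("doc1", "comets go"));
        emitted.addAll(map("doc2", "go comets go"));
        // Simulate the framework's group-by-key step between map and reduce.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> p : emitted)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        // prints: comets -> [doc1, doc2]  then  go -> [doc1, doc2]
        grouped.forEach((w, ids) -> System.out.println(w + " -> " + reduce(w, ids)));
    }
}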

Page 8:

Presentation Overview

What is map-reduce?

input/output data types

why is it useful and where is it used?

Execution overview

Features

fault tolerance

ordering guarantee

other perks and bonuses

Hands-on demonstration and follow-along

Map-reduce-merge

Page 9:

Execution overview: map

The user begins a map-reduce job. One of the machines becomes the master.

The input is partitioned into M splits (16-64 MB each) and distributed among the machines. A worker reads its split and begins work. Upon completion, the worker notifies the master.

The master partitions the intermediate keyspace into R pieces with a partitioning function.

Page 10:

Execution overview: reduce

When a reduce worker is notified about a job, it uses RPC to read the intermediate data from the mappers, then sorts it by key.

The reducer processes its job, then writes its output to the final output file for its reduce partition.

When all reducers are finished, the master wakes up the user program.

Page 11:

What are M and R?

M is the number of map pieces. R is the number of reduce pieces.

Ideally, M and R are much larger than the number of workers. This lets one machine perform many different tasks, which improves load balancing and speeds up recovery. (For scale, the original MapReduce paper reports typical runs with M = 200,000 and R = 5,000 on 2,000 worker machines.)

The master makes O(M+R) scheduling decisions and keeps O(M*R) pieces of state in memory.

At least R files end up being written.

Page 12:

Example: counting words

We have UTD's fight song:

C-O-M-E-T-S! Go!

Green, Orange, White!

Comets! Go!

Strong of will, we fight for right!

Let's all show our comet might!

We want to count the number of occurrences of each word.

The next slides show the map and reduce phases.

Page 13:

First stage: map

Go through the input and, for each word, emit a tuple of (<word>, 1).

Output:

<C-O-M-E-T-S!, 1>

<Go!, 1>

<Green,, 1>

<Orange,, 1>

<White!, 1>

<Comets!, 1>

<Go!, 1>

<Strong, 1>

<of, 1>

...

Page 14:

Between map and reduce...

Between the mapper and the reducer, some gears turn within Hadoop, and it groups identical keys and sorts by key before starting the reducer.

Here's the output:

<C-O-M-E-T-S!, [1]>

<Comets!, [1]>

<Go!, [1,1]>

<Green,, [1]>

<Orange,, [1]>

<Strong, [1]>

<White!, [1]>

<of, [1]>

...

Page 15:

Second stage: reducer

The reducer receives the content, one key/value-list pair at a time, and does its own processing.

For wordcount, it sums the values in each list.

Here's the output:

<C-O-M-E-T-S!, 1>

<Go!, 2>

<Green,, 1>

<Orange,, 1>

Then it writes these tuples to the final files in the HDFS.
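For reference, here is roughly what this word count looks like in Hadoop's Java API. This sketch uses the newer org.apache.hadoop.mapreduce classes rather than the older API that shipped with the 0.18 release used later in this presentation, but the logic is the same.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // map: for each word in the input line, emit <word, 1>
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }
    // reduce: sum the 1s in each word's value list and emit <word, total>
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }
}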

Page 16:

How can we improve our wordcount?

Also, any questions?

Page 17:

Presentation Overview

What is map-reduce?

input/output data types

why is it useful and where is it used?

Execution overview

Features

fault tolerance

ordering guarantee

other perks and bonuses

Hands-on demonstration and follow-along

Map-reduce-merge

Page 18:

Fault tolerance

Worker failure is expected. If a worker fails during the map phase, its tasks are reassigned to another worker. If a machine that ran map tasks fails during the reduce phase, those completed map tasks are re-executed as well, because their output lives on the failed machine's local disk.

Master failure is not expected, though checkpointing can be used for recovery.

If a particular record causes the mapper or reducer to reliably crash, the map-reduce system can figure this out, skip the record, and proceed.

Page 19:

Ordering guarantee

The implementation of map-reduce guarantees that within a given partition, the intermediate key/value pairs are processed in increasing key order.

This means that each reduce partition ends up with an output file sorted by key.

Page 20:

Partitioning function

By default, your reduce tasks will be distributed evenly by using a hash(intermediate key) mod R function.

You can specify a custom partitioning function.

Useful for locality reasons, such as if the key is a URL and you want all URLs belonging to a single host to be processed on a single machine.
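In Hadoop, this is done by subclassing Partitioner. A minimal sketch of the URL-by-host idea follows; the class name and the assumption that keys are Text-encoded URLs are illustrative.

import java.net.URI;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HostPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text url, Text value, int numPartitions) {
        String host;
        try {
            host = new URI(url.toString()).getHost();
        } catch (Exception e) {
            host = null;
        }
        if (host == null) host = url.toString(); // fall back to the raw key
        // Same shape as the default hash-mod-R partitioner, but hashing only
        // the host, so every URL from one host lands in the same partition.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}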

Page 21:

Combiner function

After a map phase, the mapper transmits its entire intermediate data file over the network to the reducers.

Often this file contains significant repetition, such as thousands of <the, 1> pairs in a word count.

The user can specify a combiner function. It's just like a reduce function, except it's run by the mapper before the data is handed off to the reducer.
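In a Hadoop driver, the combiner is registered on the Job. A sketch, reusing the classes from the word-count sketch earlier; reusing the sum reducer as the combiner is safe here because addition is associative and commutative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // The combiner runs on each mapper's local output before the shuffle,
        // shrinking the data shipped to the reducers.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}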

Page 22:

Counters

A counter can be associated with any action that a mapper or a reducer does. This is in addition to default counters such as the number of input and output key/value pairs processed.

A user can watch the counters in real time to see the progress of a job.

When the map/reduce job finishes, these counters are provided to the user program.
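In Hadoop, a user-defined counter is typically an enum value incremented through the task context; the framework aggregates the counts across all tasks alongside the built-in counters. A sketch (the enum name and the empty-line check are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<Object, Text, Text, IntWritable> {
    public enum MyCounters { EMPTY_LINES }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Count an event of interest; Hadoop sums this across all map tasks.
        if (value.toString().trim().isEmpty())
            context.getCounter(MyCounters.EMPTY_LINES).increment(1);
        // ... normal map logic would go here ...
    }
}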

Page 23:

Presentation Overview

What is map-reduce?

input/output data types

why is it useful and where is it used?

Execution overview

Features

fault tolerance

ordering guarantee

other perks and bonuses

Hands-on demonstration and follow-along

Map-reduce-merge

Page 24:

What is Hadoop?

Hadoop is the implementation of the map/reduce design that we will use.

Hadoop is released under the Apache License 2.0, so it's open source.

Hadoop uses the Hadoop Distributed File System, HDFS. (In contrast to what we've seen with Lucene.)

Get the release from:

http://hadoop.apache.org/core/

Page 25:

Preparing Hadoop on your system

Configure passwordless public-key SSH on localhost

Configure Hadoop:

look at the two configuration files at http://utdallas.edu/~pmw033000/hadoop/

Format the HDFS:

bin/hadoop namenode -format

Start Hadoop:

cd <hadoop-dir>

bin/start-all.sh (and wait ≈20 seconds)

Page 26:

Example: grep

Standard Unix 'grep' behavior: run it on the command line with the search string as the first argument and the list of files or directories as the subsequent argument(s).

$ grep HelloWorld file1.c file2.c file3.c

file2.c:System.out.println("I say HelloWorld!");

$

Page 27:

Preparing for 'grep' in Hadoop

Hadoop's jobs always operate within the HDFS.

Hadoop will read its input from HDFS, and will write its output to HDFS.

Thus, to prepare:

Download a free electronic book:

http://utdallas.edu/~pmw033000/hadoop/book.txt

Load the file into HDFS:

bin/hadoop fs -copyFromLocal book.txt /book.txt

Page 28:

Using 'grep' within Hadoop

bin/hadoop jar hadoop-0.18.2-examples.jar \
  grep /book.txt /grep-result "search string"

bin/hadoop fs -ls /grep-result

bin/hadoop fs -cat /grep-result/part-00000

A good string to try: "Horace de \S+"

Between job runs: bin/hadoop fs -rmr /grep-result

Page 29:

How 'grep' in Hadoop works

The program runs two map/reduce jobs in sequence. The first job counts how many times a matching string occurred and the second job sorts matching strings by their frequency and stores the output in a single output file.

Each mapper of the first job takes a line as input and matches the user-provided regular expression against the line. It extracts all matching strings and emits (matching string, 1) pairs. Each reducer sums the frequencies of each matching string. The output is sequence files containing the matching string and count. The reduce phase is optimized by running a combiner that sums the frequency of strings from local map output. As a result it reduces the amount of data that needs to be shipped to a reduce task.

The second job takes the output of the first job as input. The mapper is an inverse map (it swaps key and value), while the reducer is an identity reducer. The number of reducers is one, so the output is stored in a single file, sorted by count in descending order. The output file is text, each line of which contains a count and a matching string.
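A sketch of how the first job's mapper could be written. In the real example the pattern comes from the job configuration via a generic regex mapper; it is hard-coded here to keep the sketch self-contained.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepMapper extends Mapper<Object, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Pattern pattern = Pattern.compile("Horace de \\S+");

    public void map(Object key, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (matching string, 1) for every match in the line;
        // the reducer (plus a combiner) sums these into frequencies.
        Matcher m = pattern.matcher(line.toString());
        while (m.find())
            context.write(new Text(m.group()), ONE);
    }
}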

Page 30:

Another example: word count

bin/hadoop jar hadoop-0.18.2-examples.jar \
  wordcount /book.txt /wc-result

bin/hadoop fs -cat /wc-result/part-00000 | sort -n -k 2

You can also try passing a "-r #" option to increase the number of parallel reducers.

Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.

As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network by combining each word into a single record.

Page 31:

Presentation Overview

What is map-reduce?

input/output data types

why is it useful and where is it used?

Execution overview

Features

fault tolerance

ordering guarantee

other perks and bonuses

Hands-on demonstration and follow-along

Page 32:

Does map-reduce satisfy all needs?

Map-reduce is great for homogeneous data, such as grepping a large collection of files or word-counting a huge document.

Joining heterogeneous databases does not work well.

As is, we'd need additional map-reduce steps, such as map-reducing one database and reading from the others on the fly.

We want to support relational algebra.

Page 33:

Solution

The solution to these problems is map-reduce-merge: map-reduce with an additional merge step.

The merge phase makes it easier to process data relationships among heterogeneous data sets.

Types:

map: (k1, v1)α → [(k2, v2)]α

reduce: (k2, [v2])α → (k2, [v3])α (notice that the output [v3] is a list)

merge: ((k2, [v3])α, (k3, [v4])β) → (k4, v5)γ

Here α, β, and γ denote the lineage (source data set) of each key/value pair.

If α = β, then the merge step performs a self-merge (a self-join in relational algebra).

Page 34:

New terms

Partition selector: determines which data partitions produced by reducers should be retrieved for merging.

Processor: user-defined logic for processing data from an individual source.

Merger: user-defined logic for processing data merged from two sources when it satisfies a merge condition.

Configurable iterator: next slide.

Page 35:

Configurable iterators

The map and reduce user-defined functions get one iterator for the values.

The merge function gets two iterators, one for each data source.

The iterators do not have to simply move forward; they can be configured to move however the user's merge logic requires.

Relational join algorithms correspond to specific iteration patterns in the merging step.
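To make this concrete, here is a minimal single-machine sketch of a sort-merge equi-join, the kind of iterator pattern the merge step uses. Each source's reduced, key-sorted output is consumed through its own iterator; the record types, names, and merger logic are illustrative, and keys are assumed unique within each source.

import java.util.*;

public class MergeJoin {
    // merge: walk two key-sorted streams in lock-step, emitting a merged
    // record whenever the keys match (the merge condition).
    static List<String> merge(Iterator<Map.Entry<Integer, String>> alpha,
                              Iterator<Map.Entry<Integer, String>> beta) {
        List<String> out = new ArrayList<>();
        Map.Entry<Integer, String> a = alpha.hasNext() ? alpha.next() : null;
        Map.Entry<Integer, String> b = beta.hasNext() ? beta.next() : null;
        while (a != null && b != null) {
            int cmp = a.getKey().compareTo(b.getKey());
            if (cmp == 0) {
                out.add(a.getValue() + "|" + b.getValue());
                a = alpha.hasNext() ? alpha.next() : null;
                b = beta.hasNext() ? beta.next() : null;
            } else if (cmp < 0) {
                a = alpha.hasNext() ? alpha.next() : null; // advance the lagging side
            } else {
                b = beta.hasNext() ? beta.next() : null;
            }
        }
        return out;
    }
    public static void main(String[] args) {
        TreeMap<Integer, String> emps  = new TreeMap<>(Map.of(1, "Emp:Ann", 2, "Emp:Bob"));
        TreeMap<Integer, String> depts = new TreeMap<>(Map.of(2, "Dept:IT", 3, "Dept:HR"));
        // prints: [Emp:Bob|Dept:IT]
        System.out.println(merge(emps.entrySet().iterator(), depts.entrySet().iterator()));
    }
}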