Shuffle sort 101


TRANSCRIPT

Page 1: Shuffle sort 101

Page 2: Shuffle sort 101

Page 3: Shuffle sort 101

Page 4: Shuffle sort 101

Page 5: Shuffle sort 101

When the buffer fill level exceeds io.sort.spill.percent, a spill thread starts. The spill thread begins at the start of the buffer and spills keys and values to disk. If the buffer fills up before the spill is complete, the mapper blocks until the spill finishes. The spill is complete when the buffer has been fully flushed. The mapper then continues filling the buffer until another spill begins, and it loops like this until the mapper has emitted all of its K,V pairs.

A larger value for io.sort.mb means more K,V pairs fit in memory, so you experience fewer spills. Lowering io.sort.spill.percent starts the spill earlier, giving the spill thread more time to drain the buffer before it fills, so the mapper blocks less often.
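As a rough sketch (assuming the Hadoop 1.x property names used in this deck, with purely illustrative values), both knobs can be overridden per job on the job configuration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SortBufferTuning {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative values only: give the map-side sort buffer more room
        // so more K,V pairs fit in memory and fewer spills occur.
        conf.set("io.sort.mb", "200");              // default is 100 (MB)
        // Start spilling a little earlier so the spill thread has more time
        // to drain the buffer before the mapper blocks.
        conf.set("io.sort.spill.percent", "0.75");  // default is 0.80
        Job job = new Job(conf, "sort-buffer-tuning-sketch");
        // ... set mapper, reducer, input/output paths as usual ...
      }
    }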


Page 6: Shuffle sort 101

Another threshold parameter is io.sort.record.percent. This fraction of the buffer is set aside for the accounting info that is required for each record. If the accounting area fills up, a spill begins. The space needed for accounting info is a function of the number of records, not of the record size; a job that emits a large number of small records may therefore need more accounting room to reduce spills.
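To make that concrete, here is a back-of-the-envelope sketch. The 16 bytes of accounting data per record matches the pre-MAPREDUCE-64 MapOutputBuffer, but treat the exact constant as an assumption:

    public class AccountingSpaceEstimate {
      public static void main(String[] args) {
        final long ioSortMb = 100;          // io.sort.mb, in MB
        final double recordPercent = 0.05;  // io.sort.record.percent (default)
        final long bytesPerRecord = 16;     // accounting bytes per record (assumed)

        long accountingBytes = (long) (ioSortMb * 1024 * 1024 * recordPercent);
        long recordCapacity = accountingBytes / bytesPerRecord;

        // With the values above, roughly 5 MB is reserved for accounting,
        // enough to track about 327,000 records regardless of how big each
        // record is. A job emitting many tiny records hits this limit (and
        // spills) long before the data portion of the buffer is full.
        System.out.printf("accounting space: %d bytes, record capacity: %d%n",
            accountingBytes, recordCapacity);
      }
    }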


Page 7: Shuffle sort 101

From MAPREDUCE-64.

The point here is that the buffer is actually a circular data structure with two parts: the key/value index and the data buffer itself. The key/value index is the “accounting info”.

MAPREDUCE-64 essentially patches the code so that io.sort.record.percent is auto-tuned instead of being set manually.


Page 8: Shuffle sort 101

This is a diagram of a single spill. The result is a partitioned, possibly-combined spill file sitting in one of the locations of mapred.local.dir on local disk.

This is a “hot path” in the code. Spills happen often, and there are insertion points for user/developer code: the partitioner, more importantly the combiner, and most importantly the key comparator and the value grouping comparator. If you don’t include a combiner, or your combiner is ineffective, you spill more data through the entire cycle. If your comparators are inefficient, the whole sort process slows down.
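A minimal sketch of where those insertion points are wired up, using the org.apache.hadoop.mapreduce API; the stock Hadoop classes here stand in for your own partitioner, combiner, and comparators:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class ShuffleHooks {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "shuffle-hooks-sketch");
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Decides which reducer each K,V pair goes to (here, the default hash).
        job.setPartitionerClass(HashPartitioner.class);
        // Pre-aggregates map output during each spill/merge; an effective
        // combiner shrinks every later stage of the shuffle.
        job.setCombinerClass(IntSumReducer.class);
        // Orders keys during the sort; an inefficient comparator slows every spill.
        job.setSortComparatorClass(Text.Comparator.class);
        // Decides which keys end up in the same call to reduce().
        job.setGroupingComparatorClass(Text.Comparator.class);
      }
    }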


Page 9: Shuffle sort 101

This illustrates how a tasktracker’s mapred.local.dir might look towards the end of a map task that is processing a large volume of data. Spill files are written to disk round-robin across the directories specified by mapred.local.dir. Each spill file is partitioned and sorted within the context of a single RAM-sized chunk of data.

Before those files can be served to the reducers, they have to be merged. But how do you merge files that are already about as large as a buffer?


Page 10: Shuffle sort 101

The good news is that it’s computationally very inexpensive to merge sorted sets to produce a final sorted set. However, it is very IO intensive.

This slide illustrates the spill/merge cycle required to merge the multiple spill files into a single output file ready to be served to the reducer. The example shows the relationship between io.sort.factor (2, for illustration) and the number of merges. The smaller io.sort.factor is, the more merges and spills are required, the more disk IO you incur, and the slower your job runs. The larger it is, the more memory is required, but the faster things go. A developer can tweak these settings per job, and it’s very important to do so, because they directly affect the IO characteristics (and thus the performance) of your MapReduce job.

In real life, io.sort.factor defaults to 10, and this still leads to too many spills and merges when data really scales. You can increase io.sort.factor to 100 or more on large clusters or big data sets.
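For a rough feel of the arithmetic, here is a sketch of how the number of merges falls as io.sort.factor grows. This is a deliberately simplified model (Hadoop's real merge planner is more subtle about how it sizes the passes):

    public class MergePassEstimate {
      // Simplified model: repeatedly merge up to `factor` files at a time
      // until one file remains; count how many intermediate merges that takes.
      static int mergePasses(int spillFiles, int factor) {
        int files = spillFiles;
        int merges = 0;
        while (files > 1) {
          int group = Math.min(factor, files);
          files = files - group + 1;  // `group` inputs become one merged output
          merges++;
        }
        return merges;
      }

      public static void main(String[] args) {
        // e.g. 40 spill files: factor 10 needs far fewer merges than factor 2,
        // and each avoided merge is a full read+write of that data on disk.
        System.out.println("factor   2: " + mergePasses(40, 2) + " merges");
        System.out.println("factor  10: " + mergePasses(40, 10) + " merges");
        System.out.println("factor 100: " + mergePasses(40, 100) + " merges");
      }
    }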


Page 11: Shuffle sort 101

In this crude illustration, we’ve increased io.sort.factor to 3 from 2. In this case, we cut the number of merges required to achieve the same result in half. This cuts down the number of spills, the number of times the combiner is called, and one full pass through the entire data set. As you can see, io.sort.factor is a very important parameter!


Page 12: Shuffle sort 101

Reducers obtain data from mappers via HTTP calls. Each HTTP connection has to be serviced by an HTTP thread. The number of HTTP threads running on a task tracker dictates the number of parallel reducers we can connect to. For illustration purposes here, we set the value to 1 and watch all the other reducers queue up. This slows things down.


Page 13: Shuffle sort 101

Increasing the number of HTTP threads increases the amount of parallelism we can achieve in the shuffle-sort phase, transferring data to the reducers.
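For reference, the knob in question in Hadoop 1.x is tasktracker.http.threads (default 40). It is a daemon-level setting read from the tasktracker’s configuration at startup, not a per-job setting; the sketch below only shows how you might inspect the value a cluster is configured with:

    import org.apache.hadoop.conf.Configuration;

    public class HttpThreadsCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Pick up the cluster's mapred-site.xml if it is on the classpath.
        conf.addResource("mapred-site.xml");

        // Daemon-level setting: changing it means editing mapred-site.xml on
        // the tasktrackers and restarting them; it cannot be set per job.
        int httpThreads = conf.getInt("tasktracker.http.threads", 40);
        System.out.println("tasktracker.http.threads = " + httpThreads);
      }
    }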

Page 14: Shuffle sort 101

Page 15: Shuffle sort 101



Page 16: Shuffle sort 101

The parallel copies configuration (mapred.reduce.parallel.copies) allows the reducer to retrieve map output from multiple mappers out in the cluster in parallel.

If the reducer experiences a connection failure to a mapper, it tries again, exponentially backing off in a loop until the value of mapred.reduce.copy.backoff is exceeded. Then we time out and fail that reducer.
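A minimal per-job sketch of these two settings (Hadoop 1.x property names; the values are only illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceFetchTuning {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Number of mappers each reducer fetches from in parallel (default 5).
        conf.set("mapred.reduce.parallel.copies", "20");
        // Upper bound, in seconds, on the exponential backoff for a failing
        // fetch before it is declared failed (default 300).
        conf.set("mapred.reduce.copy.backoff", "300");
        Job job = new Job(conf, "reduce-fetch-tuning-sketch");
        // ... set mapper, reducer, input/output paths as usual ...
      }
    }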


Page 17: Shuffle sort 101

“That which is written must be read”

In a process very similar to the one by which map output is spilled and merged to create the mapper’s final output file, the output from multiple mappers must be read, merged, and spilled to create the input for the reduce function. The result of the final merge is not written to disk as a spill file, but is instead passed to reduce() as a parameter.

This means that if you have a mistake or a misconfiguration that is slowing you down on the map side, the exact same mistake slows you down again on the reduce side. When you don’t have combiners in the mix reducing the number of map outputs, the problem is compounded.


Page 18: Shuffle sort 101

Suppose K is really a composite key that can be expanded into fields K1, K2, …, Kn. On the map side, we set the sort comparator to respect ALL parts of that key.

For the reducer, however, we supply a “grouping comparator” which respects only a SUBSET of those fields. All keys that are equal according to this subset are sent to the same call to reduce().

The result is that keys that are equal by the “grouping comparator” go to the same call to “reduce” with their associated values, which have already been sorted by the more precise key.
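A minimal sketch of such a grouping comparator, assuming (purely for illustration) that the composite key is a Text of the form "K1#K2" and that grouping should respect only K1:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Groups keys by the portion before '#' only; the full "K1#K2" key is
    // still used by the sort comparator, so values arrive at reduce()
    // ordered by K2 within each K1 group.
    public class PrefixGroupingComparator extends WritableComparator {
      protected PrefixGroupingComparator() {
        super(Text.class, true);  // true: create instances so keys can be deserialized
      }

      private static String prefix(Text key) {
        String s = key.toString();
        int sep = s.indexOf('#');
        return sep < 0 ? s : s.substring(0, sep);
      }

      @Override
      @SuppressWarnings("rawtypes")
      public int compare(WritableComparable a, WritableComparable b) {
        return prefix((Text) a).compareTo(prefix((Text) b));
      }
    }

    // Wired up on the job (hypothetical job variable):
    //   job.setGroupingComparatorClass(PrefixGroupingComparator.class);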


Page 19: Shuffle sort 101

This slide illustrates the secondary sort process independently of the shuffle-sort. The sort comparator orders all of the key/value pairs. The grouping comparator just determines equivalence, in terms of which calls to reduce() get which data elements. The catch is that the grouping comparator has to respect the rules of the sort comparator: it can only be less restrictive. In other words, keys that appear equal to the grouping comparator go to the same call to reduce(). The grouping does not actually reorder any values.


Page 20: Shuffle sort 101

In this crude illustration, we’ve increased io.sort.factor to 3 from 2. In this case, we cut the number of merges required to achieve the same result in half. This cuts down the number of spills, and one full pass through the entire data set. As you can see, io.sort.factor is a very important parameter!


Page 21: Shuffle sort 101

The size of the reducer’s buffer is specified by mapred.job.shuffle.input.buffer.percent, as a percentage of the total heap allocated to the reduce task. When this buffer fills, the fetched map outputs spill to disk and have to be merged later. The spill begins when the mapred.job.shuffle.merge.percent threshold is reached; this is specified as a percentage of the input buffer size. You can increase this value to reduce the number of trips to disk in the reduce phase.

Another parameter to pay attention to is mapred.inmem.merge.threshold, which is expressed as a number of map outputs held in memory. When this count is reached, we spill to disk. If your mappers explode the data the way wordcount does, consider setting this value to zero.
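A sketch of these three reduce-side knobs on a job configuration (Hadoop 1.x property names; the values shown happen to be the stock defaults and are used only as illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceBufferTuning {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Fraction of the reduce task's heap used to buffer fetched map output.
        conf.set("mapred.job.shuffle.input.buffer.percent", "0.70");
        // Fraction of that buffer that must fill before an in-memory merge
        // and spill to disk begins.
        conf.set("mapred.job.shuffle.merge.percent", "0.66");
        // Number of map outputs accumulated in memory before merging/spilling;
        // 0 leaves the decision to the merge percent alone.
        conf.set("mapred.inmem.merge.threshold", "1000");
        Job job = new Job(conf, "reduce-buffer-tuning-sketch");
        // ... set mapper, reducer, input/output paths as usual ...
      }
    }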


Page 22: Shuffle sort 101

In addition to being a little funny, the point here is that while there are a lot of tunables to consider in Hadoop, you really only need to focus on a few at a time in order to get optimum performance of any specific job.

Cluster administrators typically set default values for these tunables, but those are really best guesses based on their understanding of Hadoop and of the jobs users will be submitting to the cluster. Any user can submit a job that cripples a cluster, so in the interests of themselves and of the other users, it behooves developers to understand and override these configurations.

Page 23: Shuffle sort 101

Page 24: Shuffle sort 101

Page 25: Shuffle sort 101

Page 26: Shuffle sort 101

These numbers will grow with scale, but the ratios will remain the same. Therefore, you should be able to tune your MapReduce job on small data sets before unleashing it on large data sets.

Page 27: Shuffle sort 101

Page 28: Shuffle sort 101

Start with a naïve implementation of wordcount with no combiner, and tune io.sort.mb and io.sort.factor down to very small levels. Run with these settings on a very small data set. Then run again on a data set twice the size. Now tune up io.sort.mb and/or io.sort.factor. Also play with mapred.inmem.merge.threshold.

Now, add a combiner.

Now, tweak the wordcount to keep a local in-memory hash updated. This causes more memory consumption in the mapper, but reduces the data set going into combine() and also reduces the amount of data spilled.

On each run, note the counters. What works best for you?
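One possible sketch of the in-mapper combining tweak described above, assuming the usual TextInputFormat key/value types:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Aggregates counts per word in memory and emits each word once per task,
    // so far fewer K,V pairs reach the sort buffer, the combiner, and the
    // spill files -- at the cost of extra mapper heap.
    public class InMapperCombiningWordCount
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private final Map<String, Integer> counts = new HashMap<String, Integer>();

      @Override
      protected void map(LongWritable key, Text value, Context context) {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          String word = tokens.nextToken();
          Integer current = counts.get(word);
          counts.put(word, current == null ? 1 : current + 1);
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        // Emit the aggregated counts once, at the end of the map task.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
          context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
      }
    }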

Page 29: Shuffle sort 101

Page 30: Shuffle sort 101

Page 31: Shuffle sort 101

Page 32: Shuffle sort 101

Page 33: Shuffle sort 101