# Mapreduce Fun: Sampling for Large Data Set | BloomReach Engineering Blog

This post is by Chou-han Yang, principal engineer at BloomReach.

The coolest thing about mapreduce is that we suddenly have enormous computing power and storage at disposal. To me, it’s like a kid who suddenly has a new toy and a desire to incorporate it into his favorite games. What could be more fun than figuring out new ways to play old games?

It reminds me of something I do frequently with single-process programming, something that needs an overhaul to fit into the mapreduce framework. Hadoop in particular also needs some careful tuning in order to run smoothly.

Let’s take a simple example here: sampling n elements from a very large data set with unknown size. I would like walk through all methods even from non-mapreduce to mapreduce so we can look at the pros and cons. First, we do want to make sure our sampling process is not bias, i.e. that every element should have the same probability to be chosen into sampled set. And of course, we hope to achieve with good time and space complexity.

#### Method 1 (non-mapreduce): Shuffle a deck of cards

We can scan entire data set and get the number of elements m, and then we can generate a random number ranging from 0 to m-1. This would allow us to pick the first sample element. The result is not biased assuming the random number generator gives same probability for each number from 0 to m-1. Then, we can remove the element, and do it again with a size m-1 data set.

Of course, the problem here is that we have to know the size of the data set first. We also need random access to any element when it’s chosen, regardless if we load entire data set into memory.

#### Method 2 (non-mapreduce): Reservoir sampling

Given the problems with Method 1, engineers developed new way to sample data online. The question is: If we know the current data size, m, can we decide the probability of the next element, m+1, being chosen into the sampled set?

The answer is obviously yes. As the size of the data set grows as we read new elements, the probability of the new element being sampled decreases proportionally with the size of the data set already seen. By applying reservoir sampling, we can scan very large data set, and get a sample online while we scan every new element.

This is very useful for data streams as well as large data set. But again, the downside is that all data needs to be processed in a single thread or process, which may not provide any benefit if we move to more powerful mapreduce framework.

#### Method 3 (mapreduce): Try to move reservoir sampling to mapreduce framework

Now that we are entering the age of big data, so everyone is using mapreduce like crazy. Natually your first intuition is to employ the same reservoir sampling technique because we don’t want to keep large data set in memory. Not to mention, the mapreduce framework fits better with streaming model as you get one element at a time in mapper or reducer.

So let’s run reservoir sampling in both mapper and reducer. Sounds easy, right? Well, it doesn’t really work the way you might have expected. The problem is that each mapper might not get exactly the same number of elements. That means those elements in a smaller split will have a better chance to be sampled.

Using an identity mapper and dumping everything into a single reduce, so it can do reservior sampling, is hardly ideal. Since we’re putting all the work loads into a single reducer, mapreduce isn’t giving us any benefit. Essentially, this method is the same as processing all the data in a single node.

We may be able to workaround the problem by either controlling the number of elements in each split or by doing some math tricks that remove the probability bias in the reducer. But that’s all very complicated.

#### Method 4 (mapreduce): Revisit data shuffling?

So back to square one. For mapreduce jobs, we can definitely get the total number of elements in data set by scanning the data set. Then we can randomly pick elements in a second pass. This method works, but again it’s not very efficient because we have to scan the data set twice.

So, let’s think agin about how shuffling a deck of cards works. First, we shuffle the cards, which is a randomization process. Secondly, we pick the top n cards as the sampling set, which is simply pick top n elements with some order.

So is it possible that we can shuffle data without moving actual data around? The answer is yes. We can simply apply a hash function to the element. Therefore, we can get a randomized order for each element. At first this doesn’t seem to simplify the problem, because we’re just converting the sampling problem into a data shuffling problem. But consider the fact that the second step, picking the top n element, coud be easily run by mappers with a heap in it.

In short, we hash each element for each mapper and get a random order. Then we use the value to determine if we want to keep it in the heap for the entire split for the entire map task. When the map task is finished, we simply emit the entire heap to the reducer.

So in this case, we still have a single reducer, but the number of elements that goes to the reducer is dramatically reduced. The actually number of elements sending to reducer will be the number of mappers times n.

#### Conclusion

Like a new frameworks, it is fun to re-think the best approach to an old problem. Mapreduce basically opens the door for more functional constructs and you might sometimes need to “unlearn” things in order to full utilize the power of a framework. There may be even better solutions that we didn’t discovered yet. So let us know what you think, and have fun with mapreduce.