In Part One, we talked about the key dilemma of programming Python in a concurrent world. In a nutshell, for CPU-bound tasks, Python threads will not help. So what do we do instead?
Broadly speaking, there are two kinds of real-world concurrency problems you will face: tasks that can be sped up by throwing more CPUs at them, and tasks that cannot.
In this essay, you're going to learn about the first one: tasks that ARE empowered by having extra CPUs. We do this by writing Python code to run across several processes. In the best case, having N CPUs will let you run the program N times faster.
(The second kind, which we will cover in part 3, has to do with juggling multiple tasks that are not CPU-bound. For something like this, having multiple CPUs or cores doesn't really help. To win that game, you have to get the absolute most you can out of a single thread... which modern Python gives you some great tools for.)
Among Python programmers, one thing separating the good from the great is an understanding of the concurrency primitives provided by modern operating systems - processes and threads - and the different patterns for working with them. This is fundamentally independent of language. You may know some of it from your formal education, or your experience implementing concurrent applications in other languages.
Broadly, there are two main approaches to implementing concurrent systems: Shared Memory, and Message Passing. The key difference: shared memory means your program is designed for two or more threads to read from and write to the same block of memory - dancing without stepping on each other's toes. In the message passing approach, any chunk of memory is only accessed by a single thread. When two or more need to synchronize, interact or communicate, data is sent from one to another - i.e., a message is passed.
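To make the distinction concrete, here is a minimal message-passing sketch using multiprocessing.Queue. The worker function and the doubling "work" are throwaway examples, not anything from Thumper: each process touches only its own memory, and data moves between them as messages.

```python
from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # The worker only ever sees data that arrives as a message.
    # iter(inbox.get, None) keeps reading until the None sentinel.
    for item in iter(inbox.get, None):
        outbox.put(item * 2)

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    p = Process(target=worker, args=(inbox, outbox))
    p.start()
    for n in (1, 2, 3):
        inbox.put(n)          # pass messages in
    inbox.put(None)           # tell the worker to stop
    results = sorted(outbox.get() for _ in range(3))
    p.join()
    assert results == [2, 4, 6]
```

No lock is needed anywhere: because no memory is shared, there is nothing to step on.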
Now, the two threads can be in the same process, or in two different processes - each of which has its own main thread. So you can use shared memory or message passing in either. Typically, though, people use shared memory only in single-process, multi-threaded systems. It's possible to walk the shared-memory path in a multiprocess program, via operating-system constructs or IPC, but it's not as performant, and can be complex to implement.
I recommend you not go down that path, though, if you can avoid it. Those of you who have done it know why: it's very difficult to do well, and too easy to create subtle race conditions, deadlocks, and other bugs. Sometimes you can't avoid sharing memory. But if you can use message-passing instead, do it.
For multiprocess programs, message passing tends to be a more natural fit anyway, for two reasons. First, shared memory across processes is neither as performant nor as elegant as it is within a single-process, multithreaded program. More importantly, many of the kinds of programs you'd want to implement using multiple processes lend themselves naturally to a message-passing architecture, especially in Python.
Python essentially forces you to use multiple processes to leverage more than one CPU. There are ways to share memory across processes, and modern Python makes it relatively easy to do. But we'll focus on the message-passing approach here.
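For completeness, here's what the shared-memory route across processes can look like - a minimal sketch using multiprocessing.Value. The counter and the worker count are arbitrary examples:

```python
from multiprocessing import Process, Value

def bump(counter):
    # Value's built-in lock protects the read-modify-write
    # against other processes doing the same thing.
    with counter.get_lock():
        counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)   # a C int living in shared memory
    procs = [Process(target=bump, args=(counter,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    assert counter.value == 4
```

Notice that even this tiny example needs a lock - a taste of why the shared-memory path gets hairy as programs grow.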
We really have it easy now. Once upon a time, if you wanted Python to use multiple CPUs well, you had to do something horrible, hacky and unportable using os.fork. But now we have the multiprocessing module in the standard library: a nice Pythonic interface for working with multiple processes, making many hard things easy. There are plenty of articles out there with nice, neat toy examples. But how about we look at something more realistic?
Let me introduce you to Thumper. It's a Python program that generates thumbnails of gigantic image libraries. This is what you want if your webapp fetches a new batch of 100,000 images at 2am each night, and you need thumbnails of them ready to view in a reasonable amount of time. This is a perfect use case for multiple CPUs: generating a thumbnail is CPU-bound, and the calculation is completely independent for any two images.
Feast your eyes on this beauty:
This comes from repeatedly running Thumper on a data set of NASA TIFF images, on an AWS c3.2xlarge instance. The horizontal axis is the number of worker processes spawned.
There's a LOT going on in this graph. I'm working on an Advanced Python Mastery course, which will delve into much of it, granting my students amazing Batman-like concurrent programming superpowers. For now, we are going to focus on Thumper's implementation - and how it bypasses the GIL to scale with the number of available CPU cores.
The core is quite simple, thanks to Pillow - the modern, maintained fork of PIL (Python Imaging Library):
# PIL is actually Pillow. Confusing, I know,
# but nicely backwards compatible.
from PIL import Image

def create_thumbnail(src_path, dest_path, thumbnail_width, thumbnail_height):
    image = Image.open(src_path)
    # thumbnail() resizes in place, preserving the aspect ratio.
    image.thumbnail((thumbnail_width, thumbnail_height))
    image.save(dest_path)
Very straightforward. Now we want to farm this out to several worker processes. The multiprocessing module has a very useful abstraction for this, in the form of the Pool class. You create it like so:
import multiprocessing

pool = multiprocessing.Pool(num_processes)
Whew, that was hard! The word "multiprocessing" has way too many characters... so much typing. Thankfully that's behind us, and what's left is to dispatch images to each of these worker processes. Pool provides several different ways to do this. Each takes a callable - which will be create_thumbnail, in our case - and its arguments.
Some of the different pool methods are:

apply_async - Takes a function and some arguments, and sends it to one of the worker processes to be run. Returns immediately with a result object, which is kind of like a future: you can get the returned value from it once it's done.

map and map_async - Similar to Python's built-in map() function, except the work is farmed out over the child processes. map will block until all jobs are done; map_async immediately returns a result object. Limitation: the callable can only take one argument.

imap and imap_unordered - Like map, but return an iterator instead of a completed sequence. Could have been called lazy_map. (imap_unordered yields results in whatever order they finish.)

starmap and starmap_async - Like map and map_async, except the callable can take multiple arguments.
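The differences are easier to see side by side. Here's a small sketch - square and add are throwaway example functions, not part of Thumper:

```python
from multiprocessing import Pool

def square(x):
    return x * x

def add(x, y):
    return x + y

if __name__ == "__main__":
    with Pool(2) as pool:
        # map: one argument per job, blocks until every job is done
        assert pool.map(square, [1, 2, 3]) == [1, 4, 9]
        # starmap: each tuple is unpacked into multiple arguments
        assert pool.starmap(add, [(1, 2), (3, 4)]) == [3, 7]
        # apply_async: returns a result object; .get() blocks for the value
        result = pool.apply_async(square, (5,))
        assert result.get() == 25
```

The *_async variants behave the same way, except they hand you a result object immediately instead of blocking.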
These are most of the methods you will use, and as you can see, they are mostly variants of similar ideas. Since our create_thumbnail function takes multiple arguments, Thumper uses starmap_async, feeding it argument tuples from a generator:
import os

def thumbnail_args(src_dir, dest_dir, thumbnail_width, thumbnail_height):
    # src_dir is the folder containing full-size images.
    # dest_dir is where we are going to write the thumbnails,
    # in a parallel file hierarchy.
    for (dirpath, dirnames, filenames) in os.walk(src_dir):
        for filename in filenames:
            src_path = os.path.join(dirpath, filename)
            dest_path = find_dest_path(src_dir, dest_dir, src_path)
            yield (src_path, dest_path, thumbnail_width, thumbnail_height)
As you can infer, the second argument of starmap_async is an iterable. Since we want Thumper to work with literally millions of images, I coded a memory-efficient generator that creates the argument tuples as needed, rather than building a huge list (one tuple for each image).
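Putting the pieces together, the dispatch step looks roughly like this. Note this is a sketch, not Thumper's actual top-level code: thumbnail_args is a hypothetical name for the generator above, and both stand-in functions here just echo their inputs so the wiring is visible.

```python
import multiprocessing

def create_thumbnail(src_path, dest_path, width, height):
    # Stand-in for the real Pillow-based function.
    return (src_path, dest_path)

def thumbnail_args(jobs):
    # Stand-in for the os.walk-based generator shown earlier.
    for job in jobs:
        yield job

if __name__ == "__main__":
    jobs = [("a.tif", "thumbs/a.tif", 128, 128),
            ("b.tif", "thumbs/b.tif", 128, 128)]
    with multiprocessing.Pool(2) as pool:
        # starmap_async unpacks each tuple into create_thumbnail's arguments
        # and returns immediately with a result object.
        result = pool.starmap_async(create_thumbnail, thumbnail_args(jobs))
        done = result.get()   # blocks until all workers finish
    assert done == [("a.tif", "thumbs/a.tif"), ("b.tif", "thumbs/b.tif")]
```

Because the second argument is consumed lazily by the pool, the generator never has to hold more than a handful of tuples in memory at once.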
How well does it work? It takes some thought at first to understand. Imagine you are using a machine with eight CPUs (or cores). How many will you want Thumper to use? Theoretically, the answer is eight - the number of cores. In practice it may be another value, depending on (a) what else is going on in the system, and (b) what's going on in your application. It's common to find that you get the exact same performance using fewer processes than the number of CPUs on the system. (That means you've hit a point where raw CPU is no longer the bottleneck.) And believe it or not, I've also seen cases where using too many workers causes contention for resources and slows things down.
In my own experience, the only way to know for sure is to test: repeat the execution, varying only the number of worker processes. That's what I did above. Let's focus on elapsed time, because minimizing that is the whole point:
This was done on a c3.2xlarge EC2 instance, a hefty fellow with eight CPUs. The horizontal axis is the number of worker processes spawned by Thumper; the vertical is the total elapsed time to thumbnail all images, so lower is better.
Looking at the graph, as you add worker processes, elapsed time keeps decreasing until you hit six. Why not eight? Or seven? There are many possible reasons, which can be very application specific. In general, you have scaled out as much as you can via CPU, and have now hit another bottleneck. Perhaps you are using all available memory, and have started paging. Perhaps you are now saturating CPU cache. Perhaps you are hitting an I/O throughput limit, where you can't load any more bytes of images per second.
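If you want to run this experiment yourself, the harness can be as simple as timing the same workload at different pool sizes. Here busy is a hypothetical stand-in for the real thumbnail jobs, and the job counts are arbitrary:

```python
import time
from multiprocessing import Pool

def busy(n):
    # A CPU-bound stand-in for one thumbnail job.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed_run(num_workers, jobs):
    # Run the same batch of jobs through a pool of the given size,
    # and return the elapsed wall-clock time.
    start = time.perf_counter()
    with Pool(num_workers) as pool:
        pool.map(busy, jobs)
    return time.perf_counter() - start

if __name__ == "__main__":
    jobs = [200_000] * 16
    for workers in (1, 2, 4):
        print(f"{workers} workers: {timed_run(workers, jobs):.2f}s")
```

Plot elapsed time against worker count, as in the graph above, and the bend in the curve tells you where your real bottleneck lives on that machine.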
As you can see, learning the classes, methods and functions of multiprocessing is only the first step.
For programming in general - in any language - scaling out across CPUs will only get you so far. Plus, throwing more CPUs at a problem gets expensive. If you really want to get the most out of the machine, you also have to master getting the most out of a single thread. And that's what we are going to talk about in part 3. The main key here - the one secret that will let you really kick a single thread's butt - is...
Well, I'll tell you when I publish part 3. Make sure you're subscribed to get the announcement.