Is Parallelism Hard or Not?

Butler Lampson helped create threading at Xerox PARC in the 1970s. He concedes that forty years of evidence have shown threading doesn't work. Nobody threads anything if they can help it.

Lampson describes the "wizard" process of multithreading. If you absolutely must thread something, find a wizard. The wizard crams all the parallelism into a box. Only the wizard opens the box. If a big project has any parallelization at all, it usually works like this.

Because programs are so rarely multithreaded, whenever they become mulithreaded, it's cause for celebration. FFmpeg ran essentially single-threaded for ten years. When Alexander Strange added frame-level multithreading, it was a big deal. Despite this achievement, today most of the encoding parts and all of the audio parts of FFmpeg are single-threaded.

Some Previous Directions

Threading is not the only parallelization model. Linux harbors multiple competing models, most of them effectively abandonded. The other Linux models descend from the idea that inter-process concurrency is good enough to be the only concurrency. Because the operating system must already schedule processes in parallel, it's easy to create a sub-process which runs parallel to a parent. This observation leads to fork, which is a sometimes elegant, sometimes exasperating method.

The fork method clones the process at a point in time. Writes inside of the clone are private by default, and so the clone must communicate with the parent through pipes, which introduces an unwanted server-client layer; or through shared memory, using either System V or POSIX style allocations; or through some other increasingly obscure method of inter-process communication.

The main difference between threading and forking is that threading shares memory by default while forking keeps it private. Reaching the correct behavior from threading requires guarding against untimely access, while from forking it requires sharing the appropriate information, and then guarding that against untimely access. When the amount of information to be shared is zero, there's a clear advantage over threading. Forking wins when the processes don't have much to say to each other.

In terms of deployment, threads have won, though this can hardly be viewed as an endorsement of the model. Threads require less from the operating system layer, both in terms of permissions and development, and so it is easier to provide parallelism via threads. The ability to spawn threads and spin mutexes is enough to be "parallel-complete", and because of that many platforms stop here. In truth, neither threads nor forks are very good, because they both adopt the attitude of providing the minimal amount of parallel assistance. Not having to write assembly is the whole point of C, but the threading primitives available inside C are at or below the assembly level for parallelism. Here are some AND gates and some NOT gates, go write FizzBuzz.

Where We're at Now

Parallelism is in a bad place right now. Additional transistors aren't making faster chips, they're making additional cores, which aren't being used. A few different barriers ensure these cores will be dark much of the time: the inescapably single-threaded nature of some program logic, memory throughput limitations, etc. So the goal for parallelism is not to peg all cores at all times, but to power on the dark parts of the chip whenever expedient. Still we're failing miserably at that.

I am optimistic that there is a way to make this kind of parallelism work. First, I don't have any sense that there are further barriers that make this reduced problem impossible: usually when something is impossible the barriers become apparent quickly. Second, we have seen substantial recent progress towards better parallelism, and I don't see any indication of that potential being exhausted.

The most compelling argument that there is work to be done comes in the form of Apple's Grand Central Dispatch (2009). Getting into the details here is out of scope, but we can give an overview. GCD imposes a queue structure on parallelism. There are two types of queue: synchronous and asynchronous, and they execute jobs in the corresponding fashion. Right away this solves the majority of easy parallel tasks (uploading an image in the background of the UI thread, for instance). What makes this simple sounding mechanism powerful is that jobs can grab locks from inside the queue lanes to alter the flow of traffic. This framework for reasoning about parallelism makes concurrent programming a lot easier, maybe by an order of magnitude. In several years of iOS programming every nasty parallel problem I faced reduced to an uncomplicated queue diagram. Code blocks and other niceties remove a lot of unimportant details. That said, GCD is a low-level solution to the problem.

There is another recent language-level approach to the problem that demonstrates there's more ground to cover. In Kerf we call it mapcores. As far as I know, the technique comes from a suggestion Jake Loveless made to Arthur Whitney (around 2005), which was to parallelize the APL each operator. It is essentially a parallel-for, but what is remarkable about it is how easily it converts serial code to parallel. You only have to add the mapcores token and then the lambda applies to the argument's items in parallel, and the results are collected with no side effects. All the details of doing so are obviated. This technique again scoops up many simple parallelizable tasks (such as simultaneous reading from connected disks). It is an important step towards parallelizing more code quicker, both in userspace and in the internals. By writing certain Kerf primitives in Kerf with mapcores, we immediately parallelize any future invocation of that primitive. Because the user doesn't have to think about it, this gets us closer to automatic parallelization.

Go's parallelism model in goroutines is worth mentioning (2009), since it introduces many useful improvements over basic pthreads and an interesting discipline for handling them. Still, the philosophy here is that some model based on threads is the right way to go about things, and I'm not convinced that's the best we can do. What Go should be commended for is for taking parallelization into account from the very beginning.

Most languages are designed without any parallelism in mind at all, and as a result they repeat the same pattern of tacking it on later. It's hard to measure the negative impact this has had on parallelization techniques, but it must be substantial. If nothing else it shows few language designers are making serious attempts at solving the problem.

What's Next?

I don't know what the solution to the parallel problem is, but I do have some ideas about what it's liable to look like.

The solution will optimize for the case of c cores, where c is small, say 4 or 32. There is a lot of theoretical research on massively-parallel machines (big c), but none of this applies to CPUs in widespread use today. There are interesting things you could do if the standard architecture was modeled on the Connection Machine, but that just isn't the case. GPUs resemble this but so far don't seem to be a way forward.

I don't think that the solution will ever be fully automatic. There are certainly many things that could be parallelized automatically (or via switch): device-level pipelining is one, for-loops are another, and we should aim to get as many of these as we can. But some problems will resist this approach, if for no other reason than by having multiple solutions. Should I read a matrix by row or by column? The order matters. So the conclusion is that an approach must make it easy for the programmer to implement the solutions he already sees. This implies a framework or a discipline or a new way of thinking that the programmer can apply.

The solution won't come in the form of a new parallel language: nobody sits down with a parallel application in mind (they sit down to write a Pokémon videogame or something else specific), and the benefits from parallelization are unlikely to be large enough to justify an industry shift from say C to a brand new language. So the fix would probably look like an operating system improvement, or a revision of pthreads, or a library like GCD, or a compiler extension, or another idea that will play nicely with the existing code infrastructure. Perhaps we'd shoot for something that would eventually be incorporated into language standards. This idea could even be drawn from a parallel language.

Our field has bypassed parallel programming in favor of distributed computing, where several approaches have real traction. If the trend continues long enough, and the distributed ecosystem matures enough, what we might see is this methodology filtering back into the hardware side of things. It's completely feasible, from a programming standpoint, to imagine an architecture where each core has its own private RAM, the cores communicate through perfectly reliable IPC, and the device runs applications written in a successor to the distributed computing technology being developed now. Whether such an architecture could make economic or practical sense, I wouldn't know. If we delay long enough, the need for a parallel solution could disappear temporarily. But this isn't guaranteed, and we're compelled to look anyway. Who's to say it won't be cost effective to put multiple cores on each section of private RAM?