How Many Cores Do We Need?

Anwar Ghuloum is a Principal Engineer with Intel’s Microprocessor Technology Lab, and he recently wrote a blog post titled Unwelcome Advice. He proposes that developers should start thinking now about using hundreds or thousands of cores. The post has gotten some web coverage at places like Slashdot.

In the film industry we do have many problems that parallelize very well. Physical simulation is one example that people often associate with parallel processing. Rendering is an even better example: we can often get parallelism not only within a single image, but, because a film is made up of many images, we get additional parallelism by processing multiple images at the same time. My company, Digital Ordnance, builds a high performance image capture and playback system. We take advantage of both CPU and GPU parallelism quite heavily in our systems.

But the thing that struck me about the post is the idea that large numbers of cores are applicable to a wide audience. I’m really not convinced that’s true.

My biggest complaint is that, for typical home or business users, computers are already faster than they need to be. Most users are surfing the web, reading email, running some office applications, and maybe playing some light games. They’re basically doing one task at a time, and that task is limited by human input, not by the CPU. For them, pretty much any modern computer has more than enough power. They’d be better served by reliable disk storage (RAID), better screens, or cheaper systems than by faster CPUs. I’m sure some killer app in the future will set a different baseline for computers (good speech recognition, for example), but until that happens they don’t need faster computers.

I think that handles the majority of the computer users out there. But let’s move on to the other case of users who really do need powerful machines.

First, we need to recognize that adding more cores often comes at the cost of slower clock rates. The total number of cycles per second across all the cores is greater, but each individual core is slower. That means that to take advantage of all those cores, you need a problem that parallelizes across them. If your software only runs on one core, it gets slower, not faster.

Second, we should reflect on Amdahl’s Law. Amdahl’s Law says, roughly, that the speedup you get by parallelizing a problem is limited by the fraction of the work that must be done serially. Suppose you have a problem where 90% of the computation can be done in parallel, and the other 10% is spent managing the CPUs, distributing data, and collating results. As you add more cores, the time spent on the parallel part of the computation goes down, but the serial part remains fixed (or sometimes gets worse, as we’ll see later). As you approach an infinite number of cores, the parallel processing time goes to zero, but that 10% remains constant. That means that if 90% of your compute time can be parallelized, no matter how many cores you throw at it, it will never run more than 10x faster than a single core. If your code is 99% parallelizable, you cannot get better than 100x faster. And that’s the best case; the real situation is worse than that.
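To make that limit concrete, here is a minimal sketch of Amdahl’s Law in Python. The fractions and core counts are just illustrative numbers, not measurements from any real system.

```python
# Amdahl's Law: speedup = 1 / (serial_fraction + parallel_fraction / cores)

def amdahl_speedup(parallel_fraction, cores):
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

for cores in (4, 16, 256, 1_000_000):
    print(f"90% parallel, {cores:>9} cores: {amdahl_speedup(0.90, cores):6.2f}x")
    print(f"99% parallel, {cores:>9} cores: {amdahl_speedup(0.99, cores):6.2f}x")

# As the core count grows, the 90% case creeps toward 10x and the 99% case
# toward 100x -- the serial fraction caps the speedup no matter how many
# cores you add.
```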

We already mentioned one of the issues that limits performance: as you increase the number of cores, the speed of each core typically decreases. I realize that GHz isn’t really a good metric, but for this example I’m going to use it. I can buy a 3.2 GHz single-core processor or a 2.0 GHz quad-core processor. That quad core gives me 8 GHz in total, which should be much faster than 3.2 GHz, right? Well, let’s look at my 90/10 problem above. Say it takes 10 seconds on the single-core processor. If we ran it on one of our 2.0 GHz cores, it should take 3.2 GHz / 2.0 GHz times longer, which is 16 seconds. We know that the 10% serial portion, 1.6 seconds, is constant; we’ll assume the other 14.4 seconds parallelizes perfectly across the 4 cores, for a time of 3.6 seconds. Our total time is therefore 1.6 + 3.6 = 5.2 seconds. We had 2.5 times the computing power, but we only got a 1.9x reduction in compute time. Amdahl’s Law and slower cores ate the rest of that performance.
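Here is the same arithmetic as a short Python sketch, using the hypothetical clock speeds and 10-second baseline from the paragraph above:

```python
single_clock = 3.2   # GHz, single-core part
quad_clock   = 2.0   # GHz, each core of the quad-core part
cores        = 4
baseline     = 10.0  # seconds on the 3.2 GHz single core

# Everything runs (3.2 / 2.0)x slower per core on the 2.0 GHz part.
scaled_total = baseline * (single_clock / quad_clock)   # 16.0 seconds

serial_time   = 0.10 * scaled_total                     # 1.6 s, stays on one core
parallel_time = 0.90 * scaled_total / cores             # 14.4 s split 4 ways = 3.6 s

total = serial_time + parallel_time                     # 5.2 seconds
print(f"quad-core time: {total:.1f} s, speedup: {baseline / total:.2f}x")
# About 1.9x faster despite 2.5x the aggregate GHz.
```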

Now I’ll add another problem to the mix. As Amdahl tells us, your performance increase is limited by the time spent in the serial part of the process, and as you add more cores, the time spent managing them goes up. Consider that your problem has to be divided among all the cores and the results need to be put back together: the more cores you have, the more data you need to move around. Sometimes a data structure can only be updated by one core at a time, to keep the cores from writing on top of each other, and the more cores you have, the more time each of them spends waiting for its turn to write to that data structure. These problems are hard to describe in detail, but they all add to the time spent doing serial processing. So in an ideal world 10% of your compute time would be spent on serial work regardless of how many cores are involved, but in reality that percentage goes up as the number of cores goes up.
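As a rough illustration, here is the Amdahl sketch from earlier extended with a per-core coordination cost. The overhead constant is invented purely to show the shape of the curve, not measured from any real system.

```python
# Amdahl's Law with a made-up coordination cost that grows with core count.

def speedup_with_overhead(parallel_fraction, cores, overhead_per_core=0.002):
    serial = (1.0 - parallel_fraction) + overhead_per_core * cores
    return 1.0 / (serial + parallel_fraction / cores)

for cores in (4, 16, 64, 256, 1024):
    print(f"{cores:>5} cores: {speedup_with_overhead(0.90, cores):5.2f}x")

# With even a tiny per-core coordination cost, the curve peaks and then falls:
# past some point, adding cores makes the job slower, not faster.
```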

Finally, I’ll describe the lockstep, or partitioning, problem. In our 90/10 example I assumed that the 14.4 seconds would be split perfectly among the 4 cores. It turns out that in many cases splitting a problem up evenly is very difficult. Take a familiar parallel processing problem like rendering an image. If we had four cores we might just split the screen into quarters and have each core handle one quarter. In most cases that would work fairly well. But suppose my image is blank in one corner and has a bunch of mirrors in another corner. The core handling the blank section gets done quickly, while the corner with the mirrors takes more work to calculate and therefore takes longer. Since you can’t display the image until all the cores are finished, you end up waiting for the slowest core. You might argue that you could break the image into more pieces to make the partitioning fairer, and you’d be right, that does help. You might also devise a clever scheme where a core that finishes early takes some of the work from a core that is running more slowly. Again, a great idea that helps. But neither of these solutions eliminates the problem: the extra steps add overhead that slows down the general case, and in the end some cores still finish before others. The time spent with cores sitting idle lowers our overall performance improvement.
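Here is a toy simulation of that idea. The per-quarter costs are invented numbers standing in for the blank corner and the corner full of mirrors, and the greedy scheduler is only a stand-in for real work stealing (and it ignores the overhead that finer tiles add in practice).

```python
tile_costs = [1.0, 1.2, 0.2, 6.0]   # seconds of work in each screen quarter
cores = 4

# Static split: one quarter per core, so the frame is done when the
# slowest core finishes, and the other three sit idle.
static_wall_time = max(tile_costs)

# Finer tiles handed out greedily: cores that finish early pick up more
# work, so the imbalance shrinks but does not fully disappear.
fine_tiles = [c / 5 for c in tile_costs for _ in range(5)]   # 20 smaller tiles
core_load = [0.0] * cores
for tile in sorted(fine_tiles, reverse=True):
    core_load[core_load.index(min(core_load))] += tile       # give tile to least-loaded core
dynamic_wall_time = max(core_load)

ideal = sum(tile_costs) / cores
print(f"static split : {static_wall_time:.2f} s")
print(f"finer tiles  : {dynamic_wall_time:.2f} s (perfect balance would be {ideal:.2f} s)")
```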

One other problem, tangentially related to this discussion, is that how you parallelize your code often depends on the target system. If I know my code is going to run on a machine with 1000 cores, I’m going to break my problem down into very small pieces, and in doing so I add a lot of overhead. If I use that same partitioning on a system with only four cores, that extra overhead is much more expensive. Applications do their best to partition their work into the right size for the number of cores they have, but that isn’t easy and isn’t always perfect. Parallelizing your application for one platform is hard enough; parallelizing it for a wide range of platforms is much more difficult.
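As a toy illustration, the sketch below charges a fixed, made-up overhead for every chunk of work and compares a 4-way split against a 1000-way split on both a 4-core and a 1000-core machine; the numbers are assumptions chosen only to show the trade-off.

```python
import math

total_work         = 10.0    # seconds of useful computation (assumed)
overhead_per_chunk = 0.01    # seconds of setup/dispatch per chunk (assumed)

def run_time(chunks, cores):
    per_chunk = total_work / chunks + overhead_per_chunk
    rounds = math.ceil(chunks / cores)      # chunks are processed in waves of `cores`
    return rounds * per_chunk

for chunks in (4, 1000):
    print(f"{chunks:>4} chunks on    4 cores: {run_time(chunks,    4):6.2f} s")
    print(f"{chunks:>4} chunks on 1000 cores: {run_time(chunks, 1000):6.2f} s")

# A 1000-way split shines on 1000 cores but pays dearly for its overhead
# on 4 cores; a 4-way split wastes a 1000-core machine entirely.
```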

Even with all this naysaying, there are some applications, or portions of applications, that will be able to use more cores effectively. There are exciting projects we’re undertaking where we use as many cores as we can get. But there are also bottlenecks where you want one core to run as fast as possible. It would be interesting to work with an architecture that had a few really fast cores and many slower cores. The exact balance would vary by application, but it would provide a unique way to handle both the serial and the parallel aspects of a program. That’s a problem for the hardware architects to chew on.

I also believe that specialized cores (such as GPUs) are interesting solutions. They work very well for our application, because we’re doing a lot of image processing, and perhaps specialized cores that perform matrix operations quickly would be useful to other customers. The downside to these non-standard cores is that programming them to take advantage of their power is often difficult. Tools need to be put in place that make it just as easy to use these special cores as it is to use a standard core.

Finally, compilers need to get a lot smarter. They need to determine when there is parallelism to exploit and automatically generate the code to use it. To do that, the cores have to be characterized well enough that a compiler can decide whether the performance improvement would outweigh the overhead. And if we consider that you might write the application for a wide variety of platforms, that sort of optimization would have to happen at run time. How much just-in-time compiling would that require?

All these cores, in a variety of designs, are making for a very interesting computing landscape in the years to come. I hope we continue to make the most of it.
