How Can a GPU Outperform a CPU?
Introduction—how CPUs work
We need to learn some architectural details to understand the different trade-offs that CPUs and GPUs make and why they make sense.
The first thing to look at is how modern microprocessors actually execute meaningful operations. These devices are made up of many distinct parts, but the most interesting to us are the arithmetic logic units (ALUs) that carry out the elementary operations. The inputs and outputs of these units are registers: very small storage areas, each able to hold a few dozen bytes of data. They sit so close to the ALUs that accessing them can be thought of as immediate. This is not the case for main memory, however: data coming from there has to travel a long distance and pass through many intermediate steps before it lands in a register and becomes available for further operations.
Serving a memory access can take as long as 100 clock cycles. For comparison, simple operations such as addition and multiplication usually take only a few cycles, so whenever the device has to wait for data to arrive, it is wasting precious computation time.
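To get a feel for these numbers on your own machine, here is a minimal, host-only sketch (plain C++; it contains no device code but compiles with nvcc as well as g++). It is an illustration of mine rather than anything from the text: it times a dependent chain of arithmetic, whose operands stay in registers, against a dependent chain of loads that walks a buffer much larger than the caches in random order. The buffer size, the assumed 3 GHz clock used to convert nanoseconds into cycles, and the use of Sattolo's algorithm to defeat prefetching are illustrative choices.

#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    using clk = std::chrono::steady_clock;
    const std::size_t n = (256u << 20) / sizeof(std::size_t);   // ~256 MB of indices

    // Sattolo's algorithm: build a permutation that is a single cycle, so
    // following next[idx] repeatedly visits every element exactly once and
    // every load depends on the previous one; the prefetcher cannot help.
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), 0);
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    // 1) A dependent chain of arithmetic: operands live in registers.
    auto t0 = clk::now();
    double acc = 1.0;
    for (std::size_t i = 0; i < n; ++i) acc = acc * 1.000000001 + 1.0;
    auto t1 = clk::now();

    // 2) A dependent chain of loads: every step waits for main memory.
    std::size_t idx = 0;
    for (std::size_t i = 0; i < n; ++i) idx = next[idx];
    auto t2 = clk::now();

    auto per_op_ns = [n](clk::time_point a, clk::time_point b) {
        return std::chrono::duration<double, std::nano>(b - a).count() / n;
    };
    std::printf("arithmetic chain: %6.2f ns/op\n", per_op_ns(t0, t1));
    std::printf("memory chain    : %6.2f ns/op (~%.0f cycles at an assumed 3 GHz)\n",
                per_op_ns(t1, t2), per_op_ns(t1, t2) * 3.0);
    std::printf("(ignore: %f %zu)\n", acc, idx);   // keep the results alive
    return 0;
}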
CPUs try to mitigate this problem with large caches and clever prefetching logic. Caches are on-chip memories that are faster to access because they sit close to the ALUs, but since they take up space on the processor die, they are quite expensive and limited in size. Caches are usually organized into levels: L1 caches are the smallest, on the order of tens of KBs per core, and sit closest to the ALUs; L2 caches are larger, around a MB, and may or may not be shared across cores; and L3 caches can be 10 MBs or more and are usually shared across many cores.
CPUs cache everything they can, including data, instructions, and memory addresses, and those are just the obvious candidates. Prefetching is the idea that the processor tries to guess what will be needed next and loads it ahead of time to avoid stalls. Together, these and various other measures can hide the large memory latencies very well in typical cases, especially for well-organized data layouts and access patterns. One could say that CPUs are latency-optimized: they try to deliver a result to you as soon as possible.
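The effect of these mitigations is easy to demonstrate. The following host-side sketch (again plain C++, and again my own illustration rather than something from the text) sums the same array twice: once in memory order, where the caches and the prefetcher can do their job, and once in a randomly shuffled order, where they mostly cannot. The amount of arithmetic is identical in both runs; only the access pattern changes, and on a typical machine the difference in runtime is large. The array size and the shuffling seed are arbitrary choices.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Sum the elements of data in the order given by 'order'.
static long long sum_in_order(const std::vector<int>& data,
                              const std::vector<std::size_t>& order) {
    long long s = 0;
    for (std::size_t i : order) s += data[i];
    return s;
}

int main() {
    using clk = std::chrono::steady_clock;
    const std::size_t n = 16u << 20;             // 16M ints, ~64 MB of data

    std::vector<int> data(n, 1);
    std::vector<std::size_t> sequential(n);
    std::iota(sequential.begin(), sequential.end(), 0);
    std::vector<std::size_t> shuffled = sequential;
    std::shuffle(shuffled.begin(), shuffled.end(), std::mt19937_64{7});

    auto run = [&](const std::vector<std::size_t>& order, const char* label) {
        auto t0 = clk::now();
        long long s = sum_in_order(data, order);
        auto t1 = clk::now();
        std::printf("%s: %8.1f ms (sum = %lld)\n", label,
                    std::chrono::duration<double, std::milli>(t1 - t0).count(), s);
    };

    run(sequential, "sequential access");
    run(shuffled,   "random access    ");
    return 0;
}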
However, all of these considerations are from the viewpoint of a single thread running on a single CPU core. We need to keep in mind that concurrent multicore operation can be a real challenge, both for the caches and for the programmers.
How do GPUs work—what is the trick?
Now, GPUs are a bit different. If we look at a GPU chip, we see thousands and thousands of execution units filling up large parts of the die, as opposed to the huge plains of cache on a CPU. The balance is entirely reversed: on a GPU, the amount of cache per core, under any reasonable definition of core, is much smaller than on a CPU.
However, memory is still memory, even on a GPU, and accessing it still comes with a latency of hundreds of clock cycles. So the billion-dollar question is: how do GPUs hide this latency well enough to be usable at all, without relying on caches as much as CPUs do?
The trick is that on the GPU we run the same program on all execution units, an assumption we cannot make on the CPU. This constraint, together with various further aspects (relatively simple tasks, a hardware scheduler, and usually no need for a stack), makes it possible to switch execution between threads with minimal work. Thus, while one thread is waiting for a memory operation to complete, we can switch over to another thread and make progress there until it, too, runs into a memory operation, and then we switch to yet another thread. As long as there are enough threads, we can always make progress with some of them.
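As a concrete illustration, here is a minimal CUDA sketch (my own, not taken from the text) of a memory-bound kernel launched with far more threads than the device has execution units. While one warp is waiting for its loads of x[i] and y[i], the hardware scheduler issues instructions from other resident warps, so the memory latency is overlapped with useful work rather than hidden by caches. The array size and launch configuration are arbitrary illustrative choices.

#include <cstdio>
#include <cuda_runtime.h>

// A memory-bound kernel: each thread does a tiny amount of arithmetic on data
// it first has to fetch from device memory.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // While this thread's warp waits for x[i] and y[i] to arrive, the
        // scheduler runs other warps, which is what hides the latency.
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 24;                       // ~16.7 million elements
    float *x = nullptr, *y = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    const int block = 256;
    const int grid = (n + block - 1) / block;    // tens of thousands of blocks
    saxpy<<<grid, block>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    std::printf("y[0] = %f (expected 5.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}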
Since modern GPUs offer thousands of execution units, we can easily launch thousands of units of work, but even that is not enough: for this latency hiding to be effective, we need to oversaturate the device. The usual rule of thumb recommends issuing work for four times the number of execution units.
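In CUDA terms, one way to apply this rule of thumb is to query the device and size the launch from the number of threads it can keep resident at once. The sketch below uses that resident-thread count as a proxy for execution units; both the proxy and the factor of four are heuristics (the factor comes from the text, the proxy is my own interpretation), not hard rules of the API.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);

    // Threads the chip can keep in flight at the same time.
    const int resident_threads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    const int oversubscription = 4;              // rule of thumb from the text
    const int block = 256;                       // a common block size
    const int grid  = (resident_threads * oversubscription + block - 1) / block;

    std::printf("SMs: %d, resident threads: %d\n",
                prop.multiProcessorCount, resident_threads);
    std::printf("suggested launch: %d blocks x %d threads = %d threads of work\n",
                grid, block, grid * block);
    return 0;
}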
One could say that while CPUs are optimized for latency, GPUs are optimized for throughput: it might take much, much longer until the first result is available, but it won't be a single result; it might easily be 100 000 results!