Optimisation Strategies - Cache

The following topics will be discussed in this section:


Cache

Traditionally, the fetch and store steps have been very slow due to the significant amount of time required to access system main memory. This is because the memory bus (FSB) is very slow compared to the CPU's own internal clock (and this is in turn predominantly ascribed to limitations on the speed of the memory itself). Every cycle counts!

A well-optimised program will have a minimum amount of main memory transfers. However, such transfers are inevitable and so a solution to the slow speed of the fetch operation must be found...

One way around this problem is to pull instructions from the much faster cache memory instead. As discussed in the memory section, cache memory exists as static RAM which is much faster than the DRAM which makes up a system's main memory.

How can this work? Well, programs, by design, tend to repeat the same instructions over and over again. This is in part because programmer's often use repeating loops in their code since this aids efficiency and code structure. Cache works by storing these often-used instructions (and data) locally, so that the CPU doesn't have to go all the way to main memory to find them.


Temporal Versus Spatial Data

Cache works by facilitating access to temporal and spatial data. Great. What does that mean? Put simply, temporal data is data which has been accessed recently. It is likely (for reasons already mentioned above) that data which has been accessed recently may well be required again. Thus data which has just been accessed should be stored in cache so that it can be fetched from this fast cache, should it be needed in the near future.

Spatial data works a little differently. In principle, if data has just been pulled from memory, there is a high probability that data located close to it in memory (in terms of memory address space) will also be needed in the near future. For example, consider accessing the elements of an array stored in memory. Each element will be stored in an adjacent location in memory. Programs often process all elements of an array, so if one element is read, it is likely that the others will also be read. As another example, consider instructions themselves. Programs are simply a series of instructions, stored one after another in memory. Unless the program involves some sort of branch, the next instruction will be located (in memory) adjacent to the current instruction. Thus, when fetching instructions from memory, it makes sense to also fetch those instructions that are nearby.

For this reason, data read from memory is typically read in blocks known as cache lines.


Storing Data

The examples discussed so far refer to fetching data from memory. Additionally, cache also speeds up the write back stage, since the CPU can write data to the fast cache rather than to (slow) system memory. Thus the CPU can carry on with the next instructions, rather than waiting for data to arrive in memory.

Cache Controller and Coherency

Reading and writing to cache is handled by the cache controller. Circuitry within this controller maps memory addresses to locations in cache. How does the controller work? In simplistic terms, whenever data is required that is stored in main memory, the cache controller checks to see if this data is present in cache. If it is, then the data is transferred from cache to the CPU, rather than from memory to the CPU. The CPU itself is unaware of the distinction.

If the data to be fetched is not stored in cache, then it is read from main memory. At the same time, this data is copied into the cache for use in the future. In addition, temporal/spatial data is also copied to cache, as discussed earlier.

What happens when data needs to be written to memory, i.e. during a store? Essentially, this happens like a fetch in reverse. However, the controller must act appropriately if there is no room in the cache for new data. Bear in mind that the cache is much smaller than main memory, and so space for data is very limited. The controller determines which data currently stored in the cache will be replaced by new data being written. This depends on replacement algorithms which are pretty complicated in modern CPUs.

There are two ways in which new data can be written to cache when performing a store:

  1. Write-through technique:
    Here, the data is written to the cache and to the main memory at the same time.

  2. Write-back technique:
    With this mechanism, the data is only written to the cache at the time of the store phase. This has the advantage of being much quicker, since the CPU doesn't have to wait for the memory to be updated. The system memory is itself updated at a later stage, by copying the data from the cache.

One issue of using cache is what happens when a device writes to memory independent of the CPU, i.e. DMA. I/O devices can directly update memory using DMA. The result is that the data stored in the cache is no longer the same as the corresponding data in memory. Thus we need some mechanism to maintain cache coherency, i.e. to insure that data in cache and memory are not mismatched. Fortunately, the cache controller keeps tabs of such memory writes and invalides the data stored in the cache. In simple terms, this means that the data in the cache will be ignored in the next fetch.

The write-back approach, while potentially yielding higher performance, creates another related problem. This is the problem of I/O devices reading from memory using DMA. When using write-back, the data stored in cache will be correct, but this data may not yet have been copied to RAM. In this case, when a DMA transfer from this memory location is requested, the cache controller issues a cache flush, whereby the contents of the cache are written to memory.


Cache Organisation

Cache memory is expensive. The closer the cache is located relative to the CPU, the faster it will be. At the same time, the closer it is, the more expensive it will be to manufacture. Furthermore, CPU manufacturers do not want to significantly increase CPU costs by dedicating large portions of the die area to cache.

Modern CPUs (well, since the 486) have Level 1 (L1) cache. This L1 cache is so-named because it is physically located on the chip itself, near the processor core. This is often referred to as 'internal cache', 'on-chip' or 'on-die' cache. This type of cache can be accessed at the full clock speed of the CPU. In the ideal situation, data required by the CPU will be present in this L1 cache.

Due to the considerable expense of integrating L1 cache, it is present in very small quantities - typically just a few kilobytes, although the amount grows with each generation of CPU). CPUs prior to the Intel 486 did not have any L1 cache at all; 486s had 8kB, which was doubled to 16kB in the Pentium. Today's CPUs can have as much as 64kB or more.

Thus, modern systems also implement level 2 (L2) cache memory. L2, or 'secondary' cache was traditionally located on the motherboard, in the order of 256kB to 2MB. It was present on the motherboard as far back as the days of the 386. Since it was present on the motherboard and not on the CPU itself, L2 cache is not constrained in terms of space in the same way that L1 cache is and so L2 cache was/is much cheaper to fabricate. Thus we can have much more of it! However, it is much less effective for three main reasons: (1) The bus linking the CPU to L2 cache typically runs at only a fraction of that of the CPU (although in the days of the 386, the memory bus speed and CPU ran at the same frequency); (2) L2 cache memory is inherently slower than L1 cache; (3) L2 cache is located at a much greater distance from the CPU.

However, today's chips make more and more use of secondary cache located much closer to the chip itself. For example, the L2 cache of the Pentium Pro existed in the same packaging (although not the same integrated circuit as the CPU itself) and can therefore support a much faster bus speed than was previously possible. It should be noted that this 'local' L2 cache is also often referred to as 'internal', 'on-die' or 'on-chip', to distinguish it from L2 cache located on the motherboard. This is something of a misnomer because L2 cache in the same packaging is not the same as cache on the same integrated circuit as the processor itself.

To enable the Pentium Pro fast access to the L2 cache while not compromising access to main memory, the memory bus was split to give us dual independent bus (DIB) architecture. Now, main memory access is via the frontside bus (FSB) which runs at a fraction of the CPU's core clockspeed. L2 cache access is via a dedicated backside bus which, in the case of the Pentium Pro, runs at the full CPU clockspeed. On many CPUs, this is not the case and the backside bus runs at a fraction (typically a half) of the CPU speed. Even so, it is still much faster than accessing main memory.

The other advantage of this strategy is that providing a dedicated bus to cache means that memory bus (FSB) bandwidth can no longer be 'used-up' by cache reads/writes (and vice versa).

L2 cache will always be quicker than main memory. Ideally, if data required by the CPU is not present in the L1 cache, it will be present in the L2 cache. Failing that, the CPU will typically have to wait for the data to be sent from main memory.

Cache hierarcy

(In passing, we should also mention that the AMD K6-III pioneered the use of tri-level cache, with L1 internal cache, L2 internal cache and L3 cache located on the motherboard. For more on the specifics of cache implementation with respect to particular CPUs, make sure you check out the CPU History section.)

Modern CPUs fetch as much as 90% of their data requirement from cache. Consider the AMD Athlon CPU which has 9 pipelines (more on this in the CPU History section). Keeping all these pipelines full requires a whopping amount of cache which is why the Athlon comes equipped with 128kB of L1 cache. Compare the Intel Pentium III which has only 32kB... This also illustrates why the early Intel Celeron chips, which had no L2 cache at all (in order to reduce production costs) were such slow performers compared to the Pentium II CPU and rival CPUs that were available at the time.

Modern CPUs also separate the L1 cache into two separate stores: a data cache and an instruction cache. Of course, instructions are nothing more than a specialised form of data, but separating the cache in this way allows access to both instructions and data in parallel, thus yielding greater performance.


Hit and Miss

When the cache is able to satisfy a read request, we have what is called a cache hit. Conversely, if a read request can not be satisfied by cache, then the system has to resort to main memory. This is called a miss. The steps a processor must take in order to recover from a miss constitute what is known as the cache penalty. A few years ago, this equated to the time taken to read the data from main memory. However, modern CPUs using out-of-order execution (see later) and large L2 caches can quench this penalty significantly.


What's Next

The next page looks at some of the more advanced concepts of CPU Optimisation, such as out of order execution, branch prediction and speculative execution.