Optimisation Strategies - Pipelining
The following topics will be discussed in this section:
Pipelining
One way to improve the instruction:clock cycle ratio is by pipelining. A pipeline, in principle, is a bit like a factory production line, where a given section of the production line is dedicated to a particular task. The best way to describe pipelining in a CPU is with the aid of a diagram. The simple representation below shows a theoretical 5-stage pipeline. Let's call the first instruction to be executed 'Instruction A'. During the first clock cycle, a unit I have generically described as the 'Instruction Unit' carries out the job of fetching the instruction(s) from memory or cache. This is the first stage of our pipeline.
On the second clock cycle (the second stage of our pipeline), the decode unit carries out the instruction decode, while at the same time the fetch step of Instruction B is processed, back in the first stage (the IU) of the pipeline. Further instructions keep entering the pipeline until it is full.
| Instruction fetch (IU) |
Instruction decode (ID) |
Operand fetch (OF) |
Instruction execute (IE) |
Write back |
|
| Cycle 1 | AFetch | ||||
|---|---|---|---|---|---|
| Cycle 2 | BFetch | ADecode | |||
| Cycle 3 | CFetch | BDecode | AGenerate | ||
| Cycle 4 | DFetch | CDecode | BGenerate | AExecute | |
| Cycle 5 | EFetch | DDecode | CGenerate | BExecute | AWrite * |
* - after this stage Instruction A has been completed
An instruction is not said to be complete until it pops off the other end of the pipeline. In theory, if the pipeline remains full, during each clock cycle one instruction will be fetched, another will be decoded, another will executed, and so on. The net result is that for every clock cycle, one instruction will enter the pipeline while another one will pop off the end. In this way we achieve a 1:1 ratio of instructions per clock cycle.
To ensure that this process works efficiently, there are a couple of prerequisites. Firstly, each instruction step should be simple enough that it can be completed within a single clock cycle. This suggests a RISC implementation. Otherwise a particular step could bottleneck the pipeline. Assuming that this is true, we can conclude that the length of the pipeline is not particularly important; the crucial point is that it should always be full.
Furthermore, if one stage of the pipeline takes longer than another, then it follows that one or more stages of the pipeline will spend some time idle. Thus in the ideal situation, all pipeline stages would require the same amount of time.
The astute reader may have noticed that by saying "each instruction step should be simple enough that it can be completed within a single clock cycle" essentially defines the duration of a CPU's clock cycle. In short, the duration (i.e. time taken) for a clock cycle is limited by the most complicated step of the pipeline. The way around this limitation is to make a complex step more simple, by splitting it into smaller steps. Thus the duration of a cyle can be reduced and the clock speed may be increased.
It will come as no surprise, therefore, that a single phase of the instruction process can not normally be completed in a single stage. For this reason, we find that pipelines in modern CPUs have several stages dedicated to particular instruction steps. This deepening of the pipeline is called superpipelining.
One may therefore ask, How long should a pipeline be? In theory, the longer the pipeline, the shorter the duration of each stage. Taking this to the limit, an infinitely long pipeline would have a clock cycle that completes in no time at all! I.e. the clock speed would also be infinite! Of course, each stage can only be simplified so much, so this imposes a lower limit on the duration of a clock cycle (and an upper limit on the clockspeed); i.e. the clock cycle will take as long as the most complicated step. The downside is that long pipelines suffer larger performance penalties in the event of a pipeline stall. This is what happens when the pipeline does not remain full or when incorrect instructions are inserted into the pipeline. This will be discussed in more detail on the following pages.
Superscalar Architecture
Before I continue with an in-depth discussion of
how to optimise the use of the pipeline,
it is worth bearing in mind that modern chips
have a so-called Superscalar architecture.
This means that pipeline stages are duplicated. For example,
the core may have more than one execution unit, and usually
duplicates other parts of the pipeline as well, resulting in a 'multi-pipeline'
architecture.
In theory, two pipelines running in parallel can allow 2 instructions per clock cycle, provided there are no dependencies between the threads in each pipeline.
I have tried to demonstrate this idea with the diagram. Note that this diagram is a simplification: individual 'units' may actually consist of more than one pipeline stage. For example, the floating point unit may itself consist of several stages, and this number of stages need not the number that makes up the integer units.
Note: sometimes superscalar is defined as an architecture capable of delivering an instruction:clock cycle ratio of greater than 1.
Keeping the Pipeline Full
Obviously, if the pipeline does not remain full, instructions will not be popping off the pipeline during these cycles, thereby reducing the instructions/cycle ratio. There are several barriers to keeping the pipeline full and many of the CPU optimisation strategies discussed in the next few pages have one goal: to keep the pipeline full.
For example, as mentioned on the previous page, fetching data from main memory is slow and thus the instruction fetch phase of the pipeline would form a bottleneck were it not for the instruction queue and cache.
What's next
The next page looks at cache.
| Just Too Good Last updated: June, 2006 (DJL) |
Drop me a line