One way in which power efficiency can be increased is to reduce the amount of circuit activity required to carry out a computation. This chapter considers two particular ways in which the number of signal transitions that occur while executing a program can be reduced.
The first section is concerned with increasing code density. If code density can be increased then the number of signal transitions in the instruction memory and instruction fetch logic is reduced. The second section considers how datapath activity can be reduced and proposes that when operands are smaller than the whole width of the datapath only the required part of the datapath should be activated.
Code density is a measure of the compactness of encoding of a processor's instruction set. In this section the relationship between code density and power efficiency is shown, and ways in which code density can be altered through changes to the instruction set architecture are investigated. Much of this section is based on previous work by the author in [ENDE93].
The code density of a processor can be defined as the reciprocal of the amount of program code required to perform some standard benchmark task. There are two variants of this measure:

  - Static code density, which measures the amount of code that must be stored in memory to represent the program.

  - Dynamic code density, which measures the amount of instruction code fetched from memory while the program executes.
Static code density and dynamic code density tend to be correlated, but factors such as compiler optimisation can lead to differences. For example, if the compiler unrolls loops in the program the static code density will decrease but the dynamic code density will increase.
Static code density is important when questions such as the size of cache memories are considered; a processor with a higher static code density can obtain equivalent performance with a smaller cache. For questions related to power efficiency it is generally the dynamic code density that is most important.
As a first approximation power consumption in the memory and related parts of the system is inversely proportional to the dynamic code density. If dynamic code density could be doubled then the activity in the memory would be halved, and consequently its power consumption would be halved. The power consumption associated with the main memory system is particularly important as the main memory is typically on a different chip from the processor itself and inter-chip communication is less power efficient than on-chip communication.
Compared with previous generations, modern RISC processors have a relatively low code density. This has occurred for a number of reasons, including their regular fixed length instruction formats, their 3-address encodings, and the large fraction of each instruction devoted to register specifiers.
Previous work [ENDE93] measured the potential improvement in code density that could be achieved for a SPARC processor by changing certain aspects of its instruction set architecture. Please refer to appendix A for details of the benchmark programs used. The possible improvements are summarised in table 2.1; following sections interpret many of the values in this table. It can be seen that there is no single way in which the code density can be dramatically improved, though taken together a combination of these changes could have a substantial effect.
Table 2.1: Possible code density improvements

      Change                                         Relative code density
   1  Remove unused bit fields                       1.03
   2  Fixed length 5-bit opcode field                1.08
   3  Huffman encoded variable-length opcodes        1.25
   4  10-bit branch displacements                    1.07
   5  1, 3 and 10-bit add and subtract immediates    1.04
   6  2- and 3-address instructions                  1.05
   7  Explicit use of last result                    1.05
   8  Explicit generation and use of last result     1.08
   9  16 registers                                   1.01
  10  Workspace pointer register                     1.03
  11  Load and store multiple                        1.11

This section investigates two of these areas where SCALP makes improvements over conventional architectures. These areas are the use of variable length instructions and the code density of register specifiers. The final subsection describes a number of other processor designs that have higher than usual code density.
There are two ways in which the use of variable length instructions can improve code density:

  - Common instructions can be given short encodings and rarer ones longer encodings, as in the Huffman style opcode scheme of line 3 of table 2.1.

  - Instructions that occasionally need large fields, such as wide immediate values or branch displacements, need not force every instruction to be as long as the longest.
The reason why RISC processors avoid variable length instructions is that they are more difficult to decode than fixed length instructions. This is especially true when a superscalar processor needs to decode more than one instruction at once.
Figure 2.1 shows how two fixed length instructions can be decoded in parallel. In contrast, figure 2.2 shows how five variable length instructions occupying the same total number of bits must be decoded (the solid lines indicate instruction boundaries; the dotted lines indicate boundaries between parts of the same instruction). Because the decoder cannot tell in advance where the boundaries between instructions lie, it must consider the instruction parts serially, making the operation significantly slower.
There are ways in which decoding variable length instructions can be made easier. The simplest modification is to ensure that the first part of each instruction contains all of the control information to indicate the total length of the instruction. In this case the decoder can operate more quickly as shown in figure 2.3.
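This scheme can be sketched as follows; a minimal Python model, assuming a hypothetical encoding in which the top two bits of each instruction's first byte give its total length. The decoder still walks the stream instruction by instruction, but each step is quick because the length is known as soon as the first byte is seen.

```python
def decode_lengths(code):
    """Find instruction boundaries in a byte stream in which bits 7..6
    of each instruction's first byte encode its length (1 to 4 bytes).
    The boundary chain is still serial, but each link is a single
    field extraction rather than a full decode."""
    boundaries = []
    pc = 0
    while pc < len(code):
        length = (code[pc] >> 6) + 1   # hypothetical length field
        boundaries.append(pc)
        pc += length
    return boundaries

# Three instructions of lengths 2, 1 and 3 bytes:
stream = bytes([0x40, 0xAA, 0x00, 0x80, 0xBB, 0xCC])
print(decode_lengths(stream))  # [0, 2, 3]
```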
A more radical approach is indicated in figure 2.4. Here a "control field" is associated with each group of instructions. The control field is decoded first and indicates the layout of the instructions within the group to the decoder. The decoder is therefore able to decode all instructions in parallel once the control field has been decoded, making the decoding process nearly as fast as the fixed length instruction decoding. This is the technique used by SCALP.
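The group scheme can be sketched in the same style. The control field format below is hypothetical (four 2-bit length codes packed into one byte), not SCALP's actual encoding; the point is that every instruction's start offset falls out of the control field alone, so in hardware the instructions could be extracted in parallel.

```python
def decode_group(control, payload):
    """Split a group's payload into instructions using a control byte
    that packs four 2-bit length codes (length = code + 1 bytes).
    All start offsets are known once the control byte is decoded."""
    lengths = [((control >> (2 * i)) & 0x3) + 1 for i in range(4)]
    insts, pos = [], 0
    for n in lengths:
        insts.append(payload[pos:pos + n])
        pos += n
    return insts

# Lengths 2, 1, 3, 1 -> codes 1, 0, 2, 0 packed low nibble first:
control = (1 << 0) | (0 << 2) | (2 << 4) | (0 << 6)   # 0x21
group = decode_group(control, bytes([1, 2, 3, 4, 5, 6, 7]))
print([len(i) for i in group])  # [2, 1, 3, 1]
```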
RISC processors typically use up to about half of the bits in their instructions to represent register numbers. When other techniques, such as less redundant opcode encodings and variable length instructions, are applied, this fraction becomes even more significant. Reducing it is therefore important for increasing overall code density.
Simply reducing the number of registers is not a solution. Compilers are able to use 32 registers effectively and if the number is reduced the number of load and store instructions is increased.
Reducing the number of register specifiers per instruction is a strong possibility. It is common to find that the destination register is the same as one of the source registers; this was quantified in [ENDE93] (see appendix A) and is summarised in table 2.2. This form of encoding is known as a 2-address encoding rather than a 3-address encoding. Table 2.2 shows that in around one third of instructions one register specifier can be eliminated and a 2-address instruction used. Line 6 of table 2.1 indicates that this can lead to a 5 % code density increase.
Table 2.2: Register specifier usage

  Instruction category                     Example          % of instructions
  No register specifiers                   Branch                19.7
  One register specifier                   Move immediate        12.0
  Two register specifiers, equal           ADD R1,R1,#1          25.6
  Two register specifiers, not equal       ADD R1,R2,#1          32.4
  Three register specifiers, two equal     ADD R1,R1,R2           5.8
  Three register specifiers, not equal     ADD R1,R2,R3           4.4
  Total potential 2-address instructions                         31.4

Using 2-address instructions is a way of exploiting the locality of register specifiers: if a particular register is given by one register specifier in an instruction then there is an increased chance that it is given by another register specifier in the same instruction. This idea of locality can be taken further. It can be observed that if a register is used in one instruction it has an increased chance of being used by preceding or following instructions.
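The test behind these figures is simple to state; the following sketch (function name and operand representation invented for illustration) shows when a 3-address instruction can be shortened to 2-address form.

```python
def can_use_2_address(dest, src1, src2):
    """A 3-address instruction can use a 2-address encoding when the
    destination repeats one of the sources: the repeated register
    specifier need not be encoded twice."""
    return dest == src1 or dest == src2

print(can_use_2_address(1, 1, None))  # ADD R1,R1,#1 -> True
print(can_use_2_address(1, 2, 3))     # ADD R1,R2,R3 -> False
```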
It is very common for the result that has been computed by one instruction to be used by the immediately following instruction. By referring to "the result of the last instruction" in shorthand rather than by giving its register number a significant code density increase can be obtained. Table 2.3 summarises measurements of this property made in [ENDE93] (see appendix A).
Table 2.3: Reuse of the previous instruction's result

  Instruction category                                       % of instructions
  A: Use the result of the previous instruction                   43.8
  B: As A, but excluding branch instructions that use the
     boolean result of a preceding compare                        29.4
  C: As B, and the value used is not used by any subsequent
     instructions                                                 20.6

Line 7 of table 2.1 indicates that code density can be increased by 5 % by using an instruction encoding which indicates last result reuse in shorthand.
Line C of table 2.3 suggests how this idea can be taken further. Many instructions use the result of the previous instruction, and that result is subsequently never used again. In this case it is not necessary for either instruction to provide a register specifier for the value; both may use shorthand. Line 8 of table 2.1 indicates that both forms of shorthand together provide an 8 % code density increase.
It is possible to go further and to allow instructions to refer to the results of other preceding instructions, leading to further code density improvements. This idea forms the basis of the "explicit forwarding" mechanism used by SCALP which is described in chapter 5.
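The simplest form of the shorthand can be sketched as a rewriting pass over a program; the LAST token and the program representation below are invented for illustration, not SCALP's encoding.

```python
LAST = "last"   # hypothetical shorthand operand

def apply_last_result_shorthand(program):
    """Replace a source register with the LAST token whenever it names
    the destination of the immediately preceding instruction.  The
    shorthand needs fewer bits to encode than a full register number."""
    rewritten, prev_dest = [], None
    for dest, srcs in program:
        new_srcs = tuple(LAST if s == prev_dest else s for s in srcs)
        rewritten.append((dest, new_srcs))
        prev_dest = dest
    return rewritten

# r3 := r1+r2 ; r4 := r3+r1  -- the second use of r3 becomes LAST
prog = [("r3", ("r1", "r2")), ("r4", ("r3", "r1"))]
print(apply_last_result_shorthand(prog))
# [('r3', ('r1', 'r2')), ('r4', ('last', 'r1'))]
```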
Conventional RISC processors have relatively poor code density due to their regular fixed length instruction format. In comparison the preceding generations of CISC processors made use of a more dense variable length instruction encoding.
In addition to the CISC processors a number of other architectures have increased code density. This section considers three processor designs; D16 and Thumb use 16-bit RISC-like instructions whereas the transputer uses a variable length stack based instruction set.
[BUND94] proposes a 16-bit RISC-like instruction set called D16. This is based on DLXe which is in turn very similar to the MIPS architecture. Comparisons are made between the code density and power efficiency of D16 and DLXe.
D16's instruction set is relatively simple with only five instruction formats. The instructions specify at most two register specifiers and can access 16 registers.
The dynamic code density of D16 is found to be around 70 % higher than that of DLXe. However as the 16 bit instruction set contains less redundancy than the conventional instruction set, more bits on the instruction fetch bus change per cycle. For D16 on average 3.88 bits change per byte of instruction; for DLXe 2.85 bits change. This means that the 70 % code density increase leads to a power efficiency increase on the instruction fetch bus of only around 30 %.
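The relationship between these figures can be checked with a first-order model in which instruction fetch bus power is proportional to the number of bytes fetched multiplied by the transitions per byte.

```python
def fetch_power(bytes_per_task, transitions_per_byte):
    """First-order model: instruction fetch bus power scales with
    bytes fetched times bit transitions per byte."""
    return bytes_per_task * transitions_per_byte

# Figures quoted for the D16/DLXe comparison: D16 fetches 1/1.7 of
# the bytes but toggles 3.88 rather than 2.85 bits per byte.
dlxe = fetch_power(1.0, 2.85)
d16  = fetch_power(1.0 / 1.7, 3.88)
print(round(dlxe / d16, 2))  # about 1.25: the order of the ~30 % gain
```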
Thumb [ARM95] is an extension to the ARM architecture that extends the instruction set to include a set of 16-bit instructions. These 16 bit instructions are dynamically translated to ARM 32 bit instructions during decoding.
The Thumb instruction set is complex with 19 different instruction formats. Most operations use a two address format and can access 8 registers. The standard 32 bit ARM instruction set is available when necessary; for example the Thumb instruction set contains no floating point instructions.
The motivation for Thumb is to increase code density so that ROMs in embedded applications may be smaller, and so that performance can be increased when the processor is connected to a byte or 16 bit wide instruction fetch bus. The reduced instruction fetch requirement also leads to increased power efficiency. Thumb's code density is around 50 % higher than the standard ARM processor.
The transputer [MITC90] is an unusual processor with two key features that lead to increased code density.
Firstly the execution model is based on a stack rather than a register bank. Most instructions receive their operands from and write their results to a three entry stack. Other data must be stored in memory, though a small memory is integrated with the processor making this relatively efficient. Memory accesses may be relative to a workspace pointer.
Secondly short variable length instructions are used. Instructions are multiples of one byte long. The most common instructions using only implicit stack addressing without immediate values are encoded in a single byte; less common instructions and immediate values are encoded using "prefix instructions".
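The prefixing idea can be sketched for non-negative operands as follows (Python, following the transputer's pfix rule in which each prefix byte contributes four more operand bits; the nfix mechanism for negative operands is omitted from this sketch).

```python
def encode_operand(func, value):
    """Encode a transputer-style instruction byte (4-bit function
    code, 4-bit operand) preceded by 'pfix' bytes (function code 0x2)
    when the operand does not fit in 4 bits."""
    assert value >= 0, "nfix encoding for negatives not sketched here"
    nibbles = []
    while True:
        nibbles.append(value & 0xF)
        value >>= 4
        if value == 0:
            break
    out = [0x20 | n for n in reversed(nibbles[1:])]   # pfix bytes
    out.append((func << 4) | nibbles[0])              # the operation
    return bytes(out)

# A one-nibble operand fits in a single byte; a larger operand grows
# by one prefix byte per extra nibble:
print(encode_operand(0x4, 3).hex())      # '43'
print(encode_operand(0x4, 0x123).hex())  # '212243'
```

This is why the most common operations with small operands occupy a single byte, while rarer cases pay for their size only when it is actually needed.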
Conventional RISC processors normally have a 32 or 64 bit wide datapath consisting of the register bank, ALU, shifter, multiplexors, latches and so on. This datapath works efficiently when it is processing quantities of the full width, such as addresses, but many of the values dealt with by a program are significantly narrower. Examples include characters, which are encoded in only 8 bits, and booleans, which require only one. When a 32 bit datapath is used to operate on these smaller quantities much of the power consumed is wasted.
The solution is to add operations to the instruction set that operate on small quantities. When executing these instructions the processor need only activate a part of the datapath.
The power benefit of the reduced datapath activity must be balanced against the reduced code density resulting from the additional width information in the instructions. SCALP chooses the simplest possible scheme with each instruction using a single bit to indicate whether it operates on byte or word quantities.
Many earlier CISC processor architectures included datapath operations of more than one width. In some cases these features led to complications when advanced implementations of those architectures were considered. Consider, for example, a sequence of three instructions in which the first writes a full width value to a register, the second performs a byte wide operation that writes only the low 8 bits of the same register, and the third reads the full width of that register.
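A minimal model of the merge such a sequence demands follows; the register values and instruction sequence are illustrative, not taken from any particular CISC architecture.

```python
# Hypothetical sequence:
#   1: r1 := load          ; writes all 32 bits of r1
#   2: r1.byte := r1 + 1   ; byte-wide op: writes only bits 7..0
#   3: store r1            ; reads all 32 bits of r1
# With forwarding, instruction 3 must combine the forwarded 8-bit
# result of instruction 2 with bits 31..8 still held in the register
# bank from instruction 1.

def forwarded_operand(reg_bank_value, forwarded_byte):
    """The merge required when a full-width read follows a
    byte-wide write in a pipelined implementation."""
    return (reg_bank_value & 0xFFFFFF00) | (forwarded_byte & 0xFF)

r1_after_load = 0x12345678          # written by instruction 1
byte_result   = (0x78 + 1) & 0xFF   # instruction 2's 8-bit result
print(hex(forwarded_operand(r1_after_load, byte_result)))  # 0x12345679
```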
In a pipelined implementation, a forwarding path would be used to send the result of the second instruction to the third. However, the value that the third instruction receives must combine the 8 bits produced by the second instruction with the remaining 24 bits written by the first. The third instruction must therefore obtain part of the value from the forwarding path and part of it from the register bank. This is complex and undesirable.
This problem has recently been reported in the Intel P6 processor which is a superscalar implementation of the old 8086 architecture [GWEN95].
To avoid this type of problem SCALP defines the most significant bits of a value to be undefined when an instruction has operated on only the least significant byte.
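The effect of the rule can be sketched as follows (hypothetical helper names): a byte operation defines only the low 8 bits, so any consumer of a byte result looks only at those bits, and a forwarding path never has to merge partial values.

```python
BYTE_MASK = 0xFF

def byte_add(a, b):
    """SCALP-style byte operation: only bits 7..0 of the result are
    defined.  The upper bits of the destination are left undefined,
    so the forwarding path need only carry the low byte."""
    return (a + b) & BYTE_MASK

def read_byte(value):
    """A consumer of a byte result may rely only on the low byte."""
    return value & BYTE_MASK

r = byte_add(0x78, 1)
print(hex(read_byte(r)))  # 0x79
```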