One way in which power efficiency can be increased is to reduce the amount of circuit activity required to carry out a computation. This chapter considers two particular ways in which the number of signal transitions that occur while executing a program can be reduced.
The first section is concerned with increasing code density. If code density can be increased then the number of signal transitions in the instruction memory and instruction fetch logic is reduced. The second section considers how datapath activity can be reduced and proposes that when operands are smaller than the whole width of the datapath only the required part of the datapath should be activated.
Code density is a measure of the compactness of encoding of a processor's instruction set. In this section the relationship between code density and power efficiency is shown, and ways in which code density can be altered through changes to the instruction set architecture are investigated. Much of this section is based on previous work by the author in [ENDE93].
The code density of a processor can be defined as the reciprocal of the amount of program code required to perform some standard benchmark task. There are two variants of this measure:

  - Static code density, which measures the amount of code that must be stored in memory to represent the program.

  - Dynamic code density, which measures the amount of instruction code fetched from memory while the program executes.
Static code density and dynamic code density tend to be correlated, but factors such as compiler optimisation can lead to differences. For example, if the compiler unrolls loops in the program the static code density will decrease but the dynamic code density will increase.
Static code density is important when questions such as the size of cache memories are considered; a processor with a higher static code density can obtain equivalent performance with a smaller cache. For questions related to power efficiency it is generally the dynamic code density that is most important.
As a first approximation power consumption in the memory and related parts of the system is inversely proportional to the dynamic code density. If dynamic code density could be doubled then the activity in the memory would be halved, and consequently its power consumption would be halved. The power consumption associated with the main memory system is particularly important as the main memory is typically on a different chip from the processor itself and inter-chip communication is less power efficient than on-chip communication.
Compared with previous generations, modern RISC processors have a relatively low code density. This has occurred for a number of reasons, including their regular fixed length instruction formats, their 3-address encodings, and the large fraction of each instruction devoted to register specifiers.
Previous work [ENDE93] measured the potential improvement in code density that could be achieved for a SPARC processor by changing certain aspects of its instruction set architecture. Please refer to appendix A for details of the benchmark programs used. The possible improvements are summarised in table 2.1; following sections interpret many of the values in this table. It can be seen that there is no single way in which the code density can be dramatically improved, though taken together a combination of these changes could have a substantial effect.
Table 2.1: Possible code density improvements

      Change                                         Relative code density
   1  Remove unused bit fields                       1.03
   2  Fixed length 5-bit opcode field                1.08
   3  Huffman encoded variable-length opcodes        1.25
   4  10-bit branch displacements                    1.07
   5  1, 3 and 10-bit add and subtract immediates    1.04
   6  2- and 3-address instructions                  1.05
   7  Explicit use of last result                    1.05
   8  Explicit generation and use of last result     1.08
   9  16 registers                                   1.01
  10  Workspace pointer register                     1.03
  11  Load and store multiple                        1.11

This section investigates two of these areas where SCALP makes improvements over conventional architectures. These areas are the use of variable length instructions and the code density of register specifiers. The final subsection describes a number of other processor designs that have higher than usual code density.
There are two ways in which the use of variable length instructions can improve code density:

  - Common instructions can be given short encodings and rarer ones longer encodings, as in the Huffman style opcode scheme of line 3 of table 2.1.

  - Instructions that occasionally need large fields, such as wide immediate values or branch displacements, need not force every instruction to be as long as the longest.
The reason why RISC processors avoid variable length instructions is that they are more difficult to decode than fixed length instructions. This is especially true when a superscalar processor needs to decode more than one instruction at once.
Figure 2.1 shows how two fixed length instructions can be decoded in parallel. In contrast, figure 2.2 shows how five variable length instructions occupying the same total number of bits must be decoded (the solid lines indicate instruction boundaries; the dotted lines indicate boundaries between parts of the same instruction). Because the decoder cannot tell in advance where the boundaries between instructions lie, it must consider the instruction parts serially, making the operation significantly slower.
There are ways in which decoding variable length instructions can be made easier. The simplest modification is to ensure that the first part of each instruction contains all of the control information to indicate the total length of the instruction. In this case the decoder can operate more quickly as shown in figure 2.3.
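This scheme can be sketched as follows; a minimal Python model, assuming a hypothetical encoding in which the top two bits of each instruction's first byte give its total length. The decoder still walks the stream instruction by instruction, but each step is quick because the length is known as soon as the first byte is seen.

```python
def decode_lengths(code):
    """Find instruction boundaries in a byte stream in which bits 7..6
    of each instruction's first byte encode its length (1 to 4 bytes).
    The boundary chain is still serial, but each link is a single
    field extraction rather than a full decode."""
    boundaries = []
    pc = 0
    while pc < len(code):
        length = (code[pc] >> 6) + 1   # hypothetical length field
        boundaries.append(pc)
        pc += length
    return boundaries

# Three instructions of lengths 2, 1 and 3 bytes:
stream = bytes([0x40, 0xAA, 0x00, 0x80, 0xBB, 0xCC])
print(decode_lengths(stream))  # [0, 2, 3]
```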
A more radical approach is indicated in figure 2.4. Here a "control field" is associated with each group of instructions. The control field is decoded first and indicates the layout of the instructions within the group to the decoder. The decoder is therefore able to decode all instructions in parallel once the control field has been decoded, making the decoding process nearly as fast as the fixed length instruction decoding. This is the technique used by SCALP.
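The group scheme can be sketched in the same style. The control field format below is hypothetical (four 2-bit length codes packed into one byte), not SCALP's actual encoding; the point is that every instruction's start offset falls out of the control field alone, so in hardware the instructions could be extracted in parallel.

```python
def decode_group(control, payload):
    """Split a group's payload into instructions using a control byte
    that packs four 2-bit length codes (length = code + 1 bytes).
    All start offsets are known once the control byte is decoded."""
    lengths = [((control >> (2 * i)) & 0x3) + 1 for i in range(4)]
    insts, pos = [], 0
    for n in lengths:
        insts.append(payload[pos:pos + n])
        pos += n
    return insts

# Lengths 2, 1, 3, 1 -> codes 1, 0, 2, 0 packed low nibble first:
control = (1 << 0) | (0 << 2) | (2 << 4) | (0 << 6)   # 0x21
group = decode_group(control, bytes([1, 2, 3, 4, 5, 6, 7]))
print([len(i) for i in group])  # [2, 1, 3, 1]
```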
RISC processors typically use up to about half of the bits in their instructions to represent register numbers. When other techniques, such as less redundant opcode encodings and variable length instructions, are applied, this fraction becomes even more significant. Reducing it is therefore important for increasing overall code density.
Simply reducing the number of registers is not a solution. Compilers are able to use 32 registers effectively and if the number is reduced the number of load and store instructions is increased.
Reducing the number of register specifiers per instruction is a strong possibility. It is common to find that the destination register is the same as one of the source registers; this was quantified in [ENDE93] (see appendix A) and is summarised in table 2.2. This form of encoding is known as a 2-address encoding rather than a 3-address encoding. Table 2.2 shows that in around one third of instructions one register specifier can be eliminated and a 2-address instruction used. Line 6 of table 2.1 indicates that this can lead to a 5 % code density increase.
Table 2.2: Register specifier usage

  Instruction category                     Example          % of instructions
  No register specifiers                   Branch                19.7
  One register specifier                   Move immediate        12.0
  Two register specifiers, equal           ADD R1,R1,#1          25.6
  Two register specifiers, not equal       ADD R1,R2,#1          32.4
  Three register specifiers, two equal     ADD R1,R1,R2           5.8
  Three register specifiers, not equal     ADD R1,R2,R3           4.4
  Total potential 2-address instructions                         31.4

Using 2-address instructions is a way of exploiting the locality of register specifiers: if a particular register is given by one register specifier in an instruction then there is an increased chance that it is given by another register specifier in the same instruction. This idea of locality can be taken further. It can be observed that if a register is used in one instruction it has an increased chance of being used by preceding or following instructions.
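The test behind these figures is simple to state; the following sketch (function name and operand representation invented for illustration) shows when a 3-address instruction can be shortened to 2-address form.

```python
def can_use_2_address(dest, src1, src2):
    """A 3-address instruction can use a 2-address encoding when the
    destination repeats one of the sources: the repeated register
    specifier need not be encoded twice."""
    return dest == src1 or dest == src2

print(can_use_2_address(1, 1, None))  # ADD R1,R1,#1 -> True
print(can_use_2_address(1, 2, 3))     # ADD R1,R2,R3 -> False
```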
It is very common for the result that has been computed by one instruction to be used by the immediately following instruction. By referring to "the result of the last instruction" in shorthand rather than by giving its register number a significant code density increase can be obtained. Table 2.3 summarises measurements of this property made in [ENDE93] (see appendix A).
Table 2.3: Reuse of the previous instruction's result

  Instruction category                                       % of instructions
  A: Use the result of the previous instruction                   43.8
  B: As A, but excluding branch instructions that use the
     boolean result of a preceding compare                        29.4
  C: As B, and the value used is not used by any subsequent
     instructions                                                 20.6

Line 7 of table 2.1 indicates that code density can be increased by 5 % by using an instruction encoding which indicates last result reuse in shorthand.
Line C of table 2.3 suggests how this idea can be taken further. Many instructions use the result of the previous instruction, and that result is subsequently never used again. In this case it is not necessary for either instruction to provide a register specifier for the value; both may use shorthand. Line 8 of table 2.1 indicates that both forms of shorthand together provide an 8 % code density increase.
It is possible to go further and to allow instructions to refer to the results of other preceding instructions, leading to further code density improvements. This idea forms the basis of the "explicit forwarding" mechanism used by SCALP which is described in chapter 5.
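The simplest form of the shorthand can be sketched as a rewriting pass over a program; the LAST token and the program representation below are invented for illustration, not SCALP's encoding.

```python
LAST = "last"   # hypothetical shorthand operand

def apply_last_result_shorthand(program):
    """Replace a source register with the LAST token whenever it names
    the destination of the immediately preceding instruction.  The
    shorthand needs fewer bits to encode than a full register number."""
    rewritten, prev_dest = [], None
    for dest, srcs in program:
        new_srcs = tuple(LAST if s == prev_dest else s for s in srcs)
        rewritten.append((dest, new_srcs))
        prev_dest = dest
    return rewritten

# r3 := r1+r2 ; r4 := r3+r1  -- the second use of r3 becomes LAST
prog = [("r3", ("r1", "r2")), ("r4", ("r3", "r1"))]
print(apply_last_result_shorthand(prog))
# [('r3', ('r1', 'r2')), ('r4', ('last', 'r1'))]
```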
Conventional RISC processors have relatively poor code density due to their regular fixed length instruction format. In comparison the preceding generations of CISC processors made use of a more dense variable length instruction encoding.
In addition to the CISC processors a number of other architectures have increased code density. This section considers three processor designs; D16 and Thumb use 16-bit RISC-like instructions whereas the transputer uses a variable length stack based instruction set.
[BUND94] proposes a 16-bit RISC-like instruction set called D16. This is based on DLXe which is in turn very similar to the MIPS architecture. Comparisons are made between the code density and power efficiency of D16 and DLXe.
D16's instruction set is relatively simple with only five instruction formats. The instructions specify at most two register specifiers and can access 16 registers.
The dynamic code density of D16 is found to be around 70 % higher than that of DLXe. However as the 16 bit instruction set contains less redundancy than the conventional instruction set, more bits on the instruction fetch bus change per cycle. For D16 on average 3.88 bits change per byte of instruction; for DLXe 2.85 bits change. This means that the 70 % code density increase leads to a power efficiency increase on the instruction fetch bus of only around 30 %.
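The relationship between these figures can be checked with a first-order model in which instruction fetch bus power is proportional to the number of bytes fetched multiplied by the transitions per byte.

```python
def fetch_power(bytes_per_task, transitions_per_byte):
    """First-order model: instruction fetch bus power scales with
    bytes fetched times bit transitions per byte."""
    return bytes_per_task * transitions_per_byte

# Figures quoted for the D16/DLXe comparison: D16 fetches 1/1.7 of
# the bytes but toggles 3.88 rather than 2.85 bits per byte.
dlxe = fetch_power(1.0, 2.85)
d16  = fetch_power(1.0 / 1.7, 3.88)
print(round(dlxe / d16, 2))  # about 1.25: the order of the ~30 % gain
```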
Thumb [ARM95] is an extension to the ARM architecture that extends the instruction set to include a set of 16-bit instructions. These 16 bit instructions are dynamically translated to ARM 32 bit instructions during decoding.
The Thumb instruction set is complex with 19 different instruction formats. Most operations use a two address format and can access 8 registers. The standard 32 bit ARM instruction set is available when necessary; for example the Thumb instruction set contains no floating point instructions.
The motivation for Thumb is to increase code density so that ROMs in embedded applications may be smaller, and so that performance can be increased when the processor is connected to a byte or 16 bit wide instruction fetch bus. The reduced instruction fetch requirement also leads to increased power efficiency. Thumb's code density is around 50 % higher than the standard ARM processor.
The transputer [MITC90] is an unusual processor with two key features that lead to increased code density.
Firstly the execution model is based on a stack rather than a register bank. Most instructions receive their operands from and write their results to a three entry stack. Other data must be stored in memory, though a small memory is integrated with the processor making this relatively efficient. Memory accesses may be relative to a workspace pointer.
Secondly short variable length instructions are used. Instructions are multiples of one byte long. The most common instructions using only implicit stack addressing without immediate values are encoded in a single byte; less common instructions and immediate values are encoded using "prefix instructions".
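The prefixing idea can be sketched for non-negative operands as follows (Python, following the transputer's pfix rule in which each prefix byte contributes four more operand bits; the nfix mechanism for negative operands is omitted from this sketch).

```python
def encode_operand(func, value):
    """Encode a transputer-style instruction byte (4-bit function
    code, 4-bit operand) preceded by 'pfix' bytes (function code 0x2)
    when the operand does not fit in 4 bits."""
    assert value >= 0, "nfix encoding for negatives not sketched here"
    nibbles = []
    while True:
        nibbles.append(value & 0xF)
        value >>= 4
        if value == 0:
            break
    out = [0x20 | n for n in reversed(nibbles[1:])]   # pfix bytes
    out.append((func << 4) | nibbles[0])              # the operation
    return bytes(out)

# A one-nibble operand fits in a single byte; a larger operand grows
# by one prefix byte per extra nibble:
print(encode_operand(0x4, 3).hex())      # '43'
print(encode_operand(0x4, 0x123).hex())  # '212243'
```

This is why the most common operations with small operands occupy a single byte, while rarer cases pay for their size only when it is actually needed.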
Conventional RISC processors normally have a 32 or 64 bit wide datapath consisting of the register bank, ALU, shifter, multiplexors, latches and so on. This datapath works efficiently when it is processing quantities of the full width, such as addresses, but many of the values dealt with by a program are significantly narrower. Examples include characters, which are encoded in only 8 bits, and booleans, which require only one. When a 32 bit datapath is used to operate on these smaller quantities much of the power consumed is wasted.
The solution is to add operations to the instruction set that operate on small quantities. When executing these instructions the processor need only activate a part of the datapath.
The power benefit of the reduced datapath activity must be balanced against the reduced code density resulting from the additional width information in the instructions. SCALP chooses the simplest possible scheme with each instruction using a single bit to indicate whether it operates on byte or word quantities.
Many earlier CISC processor architectures included datapath operations of more than one width. In some cases these features led to complications when advanced implementations of those architectures were considered. Consider, for example, a sequence of three instructions in which the first writes a full width value to a register, the second performs a byte wide operation that writes only the low 8 bits of the same register, and the third reads the full width of that register.
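A minimal model of the merge such a sequence demands follows; the register values and instruction sequence are illustrative, not taken from any particular CISC architecture.

```python
# Hypothetical sequence:
#   1: r1 := load          ; writes all 32 bits of r1
#   2: r1.byte := r1 + 1   ; byte-wide op: writes only bits 7..0
#   3: store r1            ; reads all 32 bits of r1
# With forwarding, instruction 3 must combine the forwarded 8-bit
# result of instruction 2 with bits 31..8 still held in the register
# bank from instruction 1.

def forwarded_operand(reg_bank_value, forwarded_byte):
    """The merge required when a full-width read follows a
    byte-wide write in a pipelined implementation."""
    return (reg_bank_value & 0xFFFFFF00) | (forwarded_byte & 0xFF)

r1_after_load = 0x12345678          # written by instruction 1
byte_result   = (0x78 + 1) & 0xFF   # instruction 2's 8-bit result
print(hex(forwarded_operand(r1_after_load, byte_result)))  # 0x12345679
```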
In a pipelined implementation, a forwarding path would be used to send the result of the second instruction to the third. However, the value that the third instruction receives must combine the 8 bits produced by the second instruction with the remaining 24 bits written by the first. The third instruction must therefore obtain part of the value from the forwarding path and part of it from the register bank. This is complex and undesirable.
This problem has recently been reported in the Intel P6 processor which is a superscalar implementation of the old 8086 architecture [GWEN95].
To avoid this type of problem SCALP defines the most significant bits of a value to be undefined when an instruction has operated on only the least significant byte.
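The effect of the rule can be sketched as follows (hypothetical helper names): a byte operation defines only the low 8 bits, so any consumer of a byte result looks only at those bits, and a forwarding path never has to merge partial values.

```python
BYTE_MASK = 0xFF

def byte_add(a, b):
    """SCALP-style byte operation: only bits 7..0 of the result are
    defined.  The upper bits of the destination are left undefined,
    so the forwarding path need only carry the low byte."""
    return (a + b) & BYTE_MASK

def read_byte(value):
    """A consumer of a byte result may rely only on the low byte."""
    return value & BYTE_MASK

r = byte_add(0x78, 1)
print(hex(read_byte(r)))  # 0x79
```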