Chapter 1: SCALP: A Superscalar Asynchronous Low-Power Processor

Designing a superscalar microprocessor is a complex and challenging task. Doing so using asynchronous logic - the subject of this thesis - makes it doubly difficult. So what is it that makes such a processor desirable? The answer is that superscalar and asynchronous operation, along with a number of other factors, should help to make SCALP a highly power efficient processor.

Power efficiency is becoming an increasingly important factor in digital design, particularly as a result of the proliferation of portable battery-operated applications. Proposed design techniques for low power systems are discussed later in this chapter. These techniques have been applied with particular success in special purpose signal processing applications such as the Philips Digital Compact Cassette Player [BERK95] and the Berkeley InfoPad [CHAN94]. In the case of general purpose processors, however, designers have been less adventurous.

Many constraints influence the designer of a general purpose microprocessor. Often it is necessary to maintain code compatibility with a previous design. If absolute code compatibility is not necessary then the processor is expected to follow a model that assembly language programmers and compiler code generators are used to. At the very least an execution model that is a suitable target for common high level languages is needed.

Other constraints limit the choice of implementation technology. Compatibility with existing memory and interface circuits requires that the processor operates in a conventional synchronous fashion at a fixed standard supply voltage. All these restrictions limit the designer's ability to apply novel and radical ideas that may have benefits in terms of power efficiency or other factors.

This work has suffered no such impediment. SCALP, the processor described here, deviates significantly from the conventional idea of a microprocessor in order to try architectural and implementation ideas that may contribute towards increased power efficiency.

This first chapter starts by providing a background explaining why power efficiency is becoming increasingly important. Proposed techniques for improving power efficiency are reviewed, and finally an overview of the remainder of the thesis is presented.

1.1 The Importance of Power Efficiency

There are several reasons why power efficiency is becoming increasingly important. Most importantly, portable systems powered by batteries are performing tasks requiring increasing computational performance. At the same time these systems are becoming physically smaller and battery weight is becoming more significant. Users demand longer battery life, and this can only be obtained by increasing the capacity of the battery or by increasing the efficiency of the logic. The rate of progress in battery technology is slow, so the onus is on the digital designer to improve efficiency.

There are other reasons why power is becoming important. As the heat output of chips increases it becomes more difficult and expensive to provide sufficient cooling in the form of special packages, heat sinks and fans. Furthermore, increased temperatures lead to greater stress on the components and reduced reliability.

Electrical issues are also important. Providing a supply of sufficient capacity requires a large number of bond wires between the chip and the package, and a large proportion of the potential signal routing space is occupied by power distribution. High current densities can lead to electromigration. At the system level, increased power demands larger and more expensive power supplies.

Even the cost of the electricity used can be important. It is estimated that by the end of the decade 10 % of U.S. non-domestic electricity consumption will be due to personal computers. Reducing this contribution would have significant economic and environmental benefits.

These factors, and others described in [ENDE93], combine to make power efficiency an increasingly important factor in the design of digital systems.

1.2 Low Power Design Techniques

This section reviews techniques that have been suggested for improving power efficiency. These techniques apply at a number of levels from transistors through to high level design. The savings obtained at each of these levels tend to combine multiplicatively, so a successful low power system must employ techniques at all levels.

The descriptions in the following sections apply to CMOS logic unless otherwise indicated.

1.2.1 Low-Level Power Efficiency

Process Technology

The ever-diminishing feature size possible with successive generations of fabrication processes is greatly beneficial to power efficiency. As feature sizes shrink capacitances are reduced and gates become faster. As a first approximation the energy per gate transition is proportional to the cube of the feature size and the gate delay is directly proportional to the feature size if the supply voltage is scaled with the feature size [WEST93].
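As a minimal illustration of these first-order scaling rules, the sketch below applies the cubic energy scaling and linear delay scaling to a linear shrink factor. The baseline figures are arbitrary units, not real process data:

```python
# First-order constant-field scaling: energy per gate transition scales
# with the cube of the feature size, and gate delay scales linearly,
# assuming the supply voltage is scaled along with the feature size.

def scaled(energy, delay, shrink):
    """Scale a gate's energy and delay by a linear shrink factor < 1."""
    return energy * shrink ** 3, delay * shrink

base_energy, base_delay = 1.0, 1.0      # arbitrary baseline units
energy, delay = scaled(base_energy, base_delay, 0.5)
print(energy, delay)   # halving the feature size: 1/8 energy, 1/2 delay
```

With a shrink to half the feature size, energy per transition falls to one eighth while gates become twice as fast, which is why process migration alone has historically delivered large power-efficiency gains.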

As feature sizes continue to scale interconnect delays become increasingly significant. Interconnect delays do not scale with feature size or with supply voltage and so eventually limit the available performance.

Improving process technology also allows increasingly large circuits to be constructed on a single chip. Inter-chip connections are far slower and consume far more power than on-chip connections, so reducing the number of inter-chip connections needed improves the power efficiency of the system.

Transistor Level Design

The size of the transistors used to build basic logic gates affects their speed and power consumption. One objective is to obtain fast output transitions, since short-circuit current flows in the driven gate while its input is at an intermediate voltage. On the other hand it is desirable to minimise transistor sizes in order to reduce capacitances. Recent work [FARN95] has investigated the ideal transistor sizes for greatest power efficiency.

In some cases asymmetric gates can be constructed; some input to output paths are faster and are used for critical paths while others are more power efficient and are used for less critical paths. The scope is greatest in the case of complex gates where the order of the transistors can be chosen based on known input transition orders and probabilities [GLEB95].

Charge Recovery and Adiabatic Systems

Conventional circuits charge nodes that must become high by connecting them to the positive supply and discharge nodes that must become low by connecting them to the negative supply. This mode of operation leads to the conventional CMOS power consumption equation:

	P ∝ C · Vsupply²

Other schemes use different techniques. Adiabatic systems [SOLO94] use inductors, capacitors and sinusoidal clock / power signals to charge nodes in a more efficient fashion such that

	P ∝ C · Vsupply · Vthreshold

More readily implementable schemes use charge sharing between signals; signals that are high and wanting to become low are connected to signals that are low and wanting to become high, moving all nodes to an intermediate level. Subsequently the power supply is used to complete charging or discharging the nodes [KHOO95].
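The difference between the two schemes can be seen in a first-order sketch. Assuming equal node capacitances and ignoring threshold and resistive losses, charge sharing halves the energy drawn from the supply per rail-to-rail transition:

```python
# Energy drawn from the supply per signal transition: conventional
# rail-to-rail charging versus simple two-node charge sharing.
# First-order model with equal node capacitances; illustrative only.

def conventional(c, v):
    """Charging a node C from 0 to V draws C*V^2 from the supply."""
    return c * v * v

def charge_shared(c, v):
    """Share charge between a high node and a low node first (both
    reach V/2 without drawing supply current), then use the supply
    only to complete the swing V/2 -> V: charge C*V/2 delivered at V."""
    return c * (v / 2) * v

c, v = 1e-12, 3.3   # 1 pF node, 3.3 V supply
print(charge_shared(c, v) / conventional(c, v))   # -> 0.5
```
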

Logic Optimisation

For conventional CMOS the majority of the power consumption is dynamic; that is power is consumed only when signals change. Power efficiency can therefore be improved by reducing the probability that signals will change.

In many cases the most common mode of operation of a block is known, for example a typical sequence of states for a state machine or a typical set of inputs for a combinational block. Low power consumption in this mode can be an objective of the design process. Algorithms have been proposed that provide state encodings and implementations for optimum power efficiency for state machines based on typical state sequences [OLSO94].
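As a sketch of the idea - the state machine, typical sequence and candidate encodings below are all hypothetical - the expected switching activity of an encoding can be estimated by summing Hamming distances along the typical state sequence:

```python
# Switching-activity estimate for state-machine encodings: count flip-flop
# toggles along a typical state sequence under each candidate encoding.
# A Gray-style encoding of a common cyclic sequence toggles fewer bits
# than plain binary numbering.

def toggles(sequence, encoding):
    """Total Hamming distance along a state sequence for an encoding."""
    total = 0
    for a, b in zip(sequence, sequence[1:]):
        total += bin(encoding[a] ^ encoding[b]).count("1")
    return total

seq = ["S0", "S1", "S2", "S3", "S0"]          # typical cyclic behaviour
binary = {"S0": 0b00, "S1": 0b01, "S2": 0b10, "S3": 0b11}
gray   = {"S0": 0b00, "S1": 0b01, "S2": 0b11, "S3": 0b10}
print(toggles(seq, binary), toggles(seq, gray))   # -> 6 4
```

An optimisation algorithm of the kind cited would search the space of encodings to minimise exactly this kind of weighted activity measure.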

Supply Voltage Adjustment

Optimum power efficiency occurs when a circuit is powered from the lowest supply voltage at which it can provide the necessary performance. Recent work [USAM95] has suggested providing multiple supply voltages to a system. Gates on the circuit's critical path are operated from the higher voltage and those not on the critical path are operated from the lower voltage. Other work [NIEL94] [KESS95] proposes that the power supply to the whole circuit is dynamically adjusted in response to the varying workload (see section 4.1.4).
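The selection of the lowest adequate supply voltage can be sketched with a first-order CMOS delay model, delay ∝ V / (V − Vt)². The threshold voltage, delay constant and candidate supply rails below are illustrative assumptions, not data from any cited design:

```python
# Supply-voltage selection: choose the lowest available rail at which a
# first-order delay model still meets the required cycle time.

VT = 0.7     # assumed threshold voltage (V)
K = 1.0      # delay constant, arbitrary units

def delay(v):
    return K * v / (v - VT) ** 2

def lowest_voltage(max_delay, candidates):
    """Lowest candidate supply voltage whose gate delay meets the deadline."""
    for v in sorted(candidates):
        if delay(v) <= max_delay:
            return v
    return max(candidates)   # no rail meets it: fall back to full speed

rails = [1.2, 1.8, 2.5, 3.3]
# With a deadline twice the full-voltage delay, a lower rail suffices:
print(lowest_voltage(delay(3.3) * 2, rails))
```

A dynamically adjusted supply, as proposed in the cited work, would in effect re-evaluate this choice continuously as the workload varies.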

Parallelism

Increasing the parallelism in the system permits individual gates to operate more slowly while maintaining the same overall throughput. Slower operation may be performed at a lower supply voltage, increasing power efficiency [CHAN94] (see chapter 3).
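A rough sketch of this trade-off, using the E ∝ C·V² relation from above; the voltage and capacitance-overhead figures are illustrative assumptions:

```python
# Parallelism-for-voltage trade-off: N parallel units each run at 1/N of
# the rate, allowing the supply voltage to drop; energy per operation
# falls as V^2 even though total switched capacitance rises.

def energy_per_op(c_switched, v):
    return c_switched * v * v   # E ~ C * Vdd^2 per operation

serial = energy_per_op(1.0, 3.3)     # one unit at full voltage
parallel = energy_per_op(1.2, 1.8)   # two units: assume ~20% capacitance
                                     # overhead, reduced supply voltage
print(parallel / serial)             # well under half the energy
```
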

Precomputation

Precomputation may be used to prevent unnecessary operations from being performed; for example, in a comparator it is possible to initially compare only a few of the bits. Only if these bits match need the remaining bits be compared. In the case where the first comparison gives a false result, power has been saved [ALID94].
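A software analogue of such a comparator (the probe size and datapath width are illustrative):

```python
# Precomputed equality comparison: examine a few cheap bits first and
# engage the full-width comparison only when they match. With random
# inputs the wide comparison - and its power cost - is usually skipped.

def equal_precomputed(a, b, probe_bits=4, width=32):
    """Return (result, full_compare_used)."""
    mask = (1 << probe_bits) - 1
    if (a & mask) != (b & mask):
        return False, False          # decided cheaply; wide datapath idle
    full = (a & ((1 << width) - 1)) == (b & ((1 << width) - 1))
    return full, True

print(equal_precomputed(0x12345678, 0x12345679))   # -> (False, False)
print(equal_precomputed(0x12345678, 0x12345678))   # -> (True, True)
```
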

Clocking Schemes and Asynchronous Logic

In synchronous systems significant power is consumed by the clock signal, its driver and the driven latches. Clock gating may be employed to disconnect the clock from parts of the circuit that are inactive at a particular time [BENI95].

A more radical approach is to eliminate the global clock altogether. In asynchronous systems, local timing signals control communication between blocks which occurs only when useful work has to be performed (see chapter 4).

1.2.2 Higher-Level Power Efficiency: Microprocessors

At higher levels of abstraction methods of increasing power efficiency are dependent on the particular application. In the case of microprocessors, this means the architectural properties of the processor and its instruction set.

Previous work by the author identified three particularly important architectural features: external memory bandwidth, cache characteristics, and datapath arrangement [ENDE93].

External Memory Bandwidth

The power consumed driving an external signal is substantially larger than that consumed by an internal transition; consequently it is important to minimise the number and size of external memory accesses. This can be achieved by:

  • Increasing the code density; a processor with a higher code density will require fewer or smaller instructions to be fetched for a given workload.

  • Reducing the frequency of load and store operations. This may be achieved, for example, by incorporating a larger register bank.

  • Improving the on-chip cache characteristics. A higher cache hit rate will reduce the frequency of external memory accesses. Cache write policy is also important; write-through caches will cause more external memory accesses than copy-back caches.
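The combined effect of these three measures can be sketched with a back-of-envelope traffic model; every parameter below is an illustrative assumption, not a measurement:

```python
# External-memory energy model: off-chip traffic (and hence energy)
# falls with denser code, fewer loads/stores and a better cache hit rate.

def external_energy(instructions, bytes_per_instr, mem_op_rate,
                    hit_rate, energy_per_byte=1.0):
    """Energy spent on off-chip transfers for a given workload."""
    fetch_bytes = instructions * bytes_per_instr
    data_bytes = instructions * mem_op_rate * 4     # 4-byte data words
    miss_traffic = (fetch_bytes + data_bytes) * (1 - hit_rate)
    return miss_traffic * energy_per_byte

base = external_energy(1e6, 4, 0.3, 0.90)    # 32-bit RISC-like baseline
dense = external_energy(1e6, 2, 0.2, 0.95)   # denser code, more registers,
                                             # better cache
print(dense / base)   # a substantial reduction in off-chip energy
```
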

Cache Characteristics

Because the cache is a large regular structure its power efficiency properties are both important and easy to study. Larger caches consume more power than smaller ones; this must be traded off against the corresponding reduction in external memory access frequency mentioned above.

Greater power efficiency can be obtained with multiple levels of on-chip cache. In this case the power consumption should be determined by the small size of the more frequently used first level cache, whereas overall performance should be determined by the large size of the second or further levels.

Datapath Arrangement

Typical RISC processors have a 32-bit datapath which is used for all operations. Many of those operations act on smaller quantities such as characters or small loop counters that could be processed using a small fraction of the datapath width. Power efficiency will be improved if the instruction set provides datapath operations of various widths.

1.3 Previous Low Power Processors

Most previous low power microprocessors make use of only low level power efficiency techniques in order to maintain code compatibility with earlier high power versions of the same processors. This section describes the low power techniques used by these conventional processors.

The ARM processor [FURB89] is considered to be one of the most power efficient microprocessors available [ARM95]. Its power efficiency is due primarily to its simplicity and consequent small size. The processor was designed for simplicity in order to reduce both design complexity and cost; the resulting low power consumption was a bonus. While most microprocessors are available only as single integrated circuits, the ARM is available as a VLSI macrocell which can be incorporated into a larger chip. In this way the number of off-chip interconnections required is reduced, increasing power efficiency.

Other processors that are inherently more complex must apply other mechanisms to increase their power efficiency. The most common approach is to introduce a number of power saving modes in which parts of the processor are disabled. Typically there are several modes such as "standby", "sleep" and "hibernate"; as deeper levels are entered the number of active blocks is reduced. Generally these modes are controlled by operating system software. As one example the NEC 4100 processor [TURL95b] has standby, suspend, and hibernate modes that reduce power consumption to 10 %, 5 %, and virtually zero respectively. This processor is targeted at the portable "personal digital assistant" market. Another example is the Hitachi SH7708 processor [TURL95a], which allows the clock frequency to be dynamically changed to 1, 2, or 4 times the input frequency.

Some processor designs have made use of transistor sizing to minimise power consumption. The PowerPC 400 series processors [CORR95] have been designed using a large library of basic gates with many different combinations of transistor sizes. The choice of which gate to use is made by a synthesis tool; the result is that the relatively few gates on critical paths use large transistors for high speed while the majority use smaller transistors for low power consumption.

Some other processors do incorporate architectural features that benefit power efficiency, though often these features were not included with power efficiency in mind; section 2.1.3 considers a number of processors with high code density. Generally though, no current microprocessor makes high-level architectural choices, such as changing the underlying execution model, in order to improve power efficiency.

1.4 Overview of the Thesis

Whereas the previous work described above has considered primarily low level power efficiency techniques, this work concentrates on three of the higher level techniques described in section 1.2: parallelism, asynchronous logic and high code density.

The thesis is arranged as follows:

Chapters 2 to 4 consider each of the areas of interest in turn. Chapter 2 considers how the number of transitions required to execute a program can be reduced by increasing code density and reducing datapath activity. Chapter 3 considers parallelism, particularly how parallelism is obtained in pipelined and superscalar processors. Chapter 4 considers asynchronous logic, investigating how its use can improve power efficiency and considering how the parallel structures described in chapter 3 can be implemented asynchronously.

Chapter 5 describes the SCALP architecture and explains how its novel organisation helps to meet several of the objectives identified in the previous chapters. Chapter 6 gives an introduction to asynchronous design styles before describing the implementation of SCALP as a gate level VHDL model. It gives an overview of the design and then concentrates on some of the more interesting blocks. Chapter 7 presents an evaluation of the processor, studying how well it meets its objectives and identifying strengths and weaknesses.

Chapter 8 gives a summary of the work and then considers ways in which the architecture and implementation could be improved. Ways in which the successful SCALP techniques could be applied to more conventional processors are also considered.
