SCALP is motivated by three objectives: high code density, highly parallel operation, and asynchronous implementation. These three objectives should all lead to increased power efficiency and also dictate the nature of the architecture.
Conventional instruction sets indicate the flow of data between instructions by means of register numbers: each instruction gives register numbers for its operands and result. This technique leads to complexity in pipelined superscalar implementations, where register numbers from many instructions must be compared to identify dependencies and activate the appropriate forwarding paths. In asynchronous systems these forwarding paths cause further inefficiency because of the additional synchronisation that they impose. Furthermore, the register specifier bits in a typical instruction set take up around half of each instruction, making them crucial to code density.
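To give a feel for how the comparison cost grows, the following is a minimal sketch that counts register-number comparisons within a single issue group. It assumes a simplified model that is not taken from the source: every instruction has two source registers and one destination, and each source is compared only against the destinations of earlier instructions in the same group. The function name `dependency_comparisons` and its parameters are hypothetical, and comparisons against instructions already in flight are ignored.

```python
def dependency_comparisons(width, sources_per_instruction=2):
    """Count intra-group register comparisons for an issue group.

    Assumed model (illustrative only): the k-th instruction in the group
    compares each of its source registers against the destination
    registers of the k earlier instructions in the same group.
    """
    return sum(sources_per_instruction * earlier for earlier in range(width))


# Comparison counts for a few hypothetical issue widths.
counts = {width: dependency_comparisons(width) for width in (1, 2, 4, 8)}
print(counts)
```

Under these assumptions the count is `width * (width - 1)`, so it grows quadratically with issue width.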
SCALP's main architectural innovation is its use of "explicit forwarding". SCALP does not use a global register bank; instead, each instruction indicates where its result should be sent. This takes the form of a destination queue identifier, with one such queue associated with each operand required by each functional unit. This scheme eliminates the need for many register specifier comparators. The arrangement also suits asynchronous implementation and increases code density.
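The idea of explicit forwarding can be illustrated with a small software model. This is a sketch under stated assumptions, not SCALP's actual instruction set or organisation: the functional units `add` and `mul`, the operand port names `a` and `b`, the `out` result sink, and the tuple instruction format are all hypothetical, and the model executes instructions sequentially rather than in parallel or asynchronously.

```python
from collections import deque
import operator


def make_queues(units):
    # One FIFO queue per operand of each functional unit (as in the text),
    # plus a hypothetical "out" sink to collect the final result.
    queues = {(unit, port): deque() for unit in units for port in ("a", "b")}
    queues[("out", "a")] = deque()
    return queues


def execute(program, units, queues):
    # Each instruction names its functional unit and a destination queue.
    # Operands are taken from the unit's own queues, so no register
    # numbers are compared to find dependencies.
    for unit, dest in program:
        a = queues[(unit, "a")].popleft()
        b = queues[(unit, "b")].popleft()
        queues[dest].append(units[unit](a, b))


# Hypothetical functional units.
units = {"add": operator.add, "mul": operator.mul}
queues = make_queues(units)

# Preload operands for (2 + 3) * (4 + 1).
queues[("add", "a")].extend([2, 4])
queues[("add", "b")].extend([3, 1])

program = [
    ("add", ("mul", "a")),  # 2 + 3, result sent to the multiplier's first operand
    ("add", ("mul", "b")),  # 4 + 1, result sent to the multiplier's second operand
    ("mul", ("out", "a")),  # product sent to the output sink
]

execute(program, units, queues)
result = queues[("out", "a")].popleft()
print(result)  # 25
```

In this model each instruction carries a single destination identifier and no source specifiers, which is the property the text credits for both the reduced comparator count and the improved code density.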
SCALP has other features aimed at increasing code density or otherwise improving power efficiency: it has variable-length instructions, and byte operations need only activate part of the datapath.
The background to and features of the proposed architecture are described first, followed by its implementation. SCALP is the first known asynchronous implementation of a superscalar architecture. Like the AMULET2 processor, it operates using the four-phase bundled data protocol. Much of the design uses a "macromodule" design approach, and the total size of the design is around 9,500 components. The design of some of the more interesting parts of the implementation, such as the parallel asynchronous instruction issuer, is described in detail.
The implementation has been taken to gate level using the hardware description language VHDL. The implementation is extensively evaluated in comparison with a conventional instruction set and conclusions about its effectiveness are drawn.
An online version is also available.