Asynchronous Pipelining Techniques and Applications

Extended Abstract

The design of asynchronous systems has recently experienced rapid and continuing growth within the scientific community, spurred primarily by the increasing limitations that their synchronous equivalents are encountering. In particular, reduced feature sizes for VLSI circuitry bring an increase in transistor density and a reduction in gate delays, which in turn enable faster clock speeds and, consequently, increased power dissipation. Designing and distributing the clock then becomes an arduous task whose complexity arguably rivals that of the gate logic required to implement the desired algorithm. Clock skew, clock driver sizes, and signal routing become prime concerns. The asynchronous designer alleviates these problems by removing the clock completely. Instead, control signalling between blocks is effected by local handshake communication protocols. This approach also brings other benefits: the overall power dissipation of a chip is in general significantly reduced because, in contrast to a synchronous system, inactive subcircuits do not operate when their functionality is not required; the potential for average-case behaviour exists because each part of the system may operate with a delay that is independent of that of other, possibly slower, parts; and the overall system speed can be improved by an incremental design improvement in only one part of the system.

There are, however, a number of problems associated with asynchronous systems. In particular, each individual pipeline stage of the design must have its associated control explicitly designed and tested to ensure hazard-free operation, whereas in a synchronous system all control is governed by a single global clock signal which (theoretically) alleviates these hazards. Furthermore, although the power and area consumed by the asynchronous handshake circuits are generally much less than those consumed by the datapath computations, the propagation delay imposed by the control logic between communicating stages can often be excessive and can severely limit the potential throughput (and therefore the overall system speed) of an asynchronous pipeline. The class of asynchronous design that this paper concentrates on is bounded-delay: the designer can quantify propagation delays in both gates and wires. By exploiting bounded-delay circuits, asynchronous designers may use many circuit components common to synchronous design, for instance combinational logic blocks, data latches, and flip-flops, and so avoid much of the overhead associated with delay-insensitive circuits.

This paper will introduce a range of delay-insensitive, speed-independent, and fundamental-mode event-controlled (two-cycle) pipeline architectures which have been regularly used in the design of systems such as elastic FIFOs, microprocessors, and digital signal processors. The intention is to outline the advantages and applications of each and, in particular, to focus on the relative latency and throughput of each pipeline. These architectures include the simple yet effective `micropipeline' introduced by Ivan Sutherland, transparent and opaque latch control variations, and various precharged pipeline structures.
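As an illustration of the event-controlled (two-cycle) handshake, the following is a minimal behavioural sketch of a micropipeline-style control chain, assuming the commonly described arrangement in which each stage is a Muller C-element fed by the request from the previous stage and the inverted acknowledge from the next. It is a zero-delay model of the control wires only (no data latches, no timing), and the names `MicropipelineControl`, `push`, `pop`, and `occupancy` are hypothetical, introduced here purely for illustration.

```python
def c_element(a, b, prev):
    """Muller C-element: output copies the inputs when they agree, else holds."""
    return a if a == b else prev


class MicropipelineControl:
    """Hypothetical zero-delay model of an n-stage two-phase control chain."""

    def __init__(self, stages):
        self.c = [0] * stages   # C-element outputs, one per stage
        self.req_in = 0         # request wire driven by the sender
        self.ack_out = 0        # acknowledge wire driven by the receiver

    def _settle(self):
        # Re-evaluate every C-element until no output changes.
        last = len(self.c) - 1
        changed = True
        while changed:
            changed = False
            for i in range(len(self.c)):
                req = self.req_in if i == 0 else self.c[i - 1]
                ack = self.ack_out if i == last else self.c[i + 1]
                new = c_element(req, 1 - ack, self.c[i])
                if new != self.c[i]:
                    self.c[i] = new
                    changed = True

    def request_pending(self):
        # In two-phase signalling a request is outstanding while the
        # request wire and its acknowledge (c[0]) differ.
        return self.req_in != self.c[0]

    def occupancy(self):
        # A stage holds a token while its output differs from its acknowledge.
        nxt = self.c[1:] + [self.ack_out]
        return sum(1 for cur, n in zip(self.c, nxt) if cur != n)

    def push(self):
        if self.request_pending():
            raise RuntimeError("previous request not yet acknowledged")
        self.req_in ^= 1        # one transition = one request event
        self._settle()

    def pop(self):
        if self.c[-1] == self.ack_out:
            raise RuntimeError("no token at the output")
        self.ack_out ^= 1       # one transition = one acknowledge event
        self._settle()


pipe = MicropipelineControl(3)
for _ in range(3):
    pipe.push()
print(pipe.c, pipe.occupancy())
```

In this model the first request ripples straight through to the last stage, later requests queue up behind it, and a fourth request on a full three-stage chain stays pending until the receiver acknowledges, which is the elastic FIFO behaviour referred to above.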

In addition, some interesting and novel implementations of the datapath will be introduced, including dynamic logic, completion detection, and matched delay modelling. Dynamic logic can be particularly useful in asynchronous circuits for low-power, high-speed applications because of its reduced nodal capacitances and resilience to spurious transitions. The precharge phase of the dynamic logic is used to block unnecessary transitions. For example, if the datapath forks into two paths for speculative evaluation, power consumption can increase dramatically because the data in one of the paths will later be discarded. If dynamic logic is used, however, the propagation of data is blocked until the branch to be taken enters its evaluation phase. Dynamic logic, in particular DCVSL, can also be used for self-timed calculations such as fast carry propagation in adders and incrementers.
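To make the data dependence of self-timed carry propagation concrete, the sketch below estimates how far a carry must ripple for given operands. It assumes a simple model in which the completion time of a self-timed adder is set by the longest run of consecutive `propagate' bit positions (where the two operand bits differ), since only those positions must wait for their carry-in. The function names and the uniform random operand model are assumptions made for illustration, not part of the designs described in this paper.

```python
import random


def longest_propagate_run(a, b, width):
    """Longest run of consecutive bit positions where a carry must ripple."""
    propagate = (a ^ b) & ((1 << width) - 1)
    longest = run = 0
    for i in range(width):
        if (propagate >> i) & 1:
            run += 1                    # this position waits for its carry-in
            longest = max(longest, run)
        else:
            run = 0                     # carry-out is decided locally here
    return longest


def average_run(width, samples=2000, seed=0):
    """Mean longest run over uniformly random operand pairs (assumed model)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        total += longest_propagate_run(a, b, width)
    return total / samples


print(longest_propagate_run(0b0101, 0b1010, 4))  # worst case: every bit propagates
print(average_run(32))
```

Under this assumed operand model the average run is far shorter than the word length, which is the average-case advantage a self-timed adder can exploit, whereas a clocked adder must always allow for the full-width worst case.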

When dealing with self-timed calculations a completion detection strategy is required. This paper will present a few alternative solutions to this problem and discuss their relative merits in terms of speed, power, area, and subsystem VLSI floorplanning. Alternatively, if a bounded-delay approach is adopted, then a suitable `matched path' or completion detection strategy is required. The `matched path' strategy can be effected by creating a `dummy' computation which mirrors the worst-case datapath computation and indicates its completion, and which therefore closely tracks any process or environmental operating variations. Alternatively, a lumped delay model may be used in the control path, which can be powered by a separate, variable Vdd bus to give greater delay modelling flexibility (this is analogous to clock frequency variation in the synchronous realm). The final alternative to be presented is completion detection in data-dependent operations such as addition and multiplication. Again, dynamic logic solutions can offer significant advantages with respect to performance, and hence power consumption, particularly when a four-phase handshake protocol is used, since the return-to-zero delay of the completion signal is only the time required for the circuit to return to precharge (typically the propagation delay of two inverters). The equivalent static circuit incurs a propagation delay through its full completion circuit.
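One common form of completion detection can be sketched behaviourally as follows, assuming a dual-rail encoding in which each bit is carried on a (true, false) pair of wires and the all-zero pair is the precharged spacer. The detector below raises `done` once every bit is valid and lowers it only once every bit has returned to the spacer, mimicking a four-phase handshake. The names `encode`, `SPACER`, and `CompletionDetector` are hypothetical, and the model carries no timing information, so it does not capture the precharge speed advantage discussed above.

```python
SPACER = (0, 0)  # precharged / return-to-zero state of one dual-rail bit


def encode(bit):
    """Dual-rail encode one bit as a (true_rail, false_rail) pair."""
    return (1, 0) if bit else (0, 1)


class CompletionDetector:
    """Behavioural four-phase completion detector with hysteresis."""

    def __init__(self):
        self.done = 0

    def update(self, rails):
        # One OR per bit tells whether that bit has left the spacer state.
        arrived = [t | f for t, f in rails]
        if all(arrived):
            self.done = 1       # every bit valid: computation complete
        elif not any(arrived):
            self.done = 0       # every bit precharged: ready for next operand
        # Otherwise hold the previous value (mixed valid/spacer state).
        return self.done


det = CompletionDetector()
word = [encode(b) for b in (1, 0, 1, 1)]
print(det.update([SPACER] * 4))          # nothing has arrived yet
print(det.update(word[:2] + [SPACER] * 2))  # partial result
print(det.update(word))                  # full result
```

The hold behaviour in the mixed state plays the role that a C-element would play in a static implementation; a matched-path design would replace this detector with a delay element sized to the worst-case datapath.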

Finally, this paper will provide a number of design examples which exploit the techniques outlined above, including a blocksorter for video processing and the datapath of a CD player error decoder.