The SCALP Instruction Issuer Problem

Introduction

SCALP is a superscalar asynchronous microprocessor. One of the key problems in superscalar processor design is being able to issue the instructions that have been fetched from the memory to the appropriate functional units in the right order and at a suficient rate. Being asynchronous only makes this harder.

SCALP instructions contain no register specifiers which removes one of the greatest problems, namely checking for dependencies between instructions. However SCALP does still have to ensure the correct ordering of instructions, and branches make this a little hard. SCALP also features variable length instructions which complicate matters.

This document describes the SCALP Instruction Issuer Problem. It is organised into three parts:

A description of the problem;
A description of my solution to that problem at the high level;
My implementation of that solution.

To keep things manageable I have chosen to temporarily hide two of the worst nightmares of the design, namely the operation of Branch instructions and the variable length instructions. This happens to be how I designed it - firstly I solved the simple problem and then added on the nasties. I describe these extra problems in additional sections at the end.

I hope that this problem may be of interest to asynchronous theorists looking for a real problem to apply their techniques to and also to asynchronous designers who want to compare design approaches. There are parts of this design that I'm not proud of (I'll let you work out which!) and I wouldn't be at all suprised if many of you come up with better solutions. So enjoy and let me know what you think!

Introduction
The Problem
My Solution
My Implementation
Adding Branch Instructions
Adding Variable Length Instructions

The Problem

SCALP instructions are 12 bits long. They come in groups of 5 which are fetched from memory at the same time. The instructions have two parts:

The first two bits indicate which of the four functional units must execute this instruction.
The remaining ten bits tell the functional unit what operation to perform.

The role of the instruction issuer is to receive groups of 5 instructions from the memory and to distribute them to the four functional units. The issuer has conventional asynchronous interfaces to the stream of groups coming from the memory and to the four functional units. One simple though unsatisfactory scheme is shown here:

In this arangement, when an instruction group arrives from memory each of its component instructions is considered in turn using a counter and a multiplexor. The functional unit indicating bits are separated off and used to control a four-way select element. The remaining ten bits are connected to each of the four functional units, but under the control of the select element only one of them receives a request signal.

The problem with this solution is that it imposes too much sequentiality. In a parallel processor like SCALP it is essential that no part of the system becomes a bottleneck; the throughput of each stage must be balanced. The memory interface is wide so several instructions can be fetched together and the four functional units can operate in parallel. It is necessary that the instruction issuer can also operate with some matching degree of parallelism.

So this is the problem: to build an instruction issuer that incorporates parallelism.

Note that although the order in which instructions for one particular functional unit must arrive at that functional unit is fixed, the order in which instructions are issued between the functional units may vary. For example, if an instruction group contains five instructions for these functional units

 0   1   2   3   4
F0  F0  F2  F2  F1

then initially instructions 0, 2 and 4 can be issued to functional units F0, F2 and F1 respectively. Subsequently instructions 1 and 3 can be issued.

The boundaries between instruction groups are also unimportant. For example if two groups contain instructions for these functional units

 0   1   2   3   4
F1  F1  F1  F1  F1
F0  F0  F0  F0  F0

then the issuer may choose to issue instruction 0 in the first group to functional unit F0 and simultaneously instruction 0 in the second group to functional unit F1.

A good solution will trade off the performance benefits of these forms of out-of-order issue against their implementation cost.

My Solution

I first divide the system into control and datapath sections. The datapath comprises a 5 by 4 crossbar switch that can connect any combination of the instructions from the incoming group to any of the functional units.

The incoming instruction groups are stored in what I refer to a "forked fifo"; this is an asynchronous fifo queue whose input is controlled by a single request/acknowledge pair and whose output has separate request and acknowledge signals for each instruction. This arangement means that once an instruction has been issued then it can be acknowledged and removed from the output of the fifo, making the instruction behind it in the next instruction group visible and eligable for issue.

The control is then divided up into one unit coresponding to each of the functional units. These units "watch" the instructions at the front of the forked fifo and when they see an instruction for their functional unit they activate the appropriate switch in the crossbar and send a request to the functional unit.

This arangement is shown in the following figure:

Each functional unit controller can see the following information about each instruction at the front of the forked fifo:

From the state of the request and acknowledge signals, whether any valid instruction is present.
From the decoded functional unit bits in the instuction (I call these the "for me" signals) whether this instruction is for this functional unit.

With this information the functional unit controller would like to be able to choose which if any of the instructions it may issue at this time. If none of the instructions is "for me" then it can do nothing. If one of the instructions is "for me" then that one may be issued. If more than one instruction is "for me" then the first one in program order may be issued.

Unfortunately this information is not quite suficient because of the way that the forked fifo works. Consider the following two cases:

 0   1   2   3   4
F0  F1  F2  F3  F0
F0  F1  F0  F1  F2

 0   1   2   3   4
F0  F1  F2  F3  F0
        F0  F1  F2

In the first case the front row of the forked fifo is full. In the second case the first two instructions from the front row have already been issed making the first two instructions from the second row visible.

Note that in both cases the instructions visible to the controllers are the same: F0, F1, F0, F1, F2. Consider the controller for functional unit 0. In the first case the correct behavior is for it to issue instruction 0. In the second case the correct behavior is for it to issue instruction 2.

Here is another example:

 0   1   2   3   4
F0  F1  F0  F2  F2
F0  F1  F0  F1  F2

 0   1   2   3   4
F0  F1  F0  F2  F2
F0  F1  F0  F1

Once again the instructions visible to the controllers are the same: F0, F1, F0, F1, F2. Consider the controller for functional unit 2. In the first case the correct behavior is to for it to issue instruction 4. In the second case the next instruction in program order for functional unit 2 is instruction 3 of the second group; an instruction which is not visible yet.

To solve these problems I associate a modulo-3 counter controlled by the request and acknowledge signals with each column of the forked fifo. The controllers see the states of these counters and are able to tell which instructions belong logically before others. I use letters A, B, C to indicate the state of the counter and refer to its value as the "layer" in which the instruction belongs; this is orthogonal to the rows and columns of the crossbar.

The algorithm that the controllers have to implement is then as follows:

At reset, the "current column" is column 0 and the "current layer" is layer A
If the instruction in the current column is not valid (i.e. request is not asserted; it has not arrived) then wait.
If the instruction in the current column is for the previous layer then wait.
If the instruction in the current column is for the current layer and it is for this functional unit then issue it.
If the instruction in the current column is for the current layer but it is not for this functional unit then consider the next instruction.
If the instruction in the current column is for the next layer then consider the next instruction.
When the instruction in column 4 has been considered, go on to column 0 of the next layer.

The algorithm described here is sequential in nature, but it could be implemented in a parallel fashion using parallel priority encoders and similar techniques.

After considering several possibilities I decided that the best implementation is in fact a hybrid sequential / parallel implementation. The principle of operation is this: for each column and each layer a block of logic precomputes what it would do if this instruction and layer were the "current" instruction and layer. These precomputations can all be carried out in parallel. A "token" is then passed between the blocks making them the current block for a brief time. Since the action that must be carried out when the token has arrived has been precomputed the token can progress from block to block very quickly.

The organisation of the blocks and the token's path between them is shown in the following figure:

The internal organisation of each of the blocks is shown here:

At the ends of the various rows columns and layers things like OR gates are required to merge the signals that may come from several blocks within the structure.

And so to the implementation:

My Implementation

Token Logic

The design of the token logic proved one of the most interesting parts of the design. SCALP is built using 4-phase request and acknowledge signals as AMULET experience shows 4-phase structures to be faster and smaller than equivalent 2-phase structures. Having built a 4-phase implementation of the token logic I presented the design to the AMULET group meeting. Robert Kelly, who worked on the "TAMULET" 2-phase ECL implementation of the asynchronous ARM, sugested that I consider using 2-phase signalling to implement the token signals. He sketched a circuit that is faster, smaller and conceptually cleaner than the 4-phase implementation. So here it is, the token logic in 2-phase.

The requirements can be described by this psuedo-STG. The signals has_token and pass_token are 4-phase signals and their rising and falling transitions are shown. The token_in and token_out signals are 2-phase and rising and falling transitions are equivalent.

I say psuedo-STG because this STG has some unpleasant properties; it allows the next token to arrive before has_token has been cleared, for example. I know that in my implementation the token has to pass through 14 other blocks, some of which are going to block it, before it gets back here. I therefore choose to ignore some of these problems.

My implementation is as follows:

The component on the right is "half a 2-phase toggle"; alternate transitions on the input cause transitions on the output. In can be implemented with a simple d-type flip flop.

Operation is as follows: when no token is present, the input and output and the same so the output of the xor gate is 0. When a token arrives on the input, the xor gate output becomes high asserting has_token. When pass_token is asserted the output is toggled, transmitting a token to the next stage and also inverting the input to the xor gate. This causes has_token to become 0 once again.

Getting the reset behavior correct is important. For all blocks the "half 2-phase toggle" element's internal state must be cleared such that its output is low and the first rising edge of pass_token will cause an output transition. Additionally the block in column 0, layer A must generate an initial token as reset is removed. I do this by adding an extra xor gate in the path from token_out to the xor gate.

As an aside, the reason why the 4-phase implementation was more complex is that it needed acknowledge signals for the tokens. The 2-phase circuit manages without them.

Decoding Logic

The purpose of the decoding logic is to decide on the basis of the for_me, layer, and valid signals what should happen when the token arrives. It has two outputs, "issue" and "pass". "issue" indicates that when the token arrives the instruction in this column should be issued and the token passed on. "pass" indicates that when the token arrives it should be passed on without issuin anything. If neither is asserted, when the token arrives nothing should happed i.e. the block waits. The function that the block has to perform is as follows:

Inputs                     Interpretation               Outputs

Valid   Layer     For_me                                Issue   Pass

X       Previous  X        Can't see the instruction
                           we want; wait                0       0

0       This      X        Must wait and see if this
                           instruction is for me        0       0

1       This      0        This instruction not for me  0       1

1       This      1        For me; issue it             1       0

X       Next      X        Don't worry about this yet   0       1

Unfortunately being an asynchronous system this can't be implemented with a simple combinational function. We have to worry about glitches on the Issue and Pass outputs and think about the order in which the inputs can change.

The valid signal is ( request AND NOT acknowledge ). Assume that the AND NOT function is hazard free and fast.

The for_me signal is a bundled data signal that will be set up before valid is asserted and is held until after it is disasserted. While valid is low for_me can glitch.

The layer signals are driven by a counter controlled by the request and acknowledge signals. I arange the counter so that the layer signals change in response to acknowledge rising, which coresponds to valid falling; in fact valid and the layer signals can change at around the same time. This isn't a problem because both valid rising first and this_layer rising first lead to the same sequence of outputs. However it does mean that the circuit has to be hazard free with respect to these changes.

My implementation of the decode logic is as follows:

Sequencing Logic

The function of the sequencing logic is to interface do what the decode logic says must be done when the token arrives. This means either passing the token on by asserting pass_token or actually issuing an instruction.

The case of just passing the token is quite simple; when "pass" and "has_token" are both asserted together, assert "pass_token".

Issuing the instruction involves more steps and some questions. For example, at what point should the token be passed on? Can it be passed when issuing starts or must it wait until it has completed? There are similar and important questions about the order in which things can be disaserted.

My solution is described by the following schematic:

(Note that asymetric Muller-C elements are represented by a gate symbol with extensions to one or both sides. Inputs joining the body of the gate take part in both rising and falling transitions of the output. Inputs joining a side marked with a + or - take part in only that transition.)

Adding Branch Instructions

Adding Variable Length Instructions

If you are interested these sections are described in my Thesis.