### **1-of-4 Globally Asynchronous Interconnect**

John Bainbridge, Steve Furber

**University of Manchester, UK** 

Funded by Theseus Logic Inc.

- The AMULET3 SoC subsystem
- Problems with wires
- Latch controller performance
- DI alternatives to bundled data
- 1-of-4 transition signalling
- 1-of-4 level signalling
- Packet comms issues
- Example application: MARBLE replacement

### **The AMULET3 SoC Subsystem**



**AMULET** group, University of Manchester

### **Problems with wires**

- Wires have resistance, R
- Parallel wires are coupled
- Coupling is mostly due to capacitance, C
- Signal propagation speed depends upon the RC time constant
- Parallel interconnections show data-dependent variation in propagation speed
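The data dependence can be illustrated with a first-order model. The sketch below assumes a distributed RC wire (using the common 0.38·R·C approximation for the 50% delay) and a Miller factor of 0, 1 or 2 on each coupling capacitance, depending on whether that neighbour switches with the victim wire, stays quiet, or switches against it. The function name and the per-millimetre values are illustrative placeholders, not figures from this talk.

```python
def wire_delay_ns(length_mm, r_per_mm=100.0, cg_per_mm=0.1e-12,
                  cc_per_mm=0.1e-12, miller_left=1, miller_right=1):
    """Approximate 50% delay of a distributed RC wire with two neighbours.

    r_per_mm  : wire resistance in ohms per mm (placeholder value)
    cg_per_mm : capacitance to ground in farads per mm (placeholder value)
    cc_per_mm : coupling capacitance to EACH neighbour in farads per mm
    miller_*  : 0 = neighbour switches the same way, 1 = neighbour quiet,
                2 = neighbour switches the opposite way
    """
    r_total = r_per_mm * length_mm
    c_total = (cg_per_mm + (miller_left + miller_right) * cc_per_mm) * length_mm
    return 0.38 * r_total * c_total * 1e9  # seconds -> nanoseconds


best = wire_delay_ns(10, miller_left=0, miller_right=0)
nominal = wire_delay_ns(10)
worst = wire_delay_ns(10, miller_left=2, miller_right=2)
print(f"10 mm wire: best {best:.2f} ns, nominal {nominal:.2f} ns, worst {worst:.2f} ns")
```

With these placeholder values the same wire is five times slower in the worst case than in the best case, purely because of what its neighbours are doing, and the delay grows with the square of the length.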



### **Bundled Data Signalling**

[Figure: SENDER and receiver connected by request and acknowledge wires.]

### **Bundled Data Signalling & Crosstalk**





### Delay variation (due to crosstalk) with wire length



The delay and its variation make it even harder to guarantee that the bundling constraint is met: the design needs 100% margin to cover the worst case, and it pays for that worst case on every transfer.

#### **Bundled-Data Latch controller performance**



[Figure: normally-closed and normally-open latch controllers.]

| Measurement | Normally closed (ns) | Normally open (ns) |
|---|---|---|
| Forward latency (input request rising $\rightarrow$ output request rising) | 0.9 | 0.6 |
| Input latching time (input request rising $\rightarrow$ input acknowledge rising) | 1.7 | 0.9 |
| Total input cycle time (input request rising $\rightarrow$ input acknowledge falling) | 2.2 | 1.5 |
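As a rough worked example, the cycle times in the table can be turned into an upper bound on transfer rate. This sketch assumes the total input cycle time is the only limit on how often a stage can accept data and that wire delay is ignored; both are simplifying assumptions, and the names are hypothetical.

```python
# Total input cycle times taken from the table above, in nanoseconds.
CYCLE_TIME_NS = {"normally-closed": 2.2, "normally-open": 1.5}


def max_transfer_rate_mhz(cycle_time_ns):
    """Upper bound on transfers per second (in MHz) for one latch stage."""
    return 1e3 / cycle_time_ns


for name, cycle in CYCLE_TIME_NS.items():
    print(f"{name}: at most about {max_transfer_rate_mhz(cycle):.0f} M transfers/s")
```

Under these assumptions the normally-open controller sustains roughly 667 M transfers/s against roughly 455 M transfers/s for the normally-closed one.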

### **Delay Insensitive Signalling**

[Figure: delay-insensitive link with an acknowledge wire.]

- Uses more wire
- But easier timing closure, potentially higher performance

### **Delay Insensitive Encodings**

- Remove the need for margin
- Allow exploitation of the average-case delay
- BUT: incur an area/power penalty
- AND: completion detection is required!

| Code        | Bits/wire | Transitions/bit (NRTZ) |
|-------------|-----------|------------------------|
| single-rail | 1         | ?                      |
| dual-rail   | 1/2       | 1                      |
| 1-of-4      | 2/4       | 1/2                    |
| 3-of-6      | 4/6       | 3/4                    |
| 2-of-7      | 4/7       | 2/4                    |
| 3-of-8      | 5/8       | 3/5                    |
| 2-of-9      | 5/9       | 2/5                    |

- 1-of-4 => 4 symbols, can encode 2 bits
- 3-of-6 => 20 symbols, can encode 4 bits
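The table entries follow from counting code words: an m-of-n code has C(n, m) symbols, which carry the largest whole number of bits that fits in that count. The sketch below recomputes the figures that way; the function name is hypothetical, and it requires Python 3.8+ for `math.comb`.

```python
from fractions import Fraction
from math import comb


def m_of_n_properties(m, n):
    """Return (symbols, bits, bits_per_wire, transitions_per_bit) for an m-of-n code."""
    symbols = comb(n, m)             # number of distinct code words
    bits = symbols.bit_length() - 1  # floor(log2(symbols)): whole bits per code word
    return symbols, bits, Fraction(bits, n), Fraction(m, bits)


for m, n in [(1, 2), (1, 4), (3, 6), (2, 7), (3, 8), (2, 9)]:
    symbols, bits, per_wire, per_bit = m_of_n_properties(m, n)
    print(f"{m}-of-{n}: {symbols} symbols, {bits} bits, "
          f"{per_wire} bits/wire, {per_bit} transitions/bit")
```

Note that `Fraction` reduces its results, so 2/4 is printed as 1/2 and 4/6 as 2/3. Dual-rail is the 1-of-2 case.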

### **1-of-4 Encoding**

[Figure: 1-of-4 encoded link with an acknowledge wire.]

- Same wire cost as dual rail, half the power, smaller acknowledge fan-in
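A minimal sketch of the encoding itself may help. It assumes that the two-bit value v is signalled on wire v and that a byte is sent least-significant pair first; both choices are illustrative assumptions, since the slides do not fix a mapping.

```python
def encode_symbol(value):
    """Map a 2-bit value (0..3) to a one-hot group of four wires (assumed mapping)."""
    if not 0 <= value <= 3:
        raise ValueError("a 1-of-4 symbol carries exactly two bits")
    return tuple(1 if wire == value else 0 for wire in range(4))


def decode_symbol(wires):
    """Recover the 2-bit value; exactly one of the four wires must be active."""
    if len(wires) != 4 or sum(wires) != 1:
        raise ValueError("not a valid 1-of-4 code word")
    return wires.index(1)


def encode_byte(byte):
    """Split a byte into four symbols, least-significant pair first (assumed order)."""
    return [encode_symbol((byte >> shift) & 0b11) for shift in (0, 2, 4, 6)]


def decode_byte(symbols):
    return sum(decode_symbol(s) << (2 * i) for i, s in enumerate(symbols))


symbols = encode_byte(0xA7)
active_wires = sum(sum(s) for s in symbols)  # one active wire per two bits
print(symbols, active_wires, hex(decode_byte(symbols)))
```

A byte uses 16 wires either way, but only four of them are active per byte here, where dual rail would activate eight. That is the source of the "half the power" claim.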

### 2-Phase, Transition Signalling - Latches







### **Improved 1-of-4 Transition Signalling Pipeline Latch**



- Still fairly costly: large XOR gates are needed for completion detection
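To show where the XOR gates come from, here is a behavioural sketch of 2-phase (transition) signalling on a 1-of-4 group. It is an illustrative model under the assumption that a symbol is sent by toggling the wire with the same index, not the latch circuit from the slides; all class and function names are hypothetical.

```python
class TransitionSender:
    """Sends value v by toggling wire v; the wires never return to zero."""

    def __init__(self):
        self.wires = [0, 0, 0, 0]

    def send(self, value):
        self.wires[value] ^= 1
        return tuple(self.wires)


class TransitionReceiver:
    """Finds which wire changed by XOR-ing the wires with their last seen state."""

    def __init__(self):
        self.previous = (0, 0, 0, 0)

    def receive(self, wires):
        changed = [now ^ before for now, before in zip(wires, self.previous)]
        if sum(changed) != 1:
            raise ValueError("expected exactly one wire transition per symbol")
        self.previous = tuple(wires)
        return changed.index(1)


def parity(wires):
    """4-input XOR: toggles once per symbol, so it can signal completion."""
    return wires[0] ^ wires[1] ^ wires[2] ^ wires[3]


sender, receiver = TransitionSender(), TransitionReceiver()
for value in [2, 2, 0, 3]:
    wires = sender.send(value)
    print(value, wires, receiver.receive(wires), parity(wires))
```

Each symbol costs a single wire transition, but the receiver needs a per-wire XOR against stored state plus a wide XOR across the group, which is the cost the slide refers to.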

### 4-Phase, 1-of-4 Level Signalling Latch



- Simpler than transition signalling
- RTZ takes time, but no slow XOR gates
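The 4-phase alternative can be sketched the same way. The trace below assumes the usual return-to-zero handshake order (data valid, acknowledge high, data returned to zero, acknowledge low) and a one-hot mapping of value v to wire v; it is a behavioural illustration with hypothetical names, not the latch design shown on the slide.

```python
IDLE = (0, 0, 0, 0)


def data_valid(wires):
    """Completion detection is a plain OR of the four wires; no XOR is needed."""
    return any(wires)


def four_phase_transfer(value):
    """Return the (wires, acknowledge) states of one symbol transfer, in order."""
    data = tuple(int(wire == value) for wire in range(4))
    return [
        (data, 0),  # 1. sender raises exactly one wire
        (data, 1),  # 2. receiver sees OR(wires) high, latches, raises acknowledge
        (IDLE, 1),  # 3. sender returns the wire to zero (RTZ)
        (IDLE, 0),  # 4. receiver sees OR(wires) low, lowers acknowledge
    ]


for wires, ack in four_phase_transfer(3):
    print(wires, ack, data_valid(wires))
```

Every symbol now costs two wire transitions and two acknowledge transitions, which is the time penalty of RTZ, but the completion logic shrinks to an OR gate.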

### **1-of-4 Encoding**

[Figure: SENDER and RECEIVER connected by a 1-of-4 link with an acknowledge wire.]

- Same wire cost as dual rail, half the power
- […] based signalling means the worst case never arises

### **Add repeaters**



Use latch stages instead of amplifiers




#### Use independent acknowledges for each 1-of-4 group



W. J. Bainbridge, March 2001, ASYNC'01



#### **Multiplex connections for wide datapaths**



Can run (fewer) intermediate, narrow links at higher speeds
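The trade-off can be sketched as a round-robin multiplexer: a wide path of several 1-of-4 groups is squeezed onto one narrow link, which then has to cycle that many times faster to keep up. The function names and the round-robin order are assumptions made for illustration.

```python
def multiplex(wide_words):
    """Flatten words of k symbols into one symbol stream, group 0 first."""
    return [symbol for word in wide_words for symbol in word]


def demultiplex(stream, groups):
    """Regroup a narrow-link symbol stream into words of `groups` symbols."""
    if len(stream) % groups:
        raise ValueError("stream does not contain a whole number of words")
    return [tuple(stream[i:i + groups]) for i in range(0, len(stream), groups)]


def required_link_rate(wide_rate, groups):
    """Symbol rate the narrow link needs to match the wide path's word rate."""
    return wide_rate * groups


words = [(0, 3, 1, 2), (2, 2, 0, 1)]  # two words, four 2-bit symbols each
stream = multiplex(words)
print(stream, demultiplex(stream, groups=4), required_link_rate(100e6, 4))
```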

#### **Multiplex & Demultiplex Elements**





#### **Two-bit (1-of-4) Packet Switch**



### **Packet Format**

- How do we detect the start/end of a packet?
  - add extra signalling wires
  - fixed-length packets, etc.



- What network topology?
  - ring / star / multiplex-demultiplex 'bus' type arrangement

### **Example: MARBLE Replacement**



- high performance requirement
- chain multiple 1-of-4 links in parallel
- 4 parallel links → 4 Gb/s
- must ensure that:
  - all select units switch packets to the same output
  - all merge units accept packets from the same input
- Use three such links for address, write-data and read-data
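As a back-of-envelope check, the 4 Gb/s figure implies 1 Gb/s per link, and at two bits per 1-of-4 symbol that is one symbol every 2 ns. The sketch below works that out and shows one possible way to stripe a word across the four links; the striping order, the 32-bit word width and the function names are assumptions, not details from the slides.

```python
def link_budget(aggregate_gbps=4.0, links=4, bits_per_symbol=2):
    """Return (per-link Gb/s, symbol cycle time in ns) implied by the aggregate rate."""
    per_link_gbps = aggregate_gbps / links
    symbols_per_ns = per_link_gbps / bits_per_symbol  # Gsymbol/s equals symbols per ns
    return per_link_gbps, 1.0 / symbols_per_ns


def stripe(word, width_bits=32, links=4):
    """Split a word into 2-bit symbols and deal them round-robin onto the links."""
    symbols = [(word >> shift) & 0b11 for shift in range(0, width_bits, 2)]
    return [symbols[lane::links] for lane in range(links)]


def unstripe(lanes):
    """Reassemble the word; every lane must deliver its symbols in order."""
    links = len(lanes)
    count = sum(len(lane) for lane in lanes)
    symbols = [lanes[i % links][i // links] for i in range(count)]
    return sum(symbol << (2 * i) for i, symbol in enumerate(symbols))


print(link_budget())
lanes = stripe(0xDEADBEEF)
print(lanes, hex(unstripe(lanes)))
```

Reassembly only works if the four lanes stay in step, which is why all select units must switch to the same output and all merge units must accept from the same input.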



### **Conclusion**

- Bundled data is unsuitable for global interconnect
  - crosstalk causes serious skew
  - the bundling constraint is hard to validate
- DI encoding is much better
- 1-of-4 beats dual rail
  - same wiring cost
  - lower power
  - better fan-in?
- DI codes are hard for multipoint buses, but easy on point-to-point links
- Point-to-point switched networks are good for globally asynchronous interconnect (GALS?)
  - Buffer to increase speed
  - Go partly serial to reduce wiring cost
  - Network offers high bandwidth and possible concurrency
  - Improves with CMOS technology