AMULET3i contains a number of addressable devices; probably the most interesting of these is the "local" RAM. This comprises 8 Kbytes of SRAM, located between the processor core and the MARBLE bus. In this device the RAM is memory-mapped and cannot be configured as a cache memory (although this will change in future devices). However, the memory does have some novel features.
The memory is composed of eight 1 Kbyte blocks of RAM, each block having two independent access ports. At this level the processor system has separate instruction and data buses, although the memory map is unified. The RAM blocks therefore appear as dual-port memories to the processor. Internally the RAMs are single-ported (to reduce their physical size); if two access requests are present at the same time, arbitration occurs and the later-arriving cycle is stretched to allow the earlier one to complete. This happens automatically and transparently in the asynchronous handshake world; no clock gating or wait signals are required.
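The arbitration described above can be pictured with a small timing model. This is only a sketch: the function name, the abstract time units and the single fixed `service_time` are assumptions made for illustration, and the tie-break in favour of the first request is arbitrary (a real asynchronous arbiter resolves near-simultaneous requests in hardware).

```python
def arbitrate(arrival_a: float, arrival_b: float, service_time: float):
    """Completion times (done_a, done_b) for two requests to one single-ported block.

    The earlier request is served at once; the later one is stretched until
    the block is free. Times are in arbitrary units (assumed, not measured).
    """
    if arrival_a <= arrival_b:  # tie goes to A: an arbitrary modelling choice
        done_a = arrival_a + service_time
        start_b = max(arrival_b, done_a)  # B waits only if A is still busy
        return done_a, start_b + service_time
    done_b = arrival_b + service_time
    start_a = max(arrival_a, done_b)
    return start_a + service_time, done_b


# Overlapping requests: the later one is stretched.
print(arbitrate(0, 1, 5))
# Non-overlapping requests: neither is delayed.
print(arbitrate(0, 10, 5))
```

No wait signal appears in the model because the requester simply sees a longer cycle, which is how a handshake interface would experience the delay.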
If the memory were a single block such collisions would be quite frequent; however with eight independent (interleaved) blocks it is unusual for a collision to occur and most cycles will proceed at full speed. This gives most of the advantages of a dual-port memory with little of the silicon overhead.
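To make the effect of interleaving concrete, the sketch below maps addresses to blocks and counts how often an instruction access and a data access would land in the same block. The interleaving granularity is not stated here; the code assumes blocks are interleaved on 4-word (16-byte) line boundaries, and it assumes uniformly distributed, independent addresses, which real programs will not follow. All names are illustrative.

```python
WORD_BYTES = 4
LINE_WORDS = 4
LINE_BYTES = WORD_BYTES * LINE_WORDS  # 16 bytes per line (assumed interleave unit)
NUM_BLOCKS = 8
RAM_BYTES = 8 * 1024


def block_of(addr: int) -> int:
    """Block index for a byte address, assuming line-granularity interleaving."""
    return ((addr % RAM_BYTES) // LINE_BYTES) % NUM_BLOCKS


def collides(instr_addr: int, data_addr: int) -> bool:
    """True if simultaneous instruction and data accesses target the same block."""
    return block_of(instr_addr) == block_of(data_addr)


def uniform_collision_fraction() -> float:
    """Fraction of all (instruction line, data line) pairs that share a block."""
    lines = RAM_BYTES // LINE_BYTES
    hits = sum(
        1
        for i in range(lines)
        for d in range(lines)
        if collides(i * LINE_BYTES, d * LINE_BYTES)
    )
    return hits / (lines * lines)


print(uniform_collision_fraction())  # 0.125, i.e. one pair in eight
```

Under these assumptions one simultaneous pair in eight would collide, compared with every pair for a single block; the actual rate depends on the access pattern and on how often the two ports are busy at the same time.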
To increase this parallelism further each RAM block is provided with a single-line (4-word) cache on each of its bus interfaces. These latches are present as a consequence of the internal architecture and are used both to accelerate repeat accesses to a line and to lower their power consumption. Because there are eight such latches on each bus (one per block) an appreciable number of RAM cycles can make use of these without activating the RAM array. This saves power and provides a faster read cycle; the latter advantage is exploited automatically in the asynchronous system. With separate latches for the instruction and data buses (essentially small, separate instruction and data caches) RAM access collisions are reduced even further. The tight coupling of these "caches" to the RAM allows easy invalidation by write cycles, thus ensuring the unified memory model is preserved.
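A minimal behavioural sketch of one block with its two line latches is given below. The class and method names are hypothetical, and the write policy is an assumption: the code simply invalidates any latch holding the written line, which is one straightforward way to keep the two ports coherent but is not necessarily what the hardware does.

```python
class PortLatch:
    """Single-line latch on one port of one RAM block (illustrative model)."""

    def __init__(self):
        self.line = None       # line address currently held, or None
        self.array_reads = 0   # number of reads that activated the RAM array

    def read(self, line_addr: int) -> str:
        if self.line == line_addr:
            return "latch"     # repeat access: array is bypassed
        self.array_reads += 1  # miss: read the array and refill the latch
        self.line = line_addr
        return "array"

    def invalidate(self, line_addr: int) -> None:
        if self.line == line_addr:
            self.line = None


class RamBlock:
    """One RAM block with separate instruction-side and data-side latches."""

    def __init__(self):
        self.instr = PortLatch()
        self.data = PortLatch()

    def write(self, line_addr: int) -> None:
        # Assumed policy: a write invalidates matching latches on both ports.
        self.instr.invalidate(line_addr)
        self.data.invalidate(line_addr)


block = RamBlock()
print(block.instr.read(3))  # array
print(block.instr.read(3))  # latch
block.write(3)
print(block.instr.read(3))  # array (latch was invalidated by the write)
```

In this model sequential instruction fetches within one line touch the array once and the latch three times, which is where the power and speed benefit would come from.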
The RAM latches also act as a pipeline stage within the memory. Thus as soon as a RAM has completed or decided to bypass a read cycle (write cycles are "fire and forget") the processor is free to output the next address. This occurs in parallel with the RAM driving the return data bus and the processor latching the returned instruction/data.
The two RAM ports are not identical. The instruction port is read-only and is optimised for data throughput. The data port is more capable and, as a consequence, has a slightly lower bandwidth; its optimisations are aimed at reducing the latency of a fetch rather than the overall cycle time. This reflects the typical requirements of the two types of memory access. The result is that the memory cycle times are of different (and varying) lengths; again this has no impact on the function of the system.
A final feature of the RAM system is that it is also dual master. The data bus has an arbiter which allows the on-chip MARBLE bus to take control. This allows external accesses for test purposes, external loading of code and data and DMA transfers.
Unfortunately, at present the RAM cycle time does not keep pace with the processor core. This is due to limited development resource (time, mostly) and will be addressed in the future, as will the issue of cache. The RAM (like the rest of the system) is also designed on process-portable ASIC rules, which gives manufacturing flexibility at the price of access time. Fortunately the greatest limitation (instruction fetch bandwidth) can be alleviated by using Thumb code, which, with 16-bit rather than 32-bit instructions, provides twice as many instructions in the same time.
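The Thumb argument is simple arithmetic, sketched below. It assumes each instruction fetch returns one 32-bit word; the fetch rate is a hypothetical parameter in arbitrary units, not a measured figure for this RAM.

```python
FETCH_BITS = 32   # assumed width of one instruction fetch
ARM_BITS = 32     # ARM instruction width
THUMB_BITS = 16   # Thumb instruction width


def instructions_per_fetch(instr_bits: int) -> int:
    """Whole instructions delivered by one fetch of FETCH_BITS."""
    return FETCH_BITS // instr_bits


def instruction_rate(instr_bits: int, fetches_per_unit_time: float) -> float:
    """Instructions supplied per unit time at a given (hypothetical) fetch rate."""
    return instructions_per_fetch(instr_bits) * fetches_per_unit_time


rate = 10.0  # placeholder fetch rate, arbitrary units
print(instruction_rate(THUMB_BITS, rate) / instruction_rate(ARM_BITS, rate))  # 2.0
```

The ratio is independent of the fetch rate chosen, so the doubling holds however slow the RAM cycle turns out to be, provided the fetch width assumption is right.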
We'll give some more concrete figures when we have them!
The AMULET3 development is supported by the OMI/ATOM project. The designers wish to acknowledge this support from the CEC.