

#### Outline:

- the ARM710T, 720T and 740T
- the SA-110 StrongARM
- the ARM920T and 940T
- the ARM1020E
- the ARM1176JZF
- hands-on: code profiling



- This section presents examples of ARM CPU macrocells
- Many variants exist in each family
  - this should give the broad picture
  - see product manuals for specific details
- Variations:
  - cache size (and organisation)
    - especially in synthesizable macrocells ("-s")
  - instruction set enhancements included
  - coprocessors etc. included






### ARM710T, 720T and 740T

All have an ARM7TDMI processor core, with:

- an 8 Kbyte combined instruction and data cache
  - 4-way set-associative, 16-byte lines
  - write-through
- an AMBA interface
  - shared by instruction and data ports
- an 8 data word, 4 address write buffer
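The cache geometry above fixes how a 32-bit address divides into tag, set index and byte offset. The following is a minimal sketch of that arithmetic, assuming conventional indexing on the low-order address bits; the function name `split_address` is illustrative and not taken from any ARM documentation.

```python
# Geometry taken from the slide: 8 Kbyte, 4-way set-associative, 16-byte lines.
CACHE_BYTES = 8 * 1024
WAYS = 4
LINE_BYTES = 16

SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # number of sets in the cache
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # bits selecting a byte within a line
INDEX_BITS = SETS.bit_length() - 1          # bits selecting the set
TAG_BITS = 32 - INDEX_BITS - OFFSET_BITS    # remaining bits form the tag


def split_address(addr):
    """Split a 32-bit address into (tag, set index, byte offset).

    Assumption: the set is chosen by the address bits just above the offset.
    """
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    print("sets:", SETS, "offset bits:", OFFSET_BITS,
          "index bits:", INDEX_BITS, "tag bits:", TAG_BITS)
    print(split_address(0x00008124))
```

With this geometry there are 128 sets, so two addresses that differ by a multiple of 2 Kbytes compete for the same set of four lines.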



### ARM710T, 720T and 740T

- The ARM710T and 720T have:
  - an MMU with a 64-entry TLB
- The ARM720T also has WinCE support:
  - a ProcessID register that relocates the first 32 Mbytes of memory
  - exception vectors relocatable to 0xffff0000
- The ARM740T has:
  - a simple memory protection unit
    - there is no address translation system
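The ProcessID relocation can be pictured as adding a per-process offset to any address in the first 32 Mbytes before the MMU sees it. The sketch below assumes the ProcessID acts as a 32 Mbyte slot number and that addresses at or above 32 Mbytes pass through unchanged; `relocate` is a hypothetical name, so check the ARM720T manual for the exact register format.

```python
RELOCATED_REGION = 32 * 1024 * 1024   # first 32 Mbytes, as stated on the slide
PID_SHIFT = 25                        # 2**25 bytes == 32 Mbytes


def relocate(virtual_addr, process_id):
    """Return the address presented to the MMU for a given process.

    Assumption: process_id selects one 32 Mbyte slot; only addresses in the
    first 32 Mbytes are moved.
    """
    if virtual_addr < RELOCATED_REGION:
        return virtual_addr + (process_id << PID_SHIFT)
    return virtual_addr


if __name__ == "__main__":
    for pid in (0, 1, 3):
        print(pid, hex(relocate(0x00001000, pid)))
```

Each process can then be linked to run at the same low addresses while occupying a distinct region of the translated address space, which avoids flushing the cache and TLB on every context switch.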

### **ARM710T and 720T organization**






The University of Manchester

#### **ARM740T organization**



### The ARM7x0T cache organization




ARM CPUs - v5 - 8



#### The ARM7x0T write buffer



- One address can be associated with several data elements
  - accommodates STM
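A toy model helps show why separating address entries from data words suits STM. This is only a sketch under stated assumptions: the 4-address, 8-word capacity comes from the earlier slide, but the rule that a store to the next sequential word reuses the current address entry, and the `WriteBuffer` class itself, are illustrative guesses rather than the documented hardware behaviour.

```python
class WriteBuffer:
    """Toy model: up to 4 address entries and 8 data words in total."""

    def __init__(self, max_addresses=4, max_words=8):
        self.max_addresses = max_addresses
        self.max_words = max_words
        self.entries = []   # each entry is [start_address, [word, word, ...]]

    def words_held(self):
        return sum(len(words) for _, words in self.entries)

    def write(self, address, word):
        """Queue one word; return False if the buffer is full (core would stall)."""
        if self.words_held() >= self.max_words:
            return False
        if self.entries:
            start, words = self.entries[-1]
            if address == start + 4 * len(words):
                words.append(word)   # sequential store: reuse the address entry
                return True
        if len(self.entries) >= self.max_addresses:
            return False
        self.entries.append([address, [word]])
        return True


if __name__ == "__main__":
    buf = WriteBuffer()
    for i in range(6):                 # e.g. an STM of six registers
        buf.write(0x1000 + 4 * i, i)
    print(len(buf.entries), "address entry,", buf.words_held(), "words")
```

In this model an STM of six registers consumes one address entry and six data words, whereas six scattered single stores would run out of address entries after four.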




#### **Characteristics**

| Characteristic | ARM710T, 720T | ARM740T |
|----------------|---------------|---------|
| Process        | 0.35 µm       | 0.35 µm |
| Metal layers   | 3             | 3       |
| Vdd            | 3.3 V         | 3.3 V   |
| Transistors    | N/A           | N/A     |
| Core area      | 11.7 mm<sup>2</sup> | 9.8 mm<sup>2</sup> |
| Clock          | 0 to 59 MHz   | 0 to 59 MHz |
| MIPS           | 53            | 53      |
| Power          | 240 mW        | 175 mW  |
| MIPS/W         | 220           | 300     |






### StrongARM

- Developed by Digital
  - in collaboration with ARM Ltd; later bought by Intel
- Features:
  - 5-stage pipeline
  - 16 Kbyte I-cache, 16 Kbyte D-cache
    - both 32-way associative, 8-word lines
  - high clock rate
  - low power consumption
    - due to low voltage (down to 1.5 V) operation

Now obsolete, but it demonstrated that ARMs need not be slow!






### ARM920T

The ARM920T is an ARM9TDMI processor core, with:

- a **16** Kbyte instruction cache
  - 64-way associative, 8 words/line
- a **16** Kbyte copy-back data cache
  - 64-way associative, 8 words/line
- an AMBA interface
  - shared by instruction and data ports
- an 8 word, 4 address write buffer
- an **MMU** with a 64-entry TLB
- CP14 debug coprocessor



### ARM940T

The ARM940T is an ARM9TDMI processor core, with:

- a 4 Kbyte instruction cache
  - smaller, but otherwise similar to ARM920T
- a 4 Kbyte copy-back data cache
- an AMBA interface
  - shared by instruction and data ports
- an 8 word, 4 address write buffer
- a simple memory protection unit
  - there is no address translation system
- CP14 debug coprocessor



#### **ARM920T organization**






#### **ARM940T organization**





#### **Characteristics**

| Characteristic | ARM920T   | ARM940T |
|----------------|-----------|---------|
| Process        | 0.25 µm   | 0.25 µm |
| Metal layers   | 4         | 3       |
| Vdd            | 2.5 V     | 2.5 V   |
| Transistors    | 2,500,000 | 802,000 |
| Core area      | 23-25 mm<sup>2</sup> | 8.1 mm<sup>2</sup> |
| Clock          | 0 to 200 MHz | 0 to 200 MHz |
| MIPS           | 220       | 220     |
| Power          | 560 mW    | 385 mW  |
| MIPS/W         | 390       | 570     |






### **ARM1020E**

A CPU macrocell based on the ARM10TDMI:

- 32 Kbyte 64-bit I- and copy-back D-caches
  - 64-way associative, 8 words/line
- AMBA AHB bus interface
- < … list of other things assumed …>

ARM1020E target characteristics:

| Characteristic | ARM1020E  |
|----------------|-----------|
| Process        | 0.18 µm   |
| Metal layers   | 5         |
| Vdd            | 1.5 V     |
| Transistors    | 7,000,000 |
| Core area      | 12 mm<sup>2</sup> |
| Clock          | 0 to 400 MHz |
| MIPS           | 500       |
| Power          | 400 mW    |
| MIPS/W         | 1250      |
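The MIPS/W figures on the characteristics slides follow from the quoted MIPS and power values. A short sketch that recomputes them, using only numbers from those slides (the dictionary layout and function name are illustrative):

```python
# (MIPS, power in mW) as quoted on the characteristics slides.
CORES = {
    "ARM710T/720T": (53, 240),
    "ARM740T": (53, 175),
    "ARM920T": (220, 560),
    "ARM940T": (220, 385),
    "ARM1020E": (500, 400),
}


def mips_per_watt(mips, power_mw):
    """Convert power to watts and divide the instruction rate by it."""
    return mips / (power_mw / 1000.0)


if __name__ == "__main__":
    for name, (mips, power_mw) in CORES.items():
        print(f"{name:13s} {mips_per_watt(mips, power_mw):7.0f} MIPS/W")
```

The recomputed values agree with the quoted ones once rounded to two significant figures. Dropping the MMU in favour of a protection unit (the x40T parts) lowers power without changing MIPS, which is why those variants score higher.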









### ARM1176JZF

- ARM11 processor core
  - separate instruction and data buses
  - EmbeddedICE
  - debug coprocessor (CP14)
- Instruction and data MMUs
  - two-level TLB
- Vector floating point coprocessor
- Vectored interrupt controller
- Embedded trace



- Up to 64 Kbytes of L1 instruction and data caches
  - 64-bit wide interface
  - 4-way set-associative (lower associativity than previous ARMs)
  - 8 words/line
  - write-through or copy-back
  - pseudo-random or round-robin replacement
  - lockdown: each 'way' can be locked
  - sequential access mode (to save power)
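The two replacement policies and per-way lockdown can be sketched as victim-selection functions. This is an illustration only: the slide does not say how the hardware generates its pseudo-random sequence or how locked ways are skipped, so both functions below are assumptions with hypothetical names.

```python
import random

WAYS = 4  # the slide quotes a 4-way set-associative cache


def round_robin_victim(last_victim, locked):
    """Return the next unlocked way after last_victim, or None if all are locked."""
    for step in range(1, WAYS + 1):
        way = (last_victim + step) % WAYS
        if way not in locked:
            return way
    return None


def pseudo_random_victim(rng, locked):
    """Pick uniformly among the unlocked ways, or None if all are locked."""
    candidates = [way for way in range(WAYS) if way not in locked]
    return rng.choice(candidates) if candidates else None


if __name__ == "__main__":
    locked = {1}                       # way 1 holds locked-down, time-critical code
    print(round_robin_victim(0, locked))
    print(pseudo_random_victim(random.Random(0), locked))
```

Locking a way removes it from the candidate set, so its contents are never evicted; the cost is that the remaining ways behave like a cache of lower associativity.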



- Tightly coupled memory (TCM)
  - configured from part or all of the cache memory {4, 8, 16, 32, 64 KB}
  - physically addressed
  - supported by internal DMA
    - 2 channels (only one active at once)
    - fast transfer between TCM and memory (*not* L1 cache)



## **XScale<sup>®</sup> features**

- An ARM v5TE implementation designed by Intel
  - "7-8 stage Superpipelined RISC"
- Some additional features:
  - a DSP coprocessor (CP0)
    - contains a 40-bit accumulator
    - eight new instructions
  - new page attributes
  - additional coprocessor 15 functionality
  - coprocessor 14 (performance monitoring/software debug)

## **XScale<sup>®</sup> features**

- ARM v5TE (plus extensions)

- dynamic voltage/frequency scaling
- multiply-accumulate coprocessor
- 128-entry static 2-level BTB
- 32 KB instruction cache
- 32 KB data cache
  - 2 KB mini-data cache
  - hit-under-miss operation with data caches
- performance monitoring unit
  - two 32-bit event counters
  - one 32-bit cycle counter
- debug unit trace buffer
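Because the performance counters are only 32 bits wide, readings taken before and after a measured region must be subtracted modulo 2^32. The sketch below assumes one event counter has been configured to count executed instructions and that no counter wraps more than once during the measurement; the function names are illustrative, not part of any XScale API.

```python
COUNTER_MASK = 0xFFFFFFFF  # the counters are 32 bits wide


def counter_delta(start, end):
    """Elapsed count between two readings, tolerating a single wrap-around."""
    return (end - start) & COUNTER_MASK


def cycles_per_instruction(cycles_start, cycles_end, instr_start, instr_end):
    """Average CPI over the measured region.

    Assumption: the event counter readings are instruction counts.
    """
    instructions = counter_delta(instr_start, instr_end)
    if instructions == 0:
        raise ValueError("no instructions counted")
    return counter_delta(cycles_start, cycles_end) / instructions


if __name__ == "__main__":
    # The cycle counter wraps during this measurement; the delta is still correct.
    print(cycles_per_instruction(0xFFFFFF00, 0x00000100, 0, 256))
```

At several hundred MHz a 32-bit cycle counter wraps within seconds, so long measurements need either overflow interrupts or periodic sampling.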




### Some XScale variants

#### PXA255

- 32 KB data and instruction caches
  - plus 2 KB 'mini data cache'
- 200 MHz to 400 MHz
- SDRAM/Flash interfaces
- I/O
  - USB client
  - 1.84 MHz cellular interface
  - PCMCIA controller
  - etc.








### Some XScale variants

#### PXA27x

- PXA270: up to 624 MHz
- PXA271: 32 MB Flash, 32 MB SDRAM, 416 MHz
- PXA272: 64 MB, up to 520 MHz
  - stacked, multichip modules
- Wireless MMX
- lots of peripherals
  - clearly aimed at telephone/PDA applications


#### **XScale DSP extensions**

- DSP coprocessor (CP0)
  - separate multiply-accumulate with 40-bit accumulator
- Wireless MMX
  - 43 new SIMD instructions
  - 64-bit datapath
  - for "multimedia" and games

Different evolutionary path from ARM Ltd.




### Dynamic voltage/frequency scaling

#### In CMOS:

- speed is roughly proportional to supply voltage
- power consumption is *roughly* proportional to supply voltage squared
  - therefore a lower supply voltage gives **better energy efficiency**

#### XScale exploits this:

| Clock frequency | "MIPS" | Power (mW) | MIPS/W |
|-----------------|--------|------------|--------|
| 150MHz          | 185    | 40         | 4600   |
| 600MHz          | 750    | 450        | 1700   |
| 800MHz          | 1000   | 900        | 1100   |
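Dividing power by instruction rate gives the energy spent per instruction, which makes the efficiency trend in the table explicit. A minimal sketch using only the table's figures; the function name is illustrative.

```python
# (clock in MHz, "MIPS", power in mW) from the table above.
OPERATING_POINTS = [(150, 185, 40), (600, 750, 450), (800, 1000, 900)]


def energy_per_instruction_nj(mips, power_mw):
    """Energy per instruction in nanojoules.

    mW / MIPS = (1e-3 J/s) / (1e6 instructions/s) = 1e-9 J per instruction.
    """
    return power_mw / mips


if __name__ == "__main__":
    for clock_mhz, mips, power_mw in OPERATING_POINTS:
        nj = energy_per_instruction_nj(mips, power_mw)
        print(f"{clock_mhz:4d} MHz: {nj:.2f} nJ/instruction")
```

On these figures an instruction at the 800 MHz operating point costs roughly four times the energy of one at 150 MHz, so a workload that can tolerate the lower clock finishes later but drains far less from the battery.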



### Discussion

- Speed needs memory bandwidth:
  - single 32-bit I/D cache on ARM7x0T
  - split 32-bit I- & D-caches on ARM9x0T
  - split 64-bit I- & D-caches on ARM1020E
  - split 64-bit I- & D-caches on ARM1176JZF
    - plus provision for split L2 cache
- Memory management has a high cost
  - greater silicon area
  - protection unit on ARMx40T is cheaper
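The bandwidth point in the first bullet group can be made concrete with a rough peak figure for each core. This is a back-of-envelope sketch only: it assumes every cache port completes one access per clock cycle at the maximum clock quoted on the characteristics slides, which the slides themselves do not state.

```python
# (cache ports, bytes per access, max clock in MHz): widths from the list above,
# clocks from the characteristics slides.
CORES = {
    "ARM7x0T": (1, 4, 59),
    "ARM9x0T": (2, 4, 200),
    "ARM1020E": (2, 8, 400),
}


def peak_cache_bandwidth_mb_s(ports, bytes_per_access, clock_mhz):
    """Peak Mbytes/s, assuming one access per port per clock cycle."""
    return ports * bytes_per_access * clock_mhz


if __name__ == "__main__":
    for name, params in CORES.items():
        print(f"{name:9s} {peak_cache_bandwidth_mb_s(*params):5d} Mbytes/s")
```

Under that assumption, splitting the cache and widening it to 64 bits each contribute a factor of two on top of the clock rate increase.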

### Discussion

- Cache & TLB lock-down helps real-time operation
  - easier to support with CAM-RAM cache
- TCM aids performance *predictability*
  - enhanced by DMA
- Many parts now highly integrated
- Can trade ultimate performance for *energy* efficiency
  - XScale



### Hands-on: code profiling

- Investigate the profiling facilities in the ARM toolkit
  - see what proportion of time a program spends in different routines

Follow the 'Hands-on' instructions
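For readers without the ARM toolkit to hand, the same idea (measuring what proportion of time is spent in each routine) can be tried with Python's standard `cProfile` and `pstats` modules. This is a stand-in for illustration, not the ARM toolkit's profiler, and the workload functions are invented for the example.

```python
import cProfile
import pstats


def inner_loop(n):
    """Deliberately expensive routine."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def cheap_setup():
    """Deliberately cheap routine."""
    return list(range(100))


def workload():
    cheap_setup()
    return inner_loop(200_000)


def profile_workload():
    """Run workload() under the profiler; return {function name: own time in s}."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) to
    # (primitive calls, calls, own time, cumulative time, callers).
    return {key[2]: value[2] for key, value in stats.stats.items()}


if __name__ == "__main__":
    times = profile_workload()
    total = sum(times.values()) or 1.0
    for name, seconds in sorted(times.items(), key=lambda item: -item[1]):
        print(f"{name:40s} {100 * seconds / total:5.1f}%")
```

The report should attribute nearly all of the time to `inner_loop`, which is the kind of hot spot a profiler is meant to expose before any optimisation effort is spent.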