

#### Study of the data exchange between PL and PS of Zynq-7000 devices

Rodrigo A. Melo, Bruno Valinoti (INTI) Marie Baly Amador, Luis G. García (ICTP) Andres Cicuttin, Maria Liz Crespo (ICTP)

**ICTP** 





## **Motivation**

FPGA SoC:

- In 2010 Actel (later Microsemi, now Microchip) introduced SmartFusion (ARM Cortex-M3).
- In 2011 Xilinx introduced Zynq-7000 and Altera (now Intel Programmable Solutions Group) some variants of Cyclone/Arria (2 x ARM Cortex-A9).

Previous attempts:

- Excalibur from Altera (ARM 9 and MIPS microcontrollers)
- Virtex-II Pro and Virtex-4 FX from Xilinx (embedded PowerPC from IBM)

The uP approach had a lower integration level and lacked peripherals. The FPGA SoC solution integrates, in a single device, the software programmability of state-of-the-art processors capable of running an operating system, a wide variety of general-purpose and high-speed peripherals, and several memory controllers, together with the flexibility and scalability of programmable hardware.





#### Advanced Microcontroller Bus Architecture

An open standard for the connection and management of functional blocks in a SoC.



- AMBA 1 (1996): Advanced Peripheral Bus (APB)
- AMBA 2 (1999): Advanced High-performance Bus (AHB)
- AMBA 3 (2003): Advanced Extensible Interface (AXI3)
- AMBA 4 (2010): AXI4

Xilinx was one of the thirty-five companies that contributed to the AMBA 4 specification, and was an early adopter.

Source: ARM AMBA 4 Specification maximizes performance and power efficiency (press release)





## AXI 3 vs 4

Masters and slaves in the PS are AXI3, but AXI4 is the suggested choice for hardware in the PL.

AXI4 extended the maximum burst length from 16 to 256 beats (INCR type only). Additionally, AXI4 defines three interfaces:

- AXI4 (also known as AXI4-Full) for high-performance memory-mapped requirements.
- AXI4-Lite for simple, low-throughput memory-mapped communication (such as control and status registers).
- AXI4-Stream for high-speed streaming data (removes address phase and allows unlimited data burst size).
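To put the burst length limits in numbers, the following is a minimal sketch that computes how many bytes one burst moves and how many bursts (address phases) a transfer needs. The function names are hypothetical, and the sketch ignores alignment and the rule that an AXI burst must not cross a 4 KB address boundary.

```c
#include <stdint.h>

/* Maximum INCR burst lengths in beats, as stated in the text above. */
enum { AXI3_MAX_BEATS = 16, AXI4_MAX_BEATS = 256 };

/* Bytes moved by one burst: number of beats times bytes per beat. */
uint32_t burst_bytes(uint32_t beats, uint32_t data_width_bits)
{
    return beats * (data_width_bits / 8u);
}

/* Number of bursts (address phases) needed to move total_bytes,
 * assuming every burst uses the maximum length (hypothetical helper). */
uint32_t bursts_needed(uint32_t total_bytes, uint32_t max_beats,
                       uint32_t data_width_bits)
{
    uint32_t per_burst = burst_bytes(max_beats, data_width_bits);
    return (total_bytes + per_burst - 1u) / per_burst; /* round up */
}
```

For example, moving 4096 bytes over a 32-bit interface takes 64 bursts with AXI3 limits and 4 bursts with AXI4 limits.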





#### **Vivado AXI Infrastructure**







#### Zynq-7000 All Programmable SoC Overview



Source: Zynq-7000 All Programmable SoC Technical Reference Manual (UG585)

- Cortex-A9 MPCore (r3p0)
- 2 x 32b General Purpose masters (M\_AXI\_GP[1:0])
- 2 x 32b General Purpose slaves (S\_AXI\_GP[1:0])
- 4 x 32/64b High Performance slaves (S\_AXI\_HP[3:0])
- 1 x 64b Accelerator Coherency Port slave (S\_AXI\_ACP)





#### More about AXI ACP and HP



Source: Zynq-7000 All Programmable SoC Technical Reference Manual (UG585)





#### **Data Movement Method Comparison Summary**

| Method | Benefits | Drawbacks | Suggested Uses | Estimated Throughput |
|--------|----------|-----------|----------------|----------------------|
| CPU Programmed I/O | <ul><li>Simple Software</li><li>Least PL Resources</li><li>Simple PL Slaves</li></ul> | <ul><li>Lowest Throughput</li></ul> | <ul><li>Control Functions</li></ul> | <25 MB/s |
| PS DMAC | <ul><li>Least PL Resources</li><li>Medium Throughput</li><li>Multiple Channels</li><li>Simple PL Slaves</li></ul> | <ul><li>Somewhat complex DMA programming</li></ul> | <ul><li>Limited PL Resource DMAs</li></ul> | 600 MB/s |
| PL AXI_HP DMA | <ul><li>Highest Throughput</li><li>Multiple Interfaces</li><li>Command/Data FIFOs</li></ul> | <ul><li>OCM/DDR access only</li><li>More complex PL Master design</li></ul> | <ul><li>High Performance DMA for large datasets</li></ul> | 1,200 MB/s (per interface) |
| PL AXI_ACP DMA | <ul><li>Highest Throughput</li><li>Lowest Latency</li><li>Optional Cache Coherency</li></ul> | <ul><li>Large burst might cause cache thrashing</li><li>Shares CPU Interconnect bandwidth</li><li>More complex PL Master design</li></ul> | <ul><li>High Performance DMA for smaller, coherent datasets</li><li>Medium granularity CPU offload</li></ul> | 1,200 MB/s |
| PL AXI_GP DMA | <ul><li>Medium Throughput</li></ul> | <ul><li>More complex PL Master design</li></ul> | <ul><li>PL to PS Control Functions</li><li>PS I/O Peripheral Access</li></ul> | 600 MB/s |

The estimated throughputs match the raw bus bandwidth:

$$MB/s = MHz * \frac{bits}{8}$$

- PL frequency is 150 MHz.
- Data width is 32/64 bits.

Where is the protocol overhead?
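The arithmetic behind the estimated figures can be sketched as follows. This is only the raw bus bandwidth (one beat per clock cycle, no protocol overhead); the function name is hypothetical.

```c
/* Raw bus bandwidth in MB/s: MHz * bits / 8.
 * Assumes one data beat on every clock cycle and no protocol overhead. */
double raw_bandwidth_mbps(double freq_mhz, unsigned data_width_bits)
{
    return freq_mhz * (double)data_width_bits / 8.0;
}
```

With the 150 MHz PL clock, a 32-bit interface gives 600 MB/s and a 64-bit interface gives 1,200 MB/s, which are the values in the table.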

Source: Zynq-7000 All Programmable SoC Technical Reference Manual (UG585)





#### System-Level Address Map

| Address Range                         | CPUs and ACP | AXI_HP | Other Bus Masters <sup>(1)</sup> | Notes                                                 |
|---------------------------------------|--------------|--------|----------------------------------|-------------------------------------------------------|
| 0000_0000 to 0003_FFFF <sup>(2)</sup> | OCM          | OCM    | OCM                              | Address not filtered by SCU and OCM is mapped low     |
|                                       | DDR          | OCM    | OCM                              | Address filtered by SCU and OCM is mapped low         |
|                                       | DDR          |        |                                  | Address filtered by SCU and OCM is not mapped low     |
|                                       |              |        |                                  | Address not filtered by SCU and OCM is not mapped low |
| 0004_0000 to 0007_FFFF                | DDR          |        |                                  | Address filtered by SCU                               |
|                                       |              |        |                                  | Address not filtered by SCU                           |
| 0008_0000 to 000F_FFFF                | DDR          | DDR    | DDR                              | Address filtered by SCU                               |
|                                       |              | DDR    | DDR                              | Address not filtered by SCU <sup>(3)</sup>            |
| 0010_0000 to 3FFF_FFFF                | DDR          | DDR    | DDR                              | Accessible to all interconnect masters                |
| 4000_0000 to 7FFF_FFFF                | PL           |        | PL                               | General Purpose Port #0 to the PL, M_AXI_GP0          |
| 8000_0000 to BFFF_FFFF                | PL           |        | PL                               | General Purpose Port #1 to the PL, M_AXI_GP1          |
| E000_0000 to E02F_FFFF                | IOP          |        | IOP                              | I/O Peripheral registers, see Table 4-6               |
| E100_0000 to E5FF_FFFF                | SMC          |        | SMC                              | SMC Memories, see Table 4-5                           |
| F800_0000 to F800_0BFF                | SLCR         |        | SLCR                             | SLCR registers, see Table 4-3                         |
| F800_1000 to F880_FFFF                | PS           |        | PS                               | PS System registers, see Table 4-7                    |
| F890_0000 to F8F0_2FFF                | CPU          |        |                                  | CPU Private registers, see Table 4-4                  |
| FC00_0000 to FDFF_FFFF <sup>(4)</sup> | Quad-SPI     |        | Quad-SPI                         | Quad-SPI linear address for linear mode               |
| FFFC_0000 to FFFF_FFFF <sup>(2)</sup> | OCM          | OCM    | OCM                              | OCM is mapped high                                    |
|                                       |              |        |                                  | OCM is not mapped high                                |

Source: Zynq-7000 All Programmable SoC Technical Reference Manual (UG585)
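As an illustration of how the address map is read from the CPU side, the sketch below classifies a 32-bit address into the ranges of the table. The enum and function names are hypothetical; the low 1 MB is reported as a single region because its actual target (OCM or DDR) depends on SCU filtering and OCM mapping, and addresses outside the listed ranges are simply reported as unlisted.

```c
#include <stdint.h>

/* Regions of the table above, seen from the CPUs (hypothetical names). */
typedef enum {
    REGION_LOW_OCM_OR_DDR, /* 0000_0000 to 000F_FFFF, depends on SCU/OCM setup */
    REGION_DDR,            /* 0010_0000 to 3FFF_FFFF */
    REGION_PL_GP0,         /* 4000_0000 to 7FFF_FFFF, M_AXI_GP0 */
    REGION_PL_GP1,         /* 8000_0000 to BFFF_FFFF, M_AXI_GP1 */
    REGION_IOP,            /* E000_0000 to E02F_FFFF */
    REGION_SMC,            /* E100_0000 to E5FF_FFFF */
    REGION_SLCR,           /* F800_0000 to F800_0BFF */
    REGION_PS,             /* F800_1000 to F880_FFFF */
    REGION_CPU_PRIVATE,    /* F890_0000 to F8F0_2FFF */
    REGION_QSPI,           /* FC00_0000 to FDFF_FFFF */
    REGION_OCM_HIGH,       /* FFFC_0000 to FFFF_FFFF, if OCM is mapped high */
    REGION_UNLISTED        /* not covered by any row of the table */
} zynq_region_t;

/* Classify an address using inclusive range checks. */
zynq_region_t zynq_cpu_region(uint32_t addr)
{
    if (addr <= 0x000FFFFFu) return REGION_LOW_OCM_OR_DDR;
    if (addr <= 0x3FFFFFFFu) return REGION_DDR;
    if (addr <= 0x7FFFFFFFu) return REGION_PL_GP0;
    if (addr <= 0xBFFFFFFFu) return REGION_PL_GP1;
    if (addr >= 0xE0000000u && addr <= 0xE02FFFFFu) return REGION_IOP;
    if (addr >= 0xE1000000u && addr <= 0xE5FFFFFFu) return REGION_SMC;
    if (addr >= 0xF8000000u && addr <= 0xF8000BFFu) return REGION_SLCR;
    if (addr >= 0xF8001000u && addr <= 0xF880FFFFu) return REGION_PS;
    if (addr >= 0xF8900000u && addr <= 0xF8F02FFFu) return REGION_CPU_PRIVATE;
    if (addr >= 0xFC000000u && addr <= 0xFDFFFFFFu) return REGION_QSPI;
    if (addr >= 0xFFFC0000u) return REGION_OCM_HIGH;
    return REGION_UNLISTED;
}
```

A PL slave attached to M_AXI_GP0 at, say, 0x4000_0000 would therefore be classified as `REGION_PL_GP0`.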





## **Zynq AXI Configurations**

PS-PL Configuration page of the Zynq block design customization (screenshot):

| Group                         | Name                  | Select | Description                                                   |
|-------------------------------|-----------------------|--------|---------------------------------------------------------------|
| General                       |                       |        |                                                               |
|                               | AXI Non Secure Enablement | 0  | Enable AXI Non Secure Transaction                             |
| GP Master AXI Interface       | M AXI GP0 interface   | ✓      | Enables General purpose AXI master interface 0                |
|                               | M AXI GP1 interface   |        | Enables General purpose AXI master interface 1                |
| GP Slave AXI Interface        | S AXI GP0 interface   | ✓      | Enables General purpose 32-bit AXI Slave interface 0          |
|                               | S AXI GP1 interface   |        | Enables General purpose 32-bit AXI Slave interface 1          |
| HP Slave AXI Interface        | S AXI HP0 interface   | ✓      | Enables AXI high performance slave interface 0                |
|                               | S AXI HP0 DATA WIDTH  | 64     | Allows HP0 to be used in 32/64 bit data width mode            |
|                               | S AXI HP1 interface   |        | Enables AXI high performance slave interface 1                |
|                               | S AXI HP2 interface   |        | Enables AXI high performance slave interface 2                |
|                               | S AXI HP3 interface   |        | Enables AXI high performance slave interface 3                |
| ACP Slave AXI Interface       | S AXI ACP interface   | ✓      | Enables AXI coherent 64-bit slave interface                   |
|                               | Tie off AxUSER        | ✓      | Tie off AxUSER signals to high, enabling coherency when all … |
| DMA Controller                |                       |        |                                                               |
| PS-PL Cross Trigger interface |                       |        | Enables PL cross trigger signals to PS and vice-versa         |

To enable cache coherency with ACP, the AXI signals AxCACHE must be **XX11** and AxUSER must have all its bits tied high.
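The coherency condition stated above can be written as a simple check, for example in a testbench model or a software-side sanity test. This is a sketch only: the function name is hypothetical, and the 5-bit AxUSER width is an assumption about the S_AXI_ACP port that should be checked against the actual interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: AxUSER on S_AXI_ACP is 5 bits wide. */
#define ACP_AXUSER_WIDTH 5u

/* Returns true when a request meets the condition from the text:
 * AxCACHE == XX11 (two low bits set) and all AxUSER bits high. */
bool acp_request_is_coherent(uint8_t axcache, uint8_t axuser)
{
    const uint8_t user_mask = (uint8_t)((1u << ACP_AXUSER_WIDTH) - 1u);
    bool cache_ok = (axcache & 0x3u) == 0x3u;
    bool user_ok  = (axuser & user_mask) == user_mask;
    return cache_ok && user_ok;
}
```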





#### **Developed IPs**









#### **AXI3 Burst Sniffer**



SLOT0-3 are AXI3 interfaces in monitor mode, which have only INPUT ports.

Ports and Interfaces of the packaged AXI3_BurstSniffer IP (Vivado Package IP screenshot):

| Name    | Interface Mode |
|---------|----------------|
| S_AXIL  | slave          |
| SLOT_0  | monitor        |
| SLOT_1  | monitor        |
| SLOT_2  | monitor        |
| SLOT_3  | monitor        |
| aresetn | slave          |
| aclk    | slave          |





#### **Block Designs**







#### Cycles measurement in the PS








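```c
/* Samples received from the PL, aligned to 32 bytes (Cortex-A9 cache line size) */
int data[ROWS][COLS] __attribute__ ((aligned (32)));
...
int row, col;
...
/* PL cycles of one frame: difference between its last and first sample */
pl_cycles = data[row][COLS-1] - data[row][0];
```

```c
#include "xtime_l.h"
...
XTime tStart[ROWS], tEnd[ROWS];
...
XTime_GetTime(&tStart[row]);  /* timestamp before the transfer */
...
// do something to be measured here
XTime_GetTime(&tEnd[row]);    /* timestamp after the transfer */
...
/* XTime counts at half the CPU frequency, so multiply by 2 to get CPU cycles */
ps_cycles = 2 * (tEnd[row] - tStart[row]);
```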

$$MB/s = \frac{FREQUENCY * SAMPLES * BYTES}{CYCLES}$$
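The data rate formula can be sketched in C as shown below, with the frequency expressed in MHz so that the result is in MB/s. The function name is hypothetical. The frame size used in the usage note (1024 samples of 4 bytes) is not stated in the slides; it is inferred from the measured values, which it reproduces.

```c
/* Data rate in MB/s from a cycle count:
 * FREQUENCY (MHz) * SAMPLES * BYTES / CYCLES. */
double throughput_mbps(double freq_mhz, unsigned samples,
                       unsigned bytes_per_sample, unsigned long cycles)
{
    return freq_mhz * (double)samples * (double)bytes_per_sample
           / (double)cycles;
}
```

For instance, `throughput_mbps(650.0, 1024, 4, 96954)` is about 27.46 MB/s on the PS side, and `throughput_mbps(150.0, 1024, 4, 22358)` is about 27.48 MB/s on the PL side, matching the first row of the measured cycles table.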





#### **Zynq Interfaces Summary**







#### **Measured cycles**

| Interface | Variant               | Burst | Between data: min | Between data: typ | Between data: max | Per frame: PS cycles (MB/s) | Per frame: PL cycles (MB/s) | PS/PL |
|-----------|-----------------------|-------|-----|-----|------|----------------|---------------|-------|
| EMIO      | GPIO (XGpioPs_Read)   | No    | 20  | 21      | 29   | 96954 (27.46)  | 22358 (27.48) | 4.33  |
| EMIO      | GPIO (Xil_In32)       | No    | 20  | 20      | 31   | 92502 (28.78)  | 21330 (28.80) | 4.33  |
| M_AXI_GP  | AXI Lite (Xil_In32)   | No    | 28  | 28      | 33   | 124386 (21.40) | 28689 (21.41) | 4.33  |
| M_AXI_GP  | AXI Full (Xil_In32)   | No    | 24  | 24      | 26   | 106588 (24.97) | 24581 (24.99) | 4.33  |
| M_AXI_GP  | AXI Lite (memcpy)     | No    | 19  | 20      | 31   | 90973 (29.26)  | 20974 (29.29) | 4.33  |
| M_AXI_GP  | AXI Full (memcpy)     | No    | 15  | 16      | 25   | 73336 (36.30)  | 16910 (36.33) | 4.33  |
| S_AXI_GP  | AXI Lite              | No    | 44  | 44      | 45   | 200229 (13.29) | 46075 (13.33) | 4.34  |
| S_AXI_HP  | AXI Lite              | No    | 36  | 36      | 37   | 160386 (16.59) | 36865 (16.66) | 4.35  |
| S_AXI_ACP | AXI Lite              | No    | 36  | 36      | 36   | 160389 (16.59) | 36864 (16.66) | 4.35  |
| S_AXI_GP  | AXI Full              | Yes   | 1   | 4       | 59   | 21962 (121.22) | 4868 (126.21) | 4.51  |
| S_AXI_HP  | AXI Full              | Yes   | 1   | 3       | 40   | 16669 (159.72) | 3675 (167.18) | 4.53  |
| S_AXI_ACP | AXI Full              | Yes   | 1   | 3       | 37   | 15506 (171.70) | 3409 (180.22) | 4.54  |
| M_AXI_GP  | AXI Full with PS DMA  | Yes   | 1   | 1       | 4    | 11425 (233.3)  | 1213 (506.51) | 9.41  |
| S_AXI_GP  | AXI Full with AXI DMA | Yes   | 1   | 1       | 571  | 7245 (367.48)  | 1654 (371.46) | 4.38  |
| S_AXI_HP  | AXI Full with AXI DMA | Yes   | 1   | 1       | 381  | 6048 (440.21)  | 1397 (439.79) | 4.32  |
| S_AXI_ACP | AXI Full with AXI DMA | Yes   | 1   | 1       | 422  | 6154 (432.62)  | 1418 (433.28) | 4.33  |

The ideal PS/PL ratio is 650 MHz / 150 MHz = 4.33.





#### Custom AXI master vs AXI DMA







#### **Custom AXI Master improvement**

| Interface | Variant               | Between data: min | Between data: typ | Between data: max | Per frame: PS cycles (MB/s) | Per frame: PL cycles (MB/s) | PS/PL |
|-----------|-----------------------|-----|-----|-----------|----------------|---------------|-------|
| S_AXI_GP  | AXI Lite              | 44  | 44  | 45        | 200229 (13.29) | 46075 (13.33) | 4.34  |
| S_AXI_HP  | AXI Lite              | 36  | 36  | 37        | 160386 (16.59) | 36865 (16.66) | 4.35  |
| S_AXI_ACP | AXI Lite              | 36  | 36  | 36        | 160389 (16.59) | 36864 (16.66) | 4.35  |
| S_AXI_GP  | AXI Full              | 1   | 4   | 59        | 21962 (121.22) | 4868 (126.21) | 4.51  |
| S_AXI_HP  | AXI Full              | 1   | 3   | 40        | 16669 (159.72) | 3675 (167.18) | 4.53  |
| S_AXI_ACP | AXI Full              | 1   | 3   | 37        | 15506 (171.70) | 3409 (180.22) | 4.54  |
| S_AXI_GP  | AXI Full with AXI DMA | 1   | 1   | 571       | 7245 (367.48)  | 1654 (371.46) | 4.38  |
| S_AXI_HP  | AXI Full with AXI DMA | 1   | 1   | 381       | 6048 (440.21)  | 1397 (439.79) | 4.32  |
| S_AXI_ACP | AXI Full with AXI DMA | 1   | 1   | 422       | 6154 (432.62)  | 1418 (433.28) | 4.33  |

 $\downarrow \downarrow \downarrow \downarrow \downarrow \downarrow \downarrow$ 

| Interface | Variant  | Between data: min | Between data: typ | Between data: max | Per frame: PS cycles | Per frame: PL cycles | PS/PL |
|-----------|----------|-----|-----|-----|---------------------|--------------------|-------|
| S_AXI_GP  | AXI Lite | 3   | 3   | 4   | 14382 (185.12 MB/s) | 3187 (192.78 MB/s) | 4.51  |
| S_AXI_HP  | AXI Lite | 3   | 3   | 3   | 13952 (190.82 MB/s) | 3072 (200.00 MB/s) | 4.54  |
| S_AXI_ACP | AXI Lite | 3   | 5   | 8   | 26769 (99.45 MB/s)  | 5963 (103.03 MB/s) | 4.48  |
| S_AXI_GP  | AXI Full | 1   | 1   | 5   | 6677 (398.74 MB/s)  | 1406 (436.98 MB/s) | 4.74  |
| S_AXI_HP  | AXI Full | 1   | 1   | 4   | 6456 (412.39 MB/s)  | 1342 (457.82 MB/s) | 4.81  |
| S_AXI_ACP | AXI Full | 1   | 1   | 5   | 6684 (398.32 MB/s)  | 1406 (436.98 MB/s) | 4.75  |
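One way to quantify the improvement is the ratio of PL cycles per frame before and after the change. The helper below is a hypothetical sketch of that arithmetic, using only cycle counts taken from the two tables.

```c
/* Speed-up factor as the ratio of cycle counts for the same frame. */
double speedup(unsigned long cycles_before, unsigned long cycles_after)
{
    return (double)cycles_before / (double)cycles_after;
}
```

For S_AXI_GP, `speedup(46075, 3187)` is roughly 14.5 for the AXI Lite master and `speedup(4868, 1406)` is roughly 3.5 for the AXI Full master. Compared with the AXI DMA on the same port, `speedup(1654, 1406)` is roughly 1.18.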





## Conclusions

- If burst transactions are not going to be used (neither DMA nor cache), use AXI Lite interfaces: they are simpler and consume fewer PL resources.
- The AXI interfaces provided by the IP packager could/must be improved:
  - AXI Lite interfaces consume an extra cycle per operation.
  - The AXI Full slave does not work with bursts.
  - The address phase of the AXI Full master can be moved to coincide with TLAST (this is what AXI DMA does).
  - The write response channel can be ignored to improve the data rate (this is what AXI DMA does, but it IS NOT COMPLIANT WITH THE AMBA AXI SPEC).
- When 32-bit data is used on 64-bit interfaces, the burst transactions involve 64-bit transfers with one cycle between them.
- It seems that the PS DMA driver could be improved to obtain very high data rates.
- The main disadvantage of the GP interfaces is their 32-bit data width, which results in slightly lower data rates than those observed with HP/ACP.



#### INTI-CMNB-FPGA

#### Rodrigo A. Melo

- rmelo@inti.gob.ar
- LinkedIn: rodrigoalejandromelo
- @rodrigomelo9ok
- rodrigomelo9

#### Bruno Valinoti

- valinoti@inti.gob.ar
- LinkedIn: bruno-valinoti

Attribution-ShareAlike 4.0 International: http://creativecommons.org/licenses/by-sa/4.0/

## Thanks!