

#### <sup>1</sup>Dipartimento di Fisica e Geologia, Università degli Studi di Perugia

<sup>2</sup>INFN sezione di Perugia

<sup>3</sup>Dipartimento di Farmacia, Università degli Studi G. D'Annunzio

Workshop CCR 2022 - Paestum



- 2 An accelerated system from ground up
  - Hardware
  - Software
- 3 Tests and Benchmarks
  - Tests
  - Benchmark
- 4 Conclusions and Future directions














From the hardware point of view: Accelerators


#### Accelerators

A hardware device or software program designed to improve the performance of certain workloads.

Graphics Processing Unit (GPU)

- High data throughput
- Massively parallel computing

Application-Specific Integrated Circuit (ASIC)

- Highly specialized for a task
- Energy efficient (due to specialization)




A field programmable gate array (FPGA) is an integrated circuit whose logic is re-programmable.

- Parallel computing
- Highly specialized
- Energy efficient





- Array of programmable logic blocks
  - Logic blocks configurable to perform complex functions
- The configuration is specified with a hardware description language (HDL)
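The idea of a configurable logic block can be illustrated with a small software model. The sketch below is only an illustration, not how any specific FPGA or the BondMachine tooling works: it assumes a 2-input lookup table whose 4-bit configuration word is its truth table. The type `LUT2` and its field names are hypothetical.

```go
package main

import "fmt"

// LUT2 models a 2-input lookup table: the 4-bit configuration word is the
// truth table, indexed by the two input bits (hypothetical model).
type LUT2 struct {
	config uint8 // bit i holds the output for inputs (a, b), with i = a<<1 | b
}

// Eval returns the configured output for inputs a and b.
func (l LUT2) Eval(a, b bool) bool {
	idx := uint(0)
	if a {
		idx |= 2
	}
	if b {
		idx |= 1
	}
	return (l.config>>idx)&1 == 1
}

func main() {
	and := LUT2{config: 0b1000} // only index 3 (a=1, b=1) yields 1
	xor := LUT2{config: 0b0110} // indices 1 and 2 yield 1
	inputs := [][2]bool{{false, false}, {false, true}, {true, false}, {true, true}}
	for _, in := range inputs {
		fmt.Println(in[0], in[1], "AND:", and.Eval(in[0], in[1]), "XOR:", xor.Eval(in[0], in[1]))
	}
}
```

The same block computes a different function when only the configuration word changes, which is the sense in which the logic is re-programmable.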

# Firmware generation

Many projects have the goal of abstracting the firmware generation process.



The BondMachine is a software ecosystem for the dynamic generation of computer architectures that can be synthesized on FPGAs and used:

- as standalone devices,
- as clustered devices,
- and, as in this talk, as firmware for computing accelerators.




Poster: "The BondMachine, a mouldable computer architecture", Mirko Mariotti<sup>1</sup>, Daniel Magalotti<sup>2</sup>.

- Introduction: the BondMachine (BM) is a new computer architecture where many Connecting Processors (CP) with different Instruction Set Architectures (ISA) are connected with each other and share resources to form an ensemble perfectly fitted to a specific computational problem.
- The BondMachine architecture: it consists of interconnections among Connecting Processors and Shared Objects (SO) built to implement dedicated tasks. The features of this kind of architecture are the possibility to configure the number of processor cores and their types, the number of inputs and outputs, the interconnection between processors, and the number and type of SOs used by each processor.
- Other poster sections: Connecting Processor, Shared Object, Software Tools, Hardware implementation, Case study, Evolutionary BondMachine, Conclusion.


- CCR: first ideas (2015), poster (2016), talk (2017)
- InnovateFPGA 2018: Iron Award, Grand Final at the Intel Campus (CA), USA
- Invited lectures at the "Advanced Workshop on Modern FPGA Based Technology for Scientific Computing", ICTP 2019
- Invited lectures at the "NiPS Summer School 2019 – Architectures and Algorithms for Energy-Efficient IoT and HPC Applications"
- Golab 2018 talk and ISGC 2019 PoS
- Article published in Parallel Computing, Elsevier, 2022
- PON PhD program

# An accelerated system from ground up
















## Interconnection firmware



Figure: block diagram of the PS (ARM) part and the FPGA part hosting the custom hardware design.







The Advanced eXtensible Interface Protocol

AXI is a communication bus protocol defined by ARM as part of the Advanced Microcontroller Bus Architecture (AMBA) standard. There are 3 types of AXI Interfaces:

- AXI Full: for high-performance memory-mapped requirements.
- AXI Lite: for low-throughput memory-mapped communication.
- AXI Stream: for high-speed streaming data.

Screenshot: AXI interface settings (name, interface type, interface mode, data width in bits, memory size in bytes, number of registers) and the "Enable Interrupt Support" option.






#### Linux

Now that we have custom accelerated hardware, we need a Linux distribution to run on it.

#### **Common Features**

- Complete system build from source
- Allow choice of kernel and bootloader
- Support for modifying packages with patches or custom configuration files
- Can build cross-toolchains for development
- Convenient support for read-only root filesystems
- Support offline builds
- The build configuration files integrate well with SCM tools

#### Yocto

- Convenient sharing of build configuration among similar projects (meta-layers)
- Larger community (Linux Foundation project)
- Can build a toolchain that runs on the target
- A package management system

#### Buildroot

- Simple Makefile approach, easier to understand how the build system works
- Reduced resource requirements on the build machine
- Very easy to customize the final root filesystem (overlays)

Credits: https://jumpnowtek.com/linux/Choosing-an-embedded-linux-build-system.html






#### kernel module

The accelerator endpoints are exposed through memory-mapped AXI as memory locations of the ARM processor running Linux.

To properly use the accelerator from user space, the kernel has to handle the accelerator endpoints and make them available to user space.

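We developed a kernel module for our accelerators. It manages 3 data flows:

- kernel from and to user space (char device),
- kernel to firmware (AXI),
- firmware to kernel (IRQ).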






## Kernel from and to user space: char device

Communication goes through the standard read and write system calls on a kernel-generated char device.

A language has been implemented to express the desired operations.





AXI guarantees the consistency of the transfers to the firmware input ports. Moreover, the data flow from the kernel cannot saturate the PL part.

Figure: firmware development for hybrid processors (PS (ARM) side: App, Linux-based OS).


Firmware to kernel: IRQ

The data flow from the FPGA to the PS part is a different story: data can easily flow fast enough to saturate the PS part and make it completely unusable.

To avoid this, the firmware:

- collects all the changes to send and fills in a list using a dedicated AXI register,
- stops accepting new changes from the IP,
- sends an interrupt request to the kernel.
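The handshake described above can be modelled in software to see why it protects the PS part. The sketch below is a simplified model, not the actual firmware: it assumes the list of changes is a bitmask with one bit per output, and it assumes the kernel handler re-enables the firmware after reading the changes (the slides do not state how the lock is released). All type and method names are hypothetical.

```go
package main

import "fmt"

// Firmware is a software model of the firmware side of the handshake.
type Firmware struct {
	outputs [8]uint32
	change  uint32 // assumed encoding: bit i set means output i changed
	locked  bool   // true while an interrupt is pending
}

// Update records a new output value coming from the IP.
// It is refused while an interrupt is pending.
func (f *Firmware) Update(i int, v uint32) bool {
	if f.locked {
		return false
	}
	f.outputs[i] = v
	f.change |= 1 << uint(i)
	return true
}

// RaiseIRQ stops accepting changes and reports whether an interrupt
// has to be sent to the kernel.
func (f *Firmware) RaiseIRQ() bool {
	if f.change == 0 {
		return false
	}
	f.locked = true
	return true
}

// Service models the kernel handler: it reads the changed outputs, clears
// the change list and lets the firmware accept changes again (assumption).
func (f *Firmware) Service() map[int]uint32 {
	changed := make(map[int]uint32)
	for i := range f.outputs {
		if f.change&(1<<uint(i)) != 0 {
			changed[i] = f.outputs[i]
		}
	}
	f.change = 0
	f.locked = false
	return changed
}

func main() {
	var fw Firmware
	fw.Update(3, 4)
	fw.Update(7, 2)
	if fw.RaiseIRQ() {
		fmt.Println("update accepted while IRQ pending:", fw.Update(0, 9))
		fmt.Println("changed outputs:", fw.Service())
	}
}
```

In this model the kernel handles one interrupt per batch of changes instead of one per change, so the rate of interrupts is bounded by how fast the kernel services them.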



Figure: firmware development for hybrid processors (FPGA side: custom HW design, interconnect, wires).









#### 3 Tests and Benchmarks



Check of the correctness of the accelerator results

Benchmark of the execution














Correctness and module debug

To verify the correct computation of the accelerator:

- a tool monitors the AXI memory,
- inputs are written directly to the AXI memory-mapped input addresses (through devmem),
- the AXI memory-mapped output addresses are then checked.

Register layout shown by `./monitor -g 0x43c00000 -n 8` (one 32-bit register per row, listed as its four byte addresses):

| Register | Byte 3     | Byte 2     | Byte 1     | Byte 0     |
|----------|------------|------------|------------|------------|
| i0       | 0x43c00003 | 0x43c00002 | 0x43c00001 | 0x43c00000 |
| i1       | 0x43c00007 | 0x43c00006 | 0x43c00005 | 0x43c00004 |
| i2       | 0x43c0000b | 0x43c0000a | 0x43c00009 | 0x43c00008 |
| i3       | 0x43c0000f | 0x43c0000e | 0x43c0000d | 0x43c0000c |
| i4       | 0x43c00013 | 0x43c00012 | 0x43c00011 | 0x43c00010 |
| i5       | 0x43c00017 | 0x43c00016 | 0x43c00015 | 0x43c00014 |
| i6       | 0x43c0001b | 0x43c0001a | 0x43c00019 | 0x43c00018 |
| i7       | 0x43c0001f | 0x43c0001e | 0x43c0001d | 0x43c0001c |
| PS2PL    | 0x43c00023 | 0x43c00022 | 0x43c00021 | 0x43c00020 |
| STATES   | 0x43c00027 | 0x43c00026 | 0x43c00025 | 0x43c00024 |
| o0       | 0x43c0002b | 0x43c0002a | 0x43c00029 | 0x43c00028 |
| o1       | 0x43c0002f | 0x43c0002e | 0x43c0002d | 0x43c0002c |
| o2       | 0x43c00033 | 0x43c00032 | 0x43c00031 | 0x43c00030 |
| o3       | 0x43c00037 | 0x43c00036 | 0x43c00035 | 0x43c00034 |
| o4       | 0x43c0003b | 0x43c0003a | 0x43c00039 | 0x43c00038 |
| o5       | 0x43c0003f | 0x43c0003e | 0x43c0003d | 0x43c0003c |
| o6       | 0x43c00043 | 0x43c00042 | 0x43c00041 | 0x43c00040 |
| o7       | 0x43c00047 | 0x43c00046 | 0x43c00045 | 0x43c00044 |
| bench    | 0x43c0004b | 0x43c0004a | 0x43c00049 | 0x43c00048 |
| PL2PS    | 0x43c0004f | 0x43c0004e | 0x43c0004d | 0x43c0004c |
| CHANGE   | 0x43c00053 | 0x43c00052 | 0x43c00051 | 0x43c00050 |

Example write through devmem:

`devmem 0x43c00000 b 1`
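The monitor output lays the registers out as consecutive 32-bit words starting at the base address: the inputs, then PS2PL and STATES, then the outputs, then bench, PL2PS and CHANGE. The sketch below reproduces that address arithmetic so that the address of any register can be computed before using devmem. It is derived only from the layout visible in the monitor screenshots; the type `Register` and the function `registerMap` are hypothetical names, not part of the project tools.

```go
package main

import "fmt"

// Register is one 32-bit word of the memory-mapped block.
type Register struct {
	Name string
	Addr uint32 // address of the least significant byte
}

// registerMap lists the registers in the order shown by the monitor tool:
// n inputs, PS2PL, STATES, n outputs, bench, PL2PS, CHANGE.
func registerMap(base uint32, n int) []Register {
	names := make([]string, 0, 2*n+5)
	for i := 0; i < n; i++ {
		names = append(names, fmt.Sprintf("i%d", i))
	}
	names = append(names, "PS2PL", "STATES")
	for i := 0; i < n; i++ {
		names = append(names, fmt.Sprintf("o%d", i))
	}
	names = append(names, "bench", "PL2PS", "CHANGE")

	regs := make([]Register, len(names))
	for k, name := range names {
		// Each register is 4 bytes wide.
		regs[k] = Register{Name: name, Addr: base + uint32(4*k)}
	}
	return regs
}

func main() {
	for _, r := range registerMap(0x43c00000, 8) {
		fmt.Printf("%-7s 0x%08x\n", r.Name, r.Addr)
	}
}
```

With `n = 13` the same function places PS2PL at 0x43c00034 and CHANGE at 0x43c00078, matching the addresses in the 13-input monitor capture.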

## An example of error

Screenshot: output of `./monitor -g 0x43c00000 -n 13`, showing the registers i0 to i12, PS2PL, STATES, o0 to o12, PL2PS and CHANGE at addresses 0x43c00000 to 0x43c0007b. Values legible in the capture:

| Register | Byte address | Value    |
|----------|--------------|----------|
| o3       | 0x43c00048   | 00000100 |
| o4       | 0x43c0004c   | 00000001 |
| o7       | 0x43c00058   | 00000010 |
| CHANGE   | 0x43c0007a   | 00000111 |





The FPGA benchmarks do not include the PS part overhead (the comparisons are not really fair)

# Benchmark: the CPU (Golang)

```go
package main

import (
	"fmt"
	"time"
)

// matrixtest returns the mean time, in microseconds, of one n x n
// matrix-vector product, averaged over iter repetitions.
func matrixtest(n int, iter int64) float32 {
	matrix := make([][]uint8, n)
	for i := range matrix {
		matrix[i] = make([]uint8, n)
	}
	input := make([]uint8, n)
	output := make([]uint8, n)

	start := time.Now()
	for k := 0; int64(k) < iter; k++ {
		for i := 0; i < n; i++ {
			output[i] = uint8(0)
		}
		for i := 0; i < n; i++ {
			for j := 0; j < n; j++ {
				// The loop body is illegible on the slide; a multiply-accumulate is assumed.
				output[i] += matrix[i][j] * input[j]
			}
		}
	}
	return float32(time.Since(start).Microseconds()) / float32(iter)
}

func main() {
	for i := 2; i <= 32; i++ {
		// The iteration count is garbled on the slide; 100000000 is an assumed reading.
		fmt.Println(i, ",", matrixtest(i, 100000000))
	}
}
```

- Time measurements: built-in Go facilities
- Energy measurements: perf
- CPU: Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz
- Go 1.18.2

| N   |             |          |              |
|-----|-------------|----------|--------------|
| 2   | 0.00543209  | 259280   | 3.858015-00  |
| 3   | 0.01831868  | 454200   | 2.205#TE-06  |
| 4   | 0.02399964  | 722280   | L304002-06   |
| 5   | 0.00632906  | 1870400  | 9.34235-07   |
| 6   | 0.00570083  | 1471400  | 6.796214-41  |
| 7   | 0.07163811  | 1835800  | 5.365828-41  |
| 8   | 0.09997730  | 2737800  | 0.05364E-87  |
| 9   | 0.122397912 | 3429200  | 2.818136-47  |
| 10  | 0.16490378  | 4465500  | 2.239396-01  |
| 11  | 0.00173032  | 5530300  | L80822E-87   |
| 12  | 0.34205632  | 6643300  | L505216-87   |
| 13  | 0.3390.6412 | 7762800  | 1.398338-47  |
| 14  | 0.35400825  | 8954800  | L.13582E-07  |
| 15  | 0.3061176   | 18633508 | 9.40434E-00  |
| 16  | 0.44800504  | 11832200 | 8.455318-00  |
| 17  | 0.5084054   | 13004308 | 7.35542-08   |
| 18  | 0.5063083   | 15124500 | 6.52550-08   |
| 19  | 0.03375605  | 17024430 | 5.306326-00  |
| 20  | 0.708354    | 18718300 | 5.0728-08    |
| 21  | 0.3553206   | 22133800 | 4.517908-00  |
| 22  | 0.0030085   | 22525300 | 4.250706-00  |
| 23  | 0.07467220  | 27348930 | 3.414.714-01 |
| 24  | 1.3031791   | 28358308 | 3.429958-05  |












Benchmarking an IP is not an easy task.

Fortunately we have a custom design and an FPGA.

We can put the benchmark tool inside the accelerator.






## Benchmark core clock cycles distributions




## FPGA benchmark summary


| N  | Single op time (µs) | Register LUTs | Slice LUTs | Power (W) | Single op energy (pJ) | CPs |
|----|---------------------|---------------|------------|-----------|-----------------------|-----|
| 2  | 0.1044              | 947           | 875        | 0.005     | 522                   | 6   |
| 4  | 0.1587              | 1457          | 1813       | 0.015     | 2380.5                | 20  |
| 8  | 0.2819              | 3131          | 4897       | 0.049     | 13813.1               | 72  |
| 13 | 0.4456              | 6422          | 12819      | 0.138     | 61492.8               | 182 |
| 16 | 0.5234              | 7950          | 15979      | 0.160     | 83744                 | 272 |
| 24 | 0.7432              | 10974         | 22669      | 0.199     | 147896.8              | 600 |
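The energy column of the summary table is consistent with the product of the power and the single operation time, and the CPs column follows N(N+1). The sketch below recomputes both from the other columns as a cross-check. It assumes the power is expressed in watts, which the table does not state explicitly but which matches the picojoule figures; the N(N+1) relation is an observed pattern in the table, not a formula given in the slides. Function names are hypothetical.

```go
package main

import "fmt"

// singleOpEnergyPJ returns the energy of one operation in picojoules,
// given the time per operation in microseconds and the power in watts.
func singleOpEnergyPJ(timeUs, powerW float64) float64 {
	// 1 W * 1 us = 1e-6 J = 1e6 pJ
	return timeUs * powerW * 1e6
}

// connectingProcessors returns n*(n+1), the pattern of the CPs column.
func connectingProcessors(n int) int { return n * (n + 1) }

func main() {
	rows := []struct {
		n              int
		timeUs, powerW float64
	}{
		{2, 0.1044, 0.005},
		{4, 0.1587, 0.015},
		{8, 0.2819, 0.049},
		{13, 0.4456, 0.138},
		{16, 0.5234, 0.160},
		{24, 0.7432, 0.199},
	}
	for _, r := range rows {
		fmt.Printf("N=%d energy=%.1f pJ CPs=%d\n",
			r.n, singleOpEnergyPJ(r.timeUs, r.powerW), connectingProcessors(r.n))
	}
}
```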

## Benchmark core




## Comparisons: Performance




## Comparisons: Energy







#### 4 Conclusions and Future directions


Conclusions

- Creating firmware from the ground up is not a mere exercise: it gives perspective on how heterogeneous systems really work and on what an FPGA accelerator really is.
- Even though the methodology and the tools were created specifically for the BondMachine project, they are sufficiently general to be applicable to other FPGA accelerators as well.
- FPGAs are a groundbreaking technology, but they require a change of perspective in how we develop software.



For the project:

- First DAQ use case
- Complete the inclusion of Intel and Lattice FPGAs and try a more performant Zynq based board
- Accelerator in a cloud workflow



- website: http://bondmachine.fisica.unipg.it
- code: https://github.com/BondMachineHQ
- parallel computing paper: link
- contact email: mirko.mariotti@unipg.it