

### <u>Mirko Mariotti</u> <sup>1,2</sup> Giulio Bianchini <sup>1</sup> Loriano Storchi <sup>3,2</sup> Daniele Spiga <sup>2</sup> Diego Ciangottini <sup>2</sup> Giuseppe Prudente <sup>2</sup>

<sup>1</sup>Dipartimento di Fisica e Geologia, Università degli Studi di Perugia

<sup>2</sup>INFN sezione di Perugia

<sup>3</sup>Dipartimento di Farmacia, Università degli Studi G. D'Annunzio

M.Mariotti, Workshop CCR 2025



Spoke 2 use cases - Ultra-fast algorithms running on FPGA

Development of a Customizable Framework for Multi-FPGA Accelerator Generation via architectures


FPGA R&D

We mainly focus on using FPGA as a hardware accelerator for scientific computing.

We are interested in both low-level programming and high-level synthesis.

We are also proposing a new architecture called BondMachine (BM) that is designed to be used as a hardware accelerator.




### The BondMachine Framework

The BondMachine is an open-source software ecosystem for the dynamic generation of computer architectures that can be synthesized on FPGAs.

- High-level programming language (Golang) for both the hardware and the software
- Functional style programming
- Architecture generating compiler
- Computational graph and Machine Learning Models



The BondMachine, a moldable computer architecture - doi.org/10.1016/j.parco.2021.102873 - https://www.bondmachine.it



- Latency and throughput analysis
- Occupancy analysis
- Energy efficiency analysis
- Comparison with other architectures
- Numerical precision analysis
- Data type and/or instruction set analysis






| Solution    | LUTs   | REGs  | Time / Inf (µs) |
|-------------|--------|-------|-----------------|
| HLS4ML      | 10.31% | 6.89% | ~ 0.71          |
| BondMachine | 15.73% | 7.94% | ~ 1.4           |

| Platform        | Time / Inf (s) | En. / Inf (J) |
|-----------------|----------------|---------------|
| ARM Cortex A9   | 108 10E-02     | 10E-06        |
| Intel i7-1260P  | 108 E-06       | 10E-04        |
| NVIDIA Tesla T4 | 108 -04        | 10E-03        |
| ZedBoard BM     | 108 0E-06      | 10E-08        |


| Data Type   | LUTs (count) | LUTs (%) | REGs (count) | REGs (%) | DSPs (count) | DSPs (%) | Latency (µs)     | Accuracy (%) |
|-------------|--------------|----------|--------------|----------|--------------|----------|------------------|--------------|
| float32     | 476416       | 36.54    | 456235       | 17.50    | 954          | 10.57    | $12.29 \pm 0.15$ | 100          |
| float16     | 288944       | 22.16    | 298191       | 11.44    | 479          | 5.31     | $8.65 \pm 0.15$  | 99.17        |
| flpe7f22    | 423915       | 32.52    | 352113       | 13.50    | 950          | 10.53    | $6.23 \pm 0.18$  | 100.00       |
| flpe5f11    | 393657       | 30.20    | 318821       | 12.23    | 477          | 5.29     | $4.46 \pm 0.21$  | 100.00       |
| flpe6f10    | 442809       | 33.97    | 334414       | 12.83    | 4            | 0.04     | $4.49 \pm 0.18$  | 100.00       |
| flpe4f9     | 347633       | 26.67    | 275653       | 10.57    | 4            | 0.04     | $2.80 \pm 0.15$  | 97.78        |
| flpe5f8     | 299033       | 22.94    | 261403       | 10.03    | 4            | 0.04     | $3.31 \pm 0.12$  | 99.74        |
| flpe6f4     | 274523       | 21.06    | 236429       | 9.07     | 4            | 0.04     | $2.72 \pm 0.23$  | 96.39        |
| fixed<16,8> | 205071       | 15.73    | 207670       | 7.94     | 477          | 5.29     | $1.39 \pm 0.06$  | 86.03        |
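To make the precision trade-off in the table more concrete, the following minimal sketch rounds a value to a floating-point format with a chosen number of exponent and fraction bits. It assumes that a name such as flpe5f11 denotes 5 exponent bits and 11 fraction bits, which is a reading of the naming and is not stated in the slides. The helper `quantize` is hypothetical, ignores subnormals and special values, and is not part of the BondMachine toolchain.

```python
import math

def quantize(x: float, exp_bits: int, frac_bits: int) -> float:
    """Round x to the nearest value with `frac_bits` fraction bits.

    Hypothetical helper: IEEE-like bias, no subnormals, no inf/NaN handling.
    """
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    _, e = math.frexp(abs(x))   # abs(x) == m * 2**e with 0.5 <= m < 1
    e -= 1                      # exponent of the normalized form 1.f * 2**e
    if e > bias:
        raise OverflowError("value too large for the chosen exponent width")
    if e < 1 - bias:
        return math.copysign(0.0, x)   # flush values below the normal range to zero
    scale = 2.0 ** (frac_bits - e)
    return math.copysign(round(abs(x) * scale) / scale, x)

# Assumed mapping from data type name to (exponent bits, fraction bits).
x = math.pi
for name, e_bits, f_bits in [("flpe7f22", 7, 22), ("flpe5f11", 5, 11), ("flpe6f4", 6, 4)]:
    q = quantize(x, e_bits, f_bits)
    print(f"{name}: {q!r}  abs error = {abs(q - x):.3e}")
```

Fewer fraction bits give a coarser grid of representable values, which is one plausible source of the accuracy drop seen for the smallest formats.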

R&D: Problems

- SoC, edge and low power computing
- Machine Learning (inference)

### Quantum Computing Simulation with FPGA

We started experimenting with quantum computing. Our main interest is using FPGAs to simulate quantum computers.

The goal is to experiment with classical/quantum hybrid computing backed by CPU/FPGA hardware.

The work plan follows four main directions:

- Activity 1: learning and experimenting with reference quantum tools, and establishing a testing framework to validate and compare the results of different quantum simulators.
- Activity 2: BondMachine-based quantum simulator.
- Activity 3: HLS-based quantum simulator.
- Activity 4: symbolic quantum operator FPGA-based simulator.

## Validation

#### Activity 1

To test the correctness of the quantum simulator we are developing, we need to compare its results with those of well-known quantum simulators.

We set up a validation framework in the bmqsimtests repository, at the URL: https://github.com/BondMachineHQ/bmqsimtests

The repository is organized in two levels of directories. The first level is the quantum circuit to simulate, the second level is the specific simulator to use. A Jupyter notebook is provided to run the tests and compare the results.

The readme.md file contains the instructions to run the tests and describes the two-level directory structure of the tests.
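Comparing the outputs of two simulators usually needs a tolerance and should not depend on a global phase. A minimal sketch of such a check, based on the state fidelity, could look as follows. The function names `fidelity` and `states_agree` and the tolerance are illustrative assumptions and are not taken from the bmqsimtests repository.

```python
import math

def fidelity(a, b):
    """Return |<a|b>|^2 for two normalized state vectors (sequences of numbers)."""
    if len(a) != len(b):
        raise ValueError("state vectors must have the same length")
    overlap = sum(complex(x).conjugate() * complex(y) for x, y in zip(a, b))
    return abs(overlap) ** 2

def states_agree(a, b, tol=1e-6):
    """True when the two states coincide up to a global phase, within `tol`."""
    return 1.0 - fidelity(a, b) <= tol

s = 1 / math.sqrt(2)
reference = [s, 0, 0, s]               # Bell state, e.g. from a reference simulator
candidate = [1j * s, 0, 0, 1j * s]     # same state multiplied by a global phase i

print(states_agree(reference, candidate))      # same physical state
print(states_agree(reference, [1, 0, 0, 0]))   # a different state
```

Using the fidelity rather than an element-wise comparison avoids false mismatches when two simulators differ only by a global phase convention.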








Given the capabilities of the BondMachine in terms of parallelism, speed, and customizability of the instruction set and of the numerical precision, it is natural to ask whether the BondMachine could be used to simulate quantum computers.



A quantum computer simulator called bmqsim has been developed and is available within the BondMachine project.


## Quantum Circuit

Activity 2

The first ingredient for bmqsim is a quantum circuit. The quantum circuit is a sequence of quantum gates represented by a sequence of matrices. The "program" is a .bmq file that contains code similar to QASM code.



Independently of the backend, bmqsim translates the .bmq file into N matrices.
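The translation of a circuit into a sequence of matrices can be illustrated with a small two-qubit example: an X gate on q0 followed by a CX on q0,q1, starting from the zero state. This is only a sketch of the idea, not the bmqsim implementation. The basis ordering (q0 as the most significant bit) and the helper names `kron` and `matvec` are assumptions made for the example.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matvec(m, v):
    """Matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
# CX with q0 as control and q1 as target, basis ordered |q0 q1> (assumed convention).
CX = [[1, 0, 0, 0],
      [0, 1, 0, 0],
      [0, 0, 0, 1],
      [0, 0, 1, 0]]

# Each gate is expanded to a full 4x4 matrix acting on the whole register.
matrices = [kron(X, I2), CX]

state = [1, 0, 0, 0]          # |00>
for m in matrices:
    state = matvec(m, state)

print(state)                  # [0, 0, 0, 1], i.e. |11>
```

With n qubits every expanded matrix has 2^n x 2^n elements, which is why the size and sparsity of the matrices matter for the hardware backends.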

A command-line option selects the backend to use.

### Backend: Software Simulation

Here the quantum gates are simulated on the CPU. This is the slowest backend, but it is useful for circuit design, debugging and testing. An example:

    %%bash
    cat program.bmq

prints the circuit:

    %block code1 .sequential
        qbits   q0,q1
        zero    q0,q1
        x       q0
        cx      q0,q1
    %endblock
    %meta bmdef global main:code1

and the simulation is run with:

    %%bash
    bmqsim -software-simulation -software-simulation-input inputs.json -software-simulation-output outputs.json program.bmq


### Backend: Hardcoded matrices sequence

This backend generates hardware that applies the sequence of matrices to each state of the quantum register.

For each matrix operation a dedicated processor is used. Within the processor, the matrix elements of all the gates are hardcoded.

#### Pros:

- The matrix elements of the gates are already inside each processor. There is no movement of big matrices.
- Fast

#### Cons:

- The circuit is fixed: to use a different circuit, the hardware has to be re-synthesized.
- Matrices are fully expanded. This may lead to large hardware.
- Sparse matrices use hardware anyway.

### Backend: Loadable matrices sequence

Similar to the previous backend, but the matrices are loaded from the command line of the final application. This allows changing the matrices without recompiling the hardware.

This requires a small boot loader on every processor and a protocol to load the matrix elements from the final application.

#### Pros:

- The matrix elements of the gates are already inside each processor. There is no movement of big matrices.
- Fast
- The circuit is fixed, but a new circuit can be injected by the final application.

#### Cons:

- Matrices are fully expanded. This may lead to large hardware.
- Sparse matrices use hardware anyway.

### Backend: Full hardware deploy

In this backend, the quantum circuit is synthesized in full hardware. Instead of having a state that is updated by each gate, only the relevant parts of the state are updated, keeping track of the entanglement of the qubits and the sparse nature of the matrices.

#### Pros:

- Fast
- Fewer resources used with respect to the previous backends

#### Cons:

- The circuit is fixed and cannot be changed.
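The idea of touching only the relevant parts of the state can be illustrated in software with a common technique: a gate is applied directly to the pairs of amplitudes it affects, without ever building the expanded matrix. The sketch below is not how the full hardware backend is implemented. The helper names and the convention that bit k of the basis index holds qubit k are assumptions made for the example.

```python
def apply_1q(state, gate, target):
    """Apply a 2x2 gate to qubit `target` by updating amplitude pairs in place."""
    stride = 1 << target
    for i in range(len(state)):
        if (i & stride) == 0:          # visit each pair once, from its lower index
            j = i | stride
            a0, a1 = state[i], state[j]
            state[i] = gate[0][0] * a0 + gate[0][1] * a1
            state[j] = gate[1][0] * a0 + gate[1][1] * a1
    return state

def apply_cx(state, control, target):
    """CX only swaps the amplitude pairs whose control bit is 1."""
    c, t = 1 << control, 1 << target
    for i in range(len(state)):
        if (i & c) and not (i & t):
            j = i | t
            state[i], state[j] = state[j], state[i]
    return state

X = [[0, 1], [1, 0]]
state = [1, 0, 0, 0]      # |00>, with bit k of the index holding qubit k
apply_1q(state, X, 0)     # x  q0
apply_cx(state, 0, 1)     # cx q0,q1
print(state)              # [0, 0, 0, 1]
```

A sparse gate such as CX reduces to a handful of swaps here, whereas its expanded matrix would occupy hardware for every element, including the zeros.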

### Backend: HLS hardcoded matrices sequence

#### Activity 3

This backend is similar to the BM hardcoded matrices sequence backend, but it uses the HLS toolchain to create the hardware instead of the BM toolchain.

The matrices are hardcoded in the HLS (C++) code. The HLS pragmas are used to create the hardware. The HLS code is then compiled with the Vitis HLS toolchain.

#### Pros:

- There is no processor abstraction, so the hardware is lighter than the BM hardcoded matrices sequence backend.

#### Cons:

- The circuit is fixed: to use a different circuit, the hardware has to be re-synthesized.
- Without the processor abstraction, the hardware is less flexible. Classical/quantum hybrid computing is not possible.

Applications

Alongside the FPGA hardware, bmqsim can create the end application used to simulate quantum circuits.

Three types of applications are available:

- Jupyter Notebook using the PYNQ framework
- Standalone C application using pynq-api
- C++/OpenCL application

The applications are tailored to the specific board, circuit and backend used.



Example from qiskit-symb - https://github.com/SimoneGasperini/qiskit-symb.git

Flexpy (FPGA Logic from EXpressions)

We created a project called flexpy to convert symbolic mathematical expressions into FPGA logic.

The symbolic expressions are created using the sympy library. flexpy can parse the expression and create either BASM (the BondMachine assembly language) or C/C++ code with HLS pragmas.

The generated code can be used to create a hardware accelerator for the given expression using the BondMachine or HLS toolchain.

The approach is general: it can be used not only for quantum computing, but also for any other application that can be expressed as a symbolic expression.

flexpy repository: https://github.com/BondMachineHQ/flexpy.git
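As an illustration of the starting point of this workflow, the sketch below builds a symbolic expression with sympy and prints it as a C expression. It uses sympy's own C code printer (`sympy.ccode`) as a stand-in and does not use or reproduce the flexpy API. The chosen expression, the amplitude of |0> after an RY rotation of |0>, is only an example.

```python
import math
import sympy as sp

theta = sp.symbols("theta")

# Amplitude of |0> after applying RY(theta) to |0>.
expr = sp.cos(theta / 2)

# sympy's built-in C printer, used here in place of a real code generator.
c_code = sp.ccode(expr)
print(c_code)

# Numeric reference for the same expression, useful to check generated code.
f = sp.lambdify(theta, expr, "math")
print(f(math.pi / 3), math.cos(math.pi / 6))
```

The same symbolic expression can serve both as the source for code generation and as the reference against which the generated implementation is checked.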




Activity 4

Some mathematical operations (for example the cosine) are frequently used in quantum computing.

These operations can either be implemented in hardware (HDL) or in software (in assembly in the BM case). Whatever the case, implementing these operations is a complex task.

flexpy can be used to generate test units to check the correctness of the implementation of these operations.

Moreover, the test units can also be used to check the errors introduced by using reduced precision data types and operations.

flexpytester repository: https://github.com/BondMachineHQ/flexpytester.git

Symbolic tests repository: https://github.com/BondMachineHQ/bmsymtests.git
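The kind of check described above can be sketched in plain Python: a cosine implemented as a truncated Taylor series is compared against `math.cos`, once in double precision and once with every intermediate result rounded to half precision. This is an illustration only and is unrelated to the actual flexpytester code. The series length, the sampling interval and the helper names are assumptions.

```python
import math
import struct

def to_f16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def cos_taylor(x: float, terms: int, rnd=lambda v: v) -> float:
    """Truncated Taylor series of cos(x); `rnd` is applied after each operation."""
    x2 = rnd(x * x)
    term, total = 1.0, 1.0
    for k in range(1, terms):
        # term_k = -term_(k-1) * x^2 / ((2k-1) * 2k)
        term = rnd(-term * x2 / ((2 * k - 1) * (2 * k)))
        total = rnd(total + term)
    return total

def max_error(rnd, terms=6, samples=101):
    """Largest absolute error against math.cos on [-pi/2, pi/2]."""
    xs = [-math.pi / 2 + math.pi * i / (samples - 1) for i in range(samples)]
    return max(abs(cos_taylor(x, terms, rnd) - math.cos(x)) for x in xs)

print("float64 max error:", max_error(lambda v: v))
print("float16 max error:", max_error(to_f16))
```

The first figure isolates the truncation error of the algorithm, while the difference between the two figures shows the additional error introduced by the reduced precision data type.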

Conclusions, Ongoing Work and Future Work

We enabled the simulation of quantum circuits using FPGAs.

### Ongoing work:

- The inclusion of parametric quantum circuits in the bmqsim framework.
- The development of all the instructions needed to simulate quantum circuits in the flexpy framework.
- The development of an HLS backend for the flexpy framework.

### Goals:

- Run useful quantum circuits on FPGAs and apply all the techniques we developed to test and optimize the computation in terms of latency, throughput, power consumption and numerical precision.
- Use the HPC bubbles with FPGA accelerators to run quantum circuits, among other workloads, on multi-FPGA systems.