Hello everyone,
First of all, thank you so much to everyone who has contributed to this amazing project. I have learnt a lot from it. Thanks for making it available and providing docs and tutorials!
I have some questions/confusions and I was hoping someone could help point me in the right direction. I will split them into two parts:
I need help understanding the different available paths going forward after building the project and exporting the IP.
I need some help understanding how the model implemented in the paper: "Fast convolutional neural networks on FPGAs with hls4ml" was actually synthesised into a pynq-z2 FPGA with the mentioned resource utilisation.
Please feel free to just reply to a tiny bit of this if you know the answer/reason and can help me understand!
Some context:
First, I have been able to produce an IP for the model I have. I will attach the model plot below. I am using the Nexys A7 board, part: xc7a100t-csg324-1
I trained a classification model using qkeras, setting the kernel widths to 6 bits and used no bias. Activations are also limited to 6 bits (relu). I don't use batchnorm layers in the quantised model nor do I use a softmax output layer.
In the configuration, I set max_precision to 'fixed<16,5,TRN,WRAP,0>', the default reuse factor to 50, and the strategy to Resource. I edit the reuse factors for the conv and dense layers to different values since I get warnings about invalid reuse factors for these layers; I use (78, 48, 48, 40, 48, 32, 32) for the consecutive layers. The choice of reuse factors is fairly arbitrary, but the network fits!
I use hls_model.build(export=True, synth=True, vsynth=True, cosim=True) to build and export the IP.
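To make the setup concrete, here is a rough sketch of the configuration described above, written as the plain nested dict that hls4ml consumes. The layer names here are placeholders (not my model's actual layer names), and the exact key spelling is from memory, so please treat this as illustrative rather than copy-paste ready:

```python
# Sketch of my hls4ml configuration as a nested dict (assumption: this
# mirrors what config_from_keras_model returns; layer names are placeholders).
config = {
    'Model': {
        'Precision': 'fixed<16,5,TRN,WRAP,0>',
        'ReuseFactor': 50,
        'Strategy': 'Resource',
    },
    'LayerName': {},
}

# Per-layer reuse factor overrides for the conv/dense layers, in order,
# chosen to silence the "invalid reuse factor" warnings.
reuse_factors = [78, 48, 48, 40, 48, 32, 32]
layer_names = [f'layer_{i}' for i in range(len(reuse_factors))]  # placeholders

for name, rf in zip(layer_names, reuse_factors):
    config['LayerName'][name] = {'ReuseFactor': rf}
```

After this, the config dict goes into the model conversion step before the build call above.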
The first bit
One thing I am not entirely sure about is the difference between the resource utilisation estimate from C synthesis and the Vivado synthesis report. I understand that the actual resources used may be lower than the estimates, but the C synthesis estimate reports 170 DSPs while the Vivado synthesis report shows 0 DSPs actually used. I am slightly confused and wondering whether this is normal behaviour.
Additionally, I have realised that the design produced by the build command uses AXI communication. Samples are also not fed into the model in a single cycle; it takes a number of clock cycles to feed in one sample (the input shape is 64x13, and it takes at least 64 cycles to feed one sample into the network). Is that normal for these model implementations, or does it have something to do with the shape of my inputs?
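For what it's worth, here is my current understanding of the timing, under the assumption (which may be wrong, and is exactly what I'd like confirmed) that the streaming interface packs all 13 channels of one time step into a single stream transfer:

```python
# Rough cycle estimate for feeding one 64x13 sample, assuming the stream
# packs all channels of one time step into one transfer (my assumption).
height, channels = 64, 13

values_per_transfer = channels     # one packed stream word = 13 channel values
transfers_per_sample = height      # one transfer per time step

print(transfers_per_sample)        # -> 64, matching the >= 64 cycles I observe
```

If that model is right, the minimum of 64 cycles comes purely from the input height, not from anything unusual in my design.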
Here is the project's module header:
I also want to test the RTL model in simulation and deploy it onto the hardware (the Nexys A7 board I have!). I do not wish to integrate the model within a bigger design, and I am really lost on what my next step should be. From some research, one solution I found is using MicroBlaze as a soft core acting as the AXI controller to handle the data transfer. I am currently considering this as a potential solution, but are there other routes that would allow me to communicate with an external PC and handle the data transfer from there? I would really appreciate it if someone could help me understand the different routes that could be taken from this point onwards; it would be super helpful in making a decision.
(Apologies if this question is not clear; as I said, I am really lost on what the next step should be, which is why it may come across as vague.)
The second bit
In the paper "Fast convolutional neural networks on FPGAs with hls4ml" (doi: 10.1088/2632-2153/ac0ea1), the authors mention the following:
Although particle physics experiments mostly use large FPGAs, the hls4ml library can be readily used for smaller FPGAs, like those found on system-on-chip (SoC) or internet-of-things (IoT) devices, through increasing the reuse factor. To demonstrate this, we synthesize and deploy the smallest model that retains the original model accuracy, QP 7-bit, onto a low-cost TUL PYNQ-Z2 development board, equipped with a Xilinx Zynq XC7Z020 SoC (FPGA part number xc7z020clg400-1). This FPGA is significantly smaller than the Xilinx Virtex UltraScale+ VU9P, and consists of 13,300 logic slices, each with four 6-input LUTs and 8 FFs, 630 kB of BRAM, and 220 DSP slices. As expected, a large reuse factor is needed in order to fit the QP 7-bit model onto the Zynq XC7Z020. For a clock frequency of 100MHz, the resulting inference latency is 171 µs and up to 2,831 image classifications per second. This implementation uses a total of 91% of the LUTs, 97% of the DSPs, 33% of the FFs, and 44% of the BRAM. A summary is provided in Table 4. This demonstrates the flexibility of hls4ml to accommodate SoC/IoT use cases, which can demand smaller FPGAs and tolerate millisecond latencies.
Is this perhaps using the Resource strategy or the Latency strategy? I tried experimenting with different reuse factors and strategies, but I am not sure how the paper's implementation managed to use 97% of the DSPs. Does this perhaps involve setting larger bit widths? The paper mentions that below a specific bit width the HLS tool performs multiplications using LUTs, so how did they get pretty much all the DSPs working?
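To illustrate what I mean: my (possibly wrong) mental model is that the HLS tool only infers a DSP48 when the multiplication's operand widths cross some threshold, and below that it builds the multiplier out of LUTs. The cutoff value here is hypothetical, purely for illustration:

```python
# Toy model (my assumption, NOT the actual Vivado HLS rule) of when a
# multiplication is mapped to a DSP slice instead of LUT fabric.
DSP_WIDTH_THRESHOLD = 10  # hypothetical cutoff in bits

def maps_to_dsp(weight_bits: int, activation_bits: int) -> bool:
    """Return True if the product is (hypothetically) placed on a DSP."""
    return max(weight_bits, activation_bits) >= DSP_WIDTH_THRESHOLD

print(maps_to_dsp(6, 6))    # prints False: a 6-bit model like mine -> LUTs
print(maps_to_dsp(16, 16))  # prints True: wider products -> DSPs
```

Under this mental model, the paper's 7-bit QP model would also fall below the threshold, which is exactly why the 97% DSP figure confuses me.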
Is there a code repo that includes the code used to report results in any of the papers that are affiliated with hls4ml? That would be helpful in understanding the papers' implementations more!
Thank you so much in advance for any help you can provide. I tried to give as much useful context as possible, but I may have missed important bits of information that could help people help me. If so, kindly let me know.
Nora
model plot:
