Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AXI returns unexpected 0s #1576

Open
zzy666666zzy opened this issue Jun 19, 2023 · 4 comments
Open

AXI returns unexpected 0s #1576

zzy666666zzy opened this issue Jun 19, 2023 · 4 comments
Labels
C: FPGA Changes for the FPGA backend Type: Bug Bug in the implementation

Comments

@zzy666666zzy
Copy link

Hi,

Following the previous issue #1470, we have managed to run a 4x4 GEMM on a Standalone (bared-metal) platform.

When we tried bigger matrix sizes (8x8, 16x16), I printed the output matrix from the ARM sides, the IP core only gave us 16 correct elements (2 rows of correct elements for 8x8, 1 row of correct elements for 16x16). The rest elements are all zeros:
1687208165571
The correct output should be
1687208241004

When I delve into the generated Toplevel AXI verilog code, I found some clues.

In our case, 'm1' outputs the resultant matrix. We go to the module Memory_controller_axi_1, and look at this snippet

reg [6:0] send_addr_offset;
...
assign WDATA = {{15{32'b0}}, bram_read_data} << send_addr_offset * 32;

bram_read_data is the result from the core GEMM logic (ext_mem0__write_data) and always produces correct results.

BUT, when we try bigger matrix sizes and send_addr_offset equals is bigger than 16, WDATA is always zero because it left shifts (>16)x32 bit which is over 512-b. Previously, we could get the correct result from 4x4 because we luckily saturated the 512-b with 16 resultant elements.

I am sure the core logic without the AXI wrapper can output correct results, we have tested 4x4 to 64x64. We have changed discover-external:default=xxx according to our matrix sizes. From the generated AXI verilog code we can tell that discover-external:default=xxx affects send_addr_offset and thereby affects the number of shifts. I attached the MLIR and the AXI code here for convenient reference.
gemm8x8.mlir.txt
gemm8x8_wrapper.sv.txt

Also the doubled under-score issue in main kernel_inst still exists, signal inconsistency leads to synthesis failure.

Appreciate it if there is any solution for returning 0s and the doubled underscore issue.

Many thanks.

@zzy666666zzy
Copy link
Author

This issue can be addressed by adding the following Verilog to the destination master port in AXI top-level file

reg [3:0] shift_var;
always @(posedge ACLK) begin
if(memory_mode_state == 3) begin
	if(BVALID & BREADY) shift_var <= shift_var + 'd1;
	//else if (BVALID & BREADY & shift_var=='d15) shift_var<='d0;
	else shift_var <= shift_var;
end 
else shift_var <= 0;
end

And replace send_addr_offset with shift_var in

    assign WDATA = {{15{32'b0}}, bram_read_data} << send_addr_offset* 32;
    assign WSTRB = {{15{4'h0}}, 4'hF} << send_addr_offset* 4;

Might this bit modification can be realized in Calyx source code.

@rachitnigam
Copy link
Contributor

Reopening this! @zzy666666zzy we were all traveling last week so can look into this a bit more now that we’re back.

@rachitnigam rachitnigam added C: FPGA Changes for the FPGA backend Type: Bug Bug in the implementation labels Jun 27, 2023
@zzy666666zzy
Copy link
Author

Hi @rachitnigam
circt_calyx4x4.sh.txt
gemm4x4.mlir.txt
gemm8x8.mlir.txt

Just realized another issue. In my matrix multiplications example (please see the MLIR and my script attached), arg0 is the output argument (output ext_mem0), arg1 and arg2 are input matrix (ext_mem1_* and 'ext_mem2'_* in Verilog file). But in the generated AXI file, the order of these three ports are randomly swapped when I increase the matrix size.
For example: inside Control_axi module, of 4x4 GEMM, the order of the three ports is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end

But for 16x16 the order is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end

The sequence of these ports is important because for standalone(baremetal) platform we need to configure the base address of inputs and output ports manually through s_axi_control.

Is there any approach to avoid the shuffled sequence of addr_ext_mem*_?

Thanks a lot.

@zzy666666zzy
Copy link
Author

Hi @rachitnigam circt_calyx4x4.sh.txt gemm4x4.mlir.txt gemm8x8.mlir.txt

Just realized another issue. In my matrix multiplications example (please see the MLIR and my script attached), arg0 is the output argument (output ext_mem0), arg1 and arg2 are input matrix (ext_mem1_* and 'ext_mem2'_* in Verilog file). But in the generated AXI file, the order of these three ports are randomly swapped when I increase the matrix size. For example: inside Control_axi module, of 4x4 GEMM, the order of the three ports is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end

But for 16x16 the order is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end

The sequence of these ports is important because for standalone(baremetal) platform we need to configure the base address of inputs and output ports manually through s_axi_control.

Is there any approach to avoid the shuffled sequence of addr_ext_mem*_?

Thanks a lot.

I updated the calyx repo, the shuffled sequence of ext_mem* seems being solved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: FPGA Changes for the FPGA backend Type: Bug Bug in the implementation
Projects
None yet
Development

No branches or pull requests

2 participants