L1 FIFO Communication by diaconuccalin · Pull Request #20 · pulp-platform/magia-sdk

diaconuccalin · 2026-04-29T14:00:24Z

No description provided.

- make build now writes targets/*/include/addr_map/tile_config.h (gitignored) instead of patching tile_addr_map.h in-place; tile_addr_map.h now #includes it
- Added fp16_to_f64() bit-manipulation helper to attention_utils.h for toolchain-agnostic fp16 printing (no soft-float helpers required)
- New test_gemm: 4-GEMM chain with task-level parallelism across tiles (Phase 1: tile 0/1 run GEMM1/GEMM2 in parallel; Phase 2: tile 2 runs GEMM3; Phase 3: tile 3 runs GEMM4); includes gen_golden.py and make gemm-* targets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
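The helper's source is not shown on this page, so the following is only a minimal sketch of how an fp16-to-double widening can be done with integer operations alone. The name `fp16_to_f64_sketch` is hypothetical, and the code assumes the input is a raw IEEE 754 binary16 bit pattern and that `double` is IEEE 754 binary64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The bit-cast below assumes a 64-bit double. */
static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Hypothetical sketch: widen raw binary16 bits to a double using integer
 * operations only, so the compiler needs no _Float16 or soft-float support. */
static double fp16_to_f64_sketch(uint16_t h)
{
    uint64_t sign = (uint64_t)(h >> 15) << 63;
    uint64_t exp  = (h >> 10) & 0x1Fu;
    uint64_t man  = h & 0x3FFu;
    uint64_t bits;

    if (exp == 0x1Fu) {
        /* Inf or NaN: all-ones exponent, mantissa moved to the top bits. */
        bits = sign | (0x7FFull << 52) | (man << 42);
    } else if (exp != 0) {
        /* Normal number: rebias the exponent from 15 to 1023. */
        bits = sign | ((exp + (1023u - 15u)) << 52) | (man << 42);
    } else if (man == 0) {
        /* Signed zero. */
        bits = sign;
    } else {
        /* Subnormal: value is man * 2^-24; shift until the implicit
         * leading one appears, adjusting the exponent as we go. */
        int e = -14;
        while ((man & 0x400u) == 0) {
            man <<= 1;
            e--;
        }
        man &= 0x3FFu;
        bits = sign | ((uint64_t)(e + 1023) << 52) | (man << 42);
    }

    double d;
    memcpy(&d, &bits, sizeof d); /* well-defined bit-cast */
    return d;
}
```

Because every binary16 value is exactly representable as a double, the widening is lossless and the result can be handed to any `%f`-style formatter the toolchain already provides.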

- Create gemm_utils.h with fp16_to_millis and fp16_to_f64
- GEMM test includes gemm_utils.h directly
- Comment out flatatt/flatatt_no_data_tiling in CMakeLists.txt
- Remove all compiler-conditional sed toggling from Makefile build target
- Comment out attention_utils.h include in tile.h

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Restructure the flatatt test to use an embedded test header instead of Python-generated golden data. Remove flatatt_no_data_tiling variant and the flatatt CI workflow. Update attention_utils.h (v1/v2) and eu_isa_utils.h with revised utility functions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Switch GCC_MULTILIB toolchain to riscv64-unknown-elf from PATH instead of hardcoded $HOME/riscv/bin paths. Use GCC_PULP compiler in GEMM CI. Add configurable seed parameter to Makefile and simplify .gitignore patterns for generated test inputs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… READMEs
- Rename tests/magia/mesh/gemm/ → gemm_comm/, via_l2/ → via_l2_naive/
- Move gen_golden.py to gemm_comm root; update Makefile and usage strings
- Move gemm_utils.h from targets/magia_v2/include/utils/ to gemm_comm/; update include path in CMakeLists and test.c
- Add README.md to gemm_comm/ (with test diagram) and via_l2_naive/ (phase breakdown)
- Update .gitignore pattern and mesh CMakeLists.txt accordingly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Introduces a new test variant where GEMM4 tiles prefetch M5 into L1 during Phase 1 (overlapping with GEMM1/GEMM2 computation), eliminating the M5 DMA in Phase 3. gen_golden.py now writes test.h to both the naive and interlaced include directories. Adds gemm-interlaced-test and gemm-interlaced-ci Makefile targets. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Moves tests into a shared via_L2/ subdirectory, updates CMakeLists include paths (../ → ../../ to reach gemm_comm/), gen_golden.py output paths, .gitignore glob, and adds the interlaced job to CI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Same 4-GEMM chain as via_L2/naive, but intermediate results (R1, R2, R3) are DMA'd directly between producing and consuming tiles' L1 memories instead of staging through shared L2 buffers. Only M1-M5 inputs come from L2; the final output O is written back to L2 for validation. Scatter logic: each producing tile iterates over potential consuming tiles, computes the row overlap, and issues idma_memcpy_1d(dir=1) targeting the remote tile's L1 address derived from get_l1_base(hartid). Also updates gen_golden.py to write test.h to via_L1/naive/include/, adds gemm-l1-naive-{run,test,ci} Makefile targets, and adds the gemm-l1-naive CI job to gemm-ci.yml. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
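The row-overlap step of the scatter logic can be illustrated as follows. This is a sketch under assumptions: the even row partitioning in `owned_rows` and both function names are hypothetical, and the actual transfer (the `idma_memcpy_1d` call and the `get_l1_base` address computation) is deliberately left out because its signature is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open row range [begin, end). */
typedef struct {
    uint32_t begin;
    uint32_t end;
} row_range_t;

/* Assumed partitioning: `rows` split as evenly as possible across `tiles`,
 * with the first (rows % tiles) tiles taking one extra row. */
static row_range_t owned_rows(uint32_t rows, uint32_t tiles, uint32_t idx)
{
    assert(tiles > 0 && idx < tiles);
    uint32_t base  = rows / tiles;
    uint32_t rem   = rows % tiles;
    uint32_t begin = idx * base + (idx < rem ? idx : rem);
    uint32_t len   = base + (idx < rem ? 1u : 0u);
    return (row_range_t){ begin, begin + len };
}

/* Intersection of two ranges; begin == end means "nothing to send". */
static row_range_t row_overlap(row_range_t a, row_range_t b)
{
    uint32_t lo = a.begin > b.begin ? a.begin : b.begin;
    uint32_t hi = a.end < b.end ? a.end : b.end;
    if (hi < lo) {
        hi = lo; /* disjoint ranges collapse to an empty range */
    }
    return (row_range_t){ lo, hi };
}
```

A producing tile would loop over the consuming tiles, call `row_overlap` on its own range and each consumer's range, and issue one DMA per non-empty result.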

Combines direct L1-to-L1 intermediate result scatter (from via_L1/naive) with interlaced M5 prefetch scheduling (from via_L2/interlaced): GEMM4 tiles prefetch M5 during Phase 1 behind a local barrier, so Phase 3 needs no L2 loads at all. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Introduces l1_fifo.h — a per-tile FIFO mailbox in L1 memory, lock-protected via amoswap.w, supporting cross-tile message passing without touching L2. Includes it in tile.h as a standard utility. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
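For readers unfamiliar with swap-based locking, here is a minimal sketch of the idea in portable C11. It is not the code from l1_fifo.h: the type and function names are hypothetical, and it relies on the assumption that a 32-bit `atomic_exchange` is typically lowered to `amoswap.w` on RISC-V targets with the A extension.

```c
#include <stdatomic.h>

/* Hypothetical lock word: 0 = free, 1 = held. */
typedef struct {
    atomic_uint locked;
} l1_lock_t;

/* Swap in 1 and look at what was there before; seeing 0 means we now
 * own the lock. Returns 1 on success, 0 if it was already held. */
static int lock_try_acquire(l1_lock_t *l)
{
    return atomic_exchange_explicit(&l->locked, 1u, memory_order_acquire) == 0u;
}

/* Spin until the swap observes a free lock. */
static void lock_acquire(l1_lock_t *l)
{
    while (!lock_try_acquire(l)) {
        /* busy-wait */
    }
}

/* Release ordering makes writes done under the lock visible to the
 * next owner before the lock appears free. */
static void lock_release(l1_lock_t *l)
{
    atomic_store_explicit(&l->locked, 0u, memory_order_release);
}
```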

Replace spin-lock linked-list FIFO with a lock-free slot-based design: each (matrix_id, row_index) pair has exactly one producer writing to a pre-assigned slot, so no lock is needed. A fence w,w before setting the valid flag ensures the consumer sees all payload bytes before the slot becomes visible. Add gemm_comm/fifo test variant and wire it into the CMakeLists and gen_golden.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
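The publish-after-fence pattern described above can be sketched in portable C11. The slot layout, `SLOT_PAYLOAD_WORDS`, and the function names are assumptions made for illustration; the release fence stands in for the `fence w,w` named in the commit (it is somewhat stronger, since it also orders earlier reads).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SLOT_PAYLOAD_WORDS 8u /* hypothetical slot capacity */

/* One pre-assigned slot: a single producer writes it, a single consumer
 * reads it, so the valid flag is the only synchronization needed. */
typedef struct {
    uint32_t    payload[SLOT_PAYLOAD_WORDS];
    atomic_uint valid;
} fifo_slot_t;

/* Producer: write the payload, fence, then raise the valid flag. */
static void slot_publish(fifo_slot_t *s, const uint32_t *src, uint32_t n)
{
    assert(n <= SLOT_PAYLOAD_WORDS);
    for (uint32_t i = 0; i < n; i++) {
        s->payload[i] = src[i];
    }
    /* Payload stores must be visible before the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->valid, 1u, memory_order_relaxed);
}

/* Consumer: returns 1 and copies the payload if the slot is valid,
 * otherwise returns 0 without touching dst. */
static int slot_try_consume(fifo_slot_t *s, uint32_t *dst, uint32_t n)
{
    assert(n <= SLOT_PAYLOAD_WORDS);
    if (atomic_load_explicit(&s->valid, memory_order_acquire) == 0u) {
        return 0;
    }
    for (uint32_t i = 0; i < n; i++) {
        dst[i] = s->payload[i];
    }
    /* Hand the slot back to the producer. */
    atomic_store_explicit(&s->valid, 0u, memory_order_release);
    return 1;
}
```

The design works without a lock only because each slot has exactly one writer; two producers sharing a slot would need additional arbitration.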

Replace volatile word-copy fifo_push() calls with fifo_push_dma(), which transfers the payload via iDMA and then publishes metadata with a fence. Add fifo_slot_publish() to l1_fifo.h to support this split-phase pattern. Switch from fixed FIFO_BATCH_ROWS to FIFO_BATCH_FRAC (fraction of the tile's owned rows per push, default 20%) so batch sizes scale with tile count. Remove debug printf statements throughout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
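To illustrate why a fractional batch size scales with tile count, the sketch below derives rows per push from the number of rows a tile owns. How FIFO_BATCH_FRAC is actually encoded and rounded is not shown here; the integer-percent encoding, the round-up choice, and the function names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of the 20% default as an integer percentage. */
#define FIFO_BATCH_FRAC_PCT 20u

/* Rows sent per push: the given percentage of the tile's owned rows,
 * rounded up so that a tile owning at least one row always makes progress. */
static uint32_t batch_rows(uint32_t owned, uint32_t frac_pct)
{
    assert(frac_pct >= 1u && frac_pct <= 100u);
    return (owned * frac_pct + 99u) / 100u;
}

/* Number of pushes needed to send all owned rows in batches of that size. */
static uint32_t pushes_needed(uint32_t owned, uint32_t frac_pct)
{
    uint32_t b = batch_rows(owned, frac_pct);
    return b == 0u ? 0u : (owned + b - 1u) / b;
}
```

With this rounding, a tile owning 64 rows pushes 13 rows at a time, while a tile owning 16 rows (same matrix, four times as many tiles) pushes 4 rows at a time, so the number of pushes per tile stays roughly constant.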

Replace wait-all-R2-then-compute approach with data-driven partial accumulation. R3 contributions are now computed as soon as any (R1, R2) piece pair is available, using K-dimension decomposition:

R3 += R1[:, k1:k2] × R2[k1:k2, :]

Key changes:
- Add push_r3_to_gemm4() helper to eliminate duplicated R3→GEMM4 logic
- Add gemm3_partial_accum() helper for one partial product computation
- Replace single r2_rows_received counter with per-row r2_received[] tracking
- Track K-progress per R1 batch with r3_k_done[], avoiding double-counting
- Zero R3 once upfront; RedMulE does Y = X*W + Y (accumulate in-place)
- R1 handler: scan for contiguous R2 groups and accumulate each
- R2 handler: immediately accumulate against all present R1 batches

This reduces latency by overlapping computation with communication. Correctness verified: each (R1, R2) pair processed exactly once.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
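The K-dimension decomposition can be checked on the host with a small reference routine. This sketch is not gemm3_partial_accum(): it uses `float` instead of fp16, plain loops instead of RedMulE, and a hypothetical name and signature. It shows that accumulating disjoint K-slices in any order gives the full product, provided R3 starts at zero and each slice is applied exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Reference partial accumulation on row-major matrices:
 *   R3 (m x n) += R1[:, k1:k2] (m x (k2-k1)) * R2[k1:k2, :] ((k2-k1) x n)
 * where R1 is m x k and R2 is k x n. */
static void partial_accum_ref(float *r3, const float *r1, const float *r2,
                              size_t m, size_t k, size_t n,
                              size_t k1, size_t k2)
{
    assert(k1 <= k2 && k2 <= k);
    for (size_t i = 0; i < m; i++) {
        for (size_t kk = k1; kk < k2; kk++) {
            for (size_t j = 0; j < n; j++) {
                r3[i * n + j] += r1[i * k + kk] * r2[kk * n + j];
            }
        }
    }
}
```

For example, with a 2×3 R1 and a 3×2 R2, calling the routine for the slice [1, 3) and then for [0, 1) produces the same R3 as a single call over [0, 3).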

Switch all CI/Makefile GEMM targets to use bash scripts directly with TEST_NAME env var, add gemm-fifo job to GitHub Actions, and update CMakeLists.txt to build all five gemm_comm/ variants. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The gemm_comm tests require gen_golden.py-generated test.h files that are not present in CI. This flag lets CI opt out without removing the subdirectory entries from CMakeLists.txt. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Instruments barrier waits, input/output DMA, RedMulE compute, FIFO push, and consumer spin-wait with PERF_DELTA accumulators; prints a summary line for one representative tile per GEMM group. Also adds the PERF_DELTA macro to performance_utils.h, guards memset against loop-distribution miscompilation, and fixes the gen_golden.py path in via_L2/naive/CMakeLists.txt. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
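The definition of PERF_DELTA is not shown on this page, so the macro below is only a guess at the general shape of a delta accumulator: read a counter, run a statement, add the difference to a per-phase total. The macro name `PERF_DELTA_SKETCH`, its two-argument form, and the simulated counter are all assumptions; on the target the counter would be a hardware cycle read.

```c
#include <stdint.h>

/* Stand-in for a hardware cycle counter so the sketch runs on a host. */
static uint64_t sim_cycles;

static uint64_t read_cycles(void)
{
    return sim_cycles;
}

/* Hypothetical accumulator: run `stmt` and add the elapsed cycles to `acc`. */
#define PERF_DELTA_SKETCH(acc, stmt)           \
    do {                                       \
        uint64_t perf_t0_ = read_cycles();     \
        stmt;                                  \
        (acc) += read_cycles() - perf_t0_;     \
    } while (0)

/* Simulated phase that "costs" n cycles. */
static void simulated_phase(uint64_t n)
{
    sim_cycles += n;
}
```

Keeping one accumulator per category (barrier wait, DMA, compute, FIFO push, spin-wait) lets repeated phases sum into a single figure that can be printed once at the end.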

Introduces cyc_preamble to measure GEMM1 setup overhead (tile-group detection, row-range calculation) separately from compute/DMA phases. Adds test_with_prints.c as an excluded-from-build debug copy; the CMakeLists filter keeps the normal build clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

diaconuccalin self-assigned this Apr 29, 2026

diaconuccalin added the enhancement label Apr 29, 2026

diaconuccalin and others added 28 commits May 6, 2026 13:57

FlatAttention test fixes. Some compiler fixes that may need to be rev…

75ccfee

…erted. Cleaned-up gitignore

Apply remaining fixes to flatatt

d0b5bac

Made the attention test generation dynamic (the user can now specify …

f36888d

…the desired dimensions for the test). Removed the "timesteps" mechanism for data tiling. Added CMake pipeline for running the attention test. Added CI test pipeline for attention test

Added flat attention test without data tiling. Added clang format fil…

05981cb

…e and applied formatting to modified files. Added DMA documentation to idma.h for improved VS Code hinting

Reverted Claude's erroneous toolchain adaptations. Switched to float16al…

48e94c5

…t, instead of _Float16

Updated gvsoc_init make flow. Updated testing pipeline

b363771

Reorganize GEMM test into via_l2 subdirectory

b7d4bd3

Move GEMM test files under gemm/via_l2/ to prepare for multiple communication variants (L2, L1, etc.), and update references in Makefile and CMakeLists.txt accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add interlaced GEMM test step to CI workflow

d1b8551

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Small fixes

f7c8ebd

Post-rebase cleanup

46ab793

CI pipeline fix

d64ead3

Added double buffering to FIFO test GEMM 1 and GEMM 2

9e81a78

Format fix

f028b48

Temporary quickfix for profiling. Test renaming

dc3da0a

diaconuccalin force-pushed the cd/L1_FIFO_communication branch from 13a8ca1 to dc3da0a on May 6, 2026 12:01

diaconuccalin and others added 2 commits May 7, 2026 14:49

diaconuccalin wants to merge 31 commits into main from cd/L1_FIFO_communication