L1 FIFO Communication #20

Draft
diaconuccalin wants to merge 31 commits into main from cd/L1_FIFO_communication
@diaconuccalin

Contributor

No description provided.

@diaconuccalin diaconuccalin self-assigned this Apr 29, 2026
@diaconuccalin diaconuccalin added the enhancement New feature or request label Apr 29, 2026
diaconuccalin and others added 28 commits May 6, 2026 13:57
…the desired dimensions for the test). Removed the "timesteps" mechanism for data tiling. Added CMake pipeline for running the attention test. Added CI test pipeline for attention test
…e and applied formatting to modified files. Added DMA documentation to idma.h for improved VS Code hinting
- make build now writes targets/*/include/addr_map/tile_config.h (gitignored)
  instead of patching tile_addr_map.h in-place; tile_addr_map.h now #includes it
- Added fp16_to_f64() bit-manipulation helper to attention_utils.h for
  toolchain-agnostic fp16 printing (no soft-float helpers required)
- New test_gemm: 4-GEMM chain with task-level parallelism across tiles
  (Phase 1: tile 0/1 run GEMM1/GEMM2 in parallel; Phase 2: tile 2 runs GEMM3;
  Phase 3: tile 3 runs GEMM4); includes gen_golden.py and make gemm-* targets
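
For readers unfamiliar with the bit-manipulation approach mentioned for fp16_to_f64(), the following is a minimal sketch of how an IEEE 754 half-precision value can be widened to a double using integer operations only. The name `fp16_to_f64_sketch` is hypothetical and the helper in attention_utils.h may be written differently; the sketch only assumes the standard binary16 and binary64 layouts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical re-implementation for illustration; not the PR's code.
 * Widens an IEEE 754 binary16 bit pattern to a binary64 value using
 * integer operations only, so no soft-float conversion routine is needed. */
static double fp16_to_f64_sketch(uint16_t h)
{
    uint64_t sign = (uint64_t)(h >> 15) << 63;
    uint32_t exp  = (h >> 10) & 0x1Fu;   /* 5-bit biased exponent (bias 15) */
    uint64_t man  = h & 0x3FFu;          /* 10-bit fraction */
    uint64_t bits;
    double   d;

    assert(sizeof(double) == sizeof(uint64_t));

    if (exp == 0x1Fu) {
        /* Infinity or NaN: all-ones exponent, fraction carried over. */
        bits = sign | (0x7FFull << 52) | (man << 42);
    } else if (exp != 0u) {
        /* Normal number: rebias the exponent from 15 to 1023. */
        bits = sign | ((uint64_t)(exp + 1008u) << 52) | (man << 42);
    } else if (man == 0u) {
        /* Signed zero. */
        bits = sign;
    } else {
        /* Subnormal: shift until the implicit leading one appears. */
        uint32_t shift = 0;
        while ((man & 0x400u) == 0u) {
            man <<= 1;
            shift++;
        }
        bits = sign | ((uint64_t)(1009u - shift) << 52) | ((man & 0x3FFu) << 42);
    }

    memcpy(&d, &bits, sizeof d);         /* reinterpret the bits as a double */
    return d;
}
```

Every binary16 value is exactly representable in binary64, so no rounding step is needed in this direction.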

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move GEMM test files under gemm/via_l2/ to prepare for multiple
communication variants (L2, L1, etc.), and update references in
Makefile and CMakeLists.txt accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Create gemm_utils.h with fp16_to_millis and fp16_to_f64
- GEMM test includes gemm_utils.h directly
- Comment out flatatt/flatatt_no_data_tiling in CMakeLists.txt
- Remove all compiler-conditional sed toggling from Makefile build target
- Comment out attention_utils.h include in tile.h

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restructure the flatatt test to use an embedded test header instead of
Python-generated golden data. Remove flatatt_no_data_tiling variant and
the flatatt CI workflow. Update attention_utils.h (v1/v2) and
eu_isa_utils.h with revised utility functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch GCC_MULTILIB toolchain to riscv64-unknown-elf from PATH instead
of hardcoded $HOME/riscv/bin paths. Use GCC_PULP compiler in GEMM CI.
Add configurable seed parameter to Makefile and simplify .gitignore
patterns for generated test inputs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… READMEs

- Rename tests/magia/mesh/gemm/ → gemm_comm/, via_l2/ → via_l2_naive/
- Move gen_golden.py to gemm_comm root; update Makefile and usage strings
- Move gemm_utils.h from targets/magia_v2/include/utils/ to gemm_comm/; update include path in CMakeLists and test.c
- Add README.md to gemm_comm/ (with test diagram) and via_l2_naive/ (phase breakdown)
- Update .gitignore pattern and mesh CMakeLists.txt accordingly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces a new test variant where GEMM4 tiles prefetch M5 into L1
during Phase 1 (overlapping with GEMM1/GEMM2 computation), eliminating
the M5 DMA in Phase 3. gen_golden.py now writes test.h to both the
naive and interlaced include directories. Adds gemm-interlaced-test
and gemm-interlaced-ci Makefile targets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Moves tests into a shared via_L2/ subdirectory, updates CMakeLists
include paths (../ → ../../ to reach gemm_comm/), gen_golden.py
output paths, .gitignore glob, and adds the interlaced job to CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same 4-GEMM chain as via_L2/naive, but intermediate results (R1, R2, R3)
are DMA'd directly between producing and consuming tiles' L1 memories
instead of staging through shared L2 buffers. Only M1-M5 inputs come from
L2; the final output O is written back to L2 for validation.

Scatter logic: each producing tile iterates over potential consuming tiles,
computes the row overlap, and issues idma_memcpy_1d(dir=1) targeting the
remote tile's L1 address derived from get_l1_base(hartid).
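
The scatter logic described above can be sketched on the host as follows. This is an illustration under stated assumptions, not the test's code: the even block partition in `tile_rows`, the `row_range_t` type, and `scatter_rows` are hypothetical, and a plain `memcpy` stands in for the idma_memcpy_1d(dir=1) call to the remote tile's L1 address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t lo, hi; } row_range_t;   /* half-open [lo, hi) */

/* Assumed ownership scheme: n_rows split into near-equal contiguous blocks. */
static row_range_t tile_rows(uint32_t tile, uint32_t n_tiles, uint32_t n_rows)
{
    row_range_t r;
    assert(n_tiles > 0u && tile < n_tiles);
    r.lo = (tile * n_rows) / n_tiles;
    r.hi = ((tile + 1u) * n_rows) / n_tiles;
    return r;
}

/* Intersection of two row ranges; an empty result has lo == hi. */
static row_range_t row_overlap(row_range_t a, row_range_t b)
{
    row_range_t r;
    r.lo = a.lo > b.lo ? a.lo : b.lo;
    r.hi = a.hi < b.hi ? a.hi : b.hi;
    if (r.hi < r.lo) {
        r.hi = r.lo;
    }
    return r;
}

/* Copies the produced rows into every consumer buffer that needs them.
 * consumer_l1[c] models consumer c's L1 buffer, holding only its own rows.
 * Returns the number of transfers issued. */
static uint32_t scatter_rows(const uint16_t *src, row_range_t produced,
                             uint16_t *const consumer_l1[], uint32_t n_consumers,
                             uint32_t n_rows, uint32_t row_len)
{
    uint32_t transfers = 0;
    for (uint32_t c = 0; c < n_consumers; c++) {
        row_range_t need = tile_rows(c, n_consumers, n_rows);
        row_range_t ov   = row_overlap(produced, need);
        if (ov.lo == ov.hi) {
            continue;                      /* this consumer needs none of our rows */
        }
        /* Stand-in for the DMA write to the remote tile's L1. */
        memcpy(consumer_l1[c] + (size_t)(ov.lo - need.lo) * row_len,
               src + (size_t)(ov.lo - produced.lo) * row_len,
               (size_t)(ov.hi - ov.lo) * row_len * sizeof *src);
        transfers++;
    }
    return transfers;
}
```

With 8 rows, 2 producers, and 3 consumers, producer 0 owns rows [0, 4) and issues two transfers: rows [0, 2) to consumer 0 and rows [2, 4) to consumer 1.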

Also updates gen_golden.py to write test.h to via_L1/naive/include/,
adds gemm-l1-naive-{run,test,ci} Makefile targets, and adds the
gemm-l1-naive CI job to gemm-ci.yml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Combines direct L1-to-L1 intermediate result scatter (from via_L1/naive)
with interlaced M5 prefetch scheduling (from via_L2/interlaced): GEMM4
tiles prefetch M5 during Phase 1 behind a local barrier, so Phase 3
needs no L2 loads at all.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces l1_fifo.h — a per-tile FIFO mailbox in L1 memory, lock-protected
via amoswap.w, supporting cross-tile message passing without touching L2.
Includes it in tile.h as a standard utility.
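
As an illustration of the locking idea, here is a minimal sketch of a mailbox guarded by an atomic-swap spin lock, written with C11 atomics so it also builds on a host. All names (`mailbox_sketch_t`, `mbox_*`, `MBOX_DEPTH`) are hypothetical, and a fixed ring buffer is used for brevity, which is not necessarily the layout used in l1_fifo.h. On RISC-V targets with the A extension, compilers typically lower a 32-bit atomic exchange to an `amoswap.w` instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MBOX_DEPTH 4u                 /* hypothetical capacity */

typedef struct {
    atomic_uint lock;                 /* 0 = free, 1 = held */
    uint32_t    head;                 /* count of messages popped so far */
    uint32_t    tail;                 /* count of messages pushed so far */
    uint32_t    msg[MBOX_DEPTH];
} mailbox_sketch_t;

static void mbox_init(mailbox_sketch_t *m)
{
    atomic_init(&m->lock, 0u);
    m->head = 0;
    m->tail = 0;
}

static void mbox_lock(mailbox_sketch_t *m)
{
    /* Swap in 1; if the old value was already 1, someone else holds the lock. */
    while (atomic_exchange_explicit(&m->lock, 1u, memory_order_acquire) != 0u) {
        /* spin */
    }
}

static void mbox_unlock(mailbox_sketch_t *m)
{
    atomic_store_explicit(&m->lock, 0u, memory_order_release);
}

/* Returns 1 on success, 0 if the mailbox is full. */
static int mbox_push(mailbox_sketch_t *m, uint32_t value)
{
    int ok = 0;
    mbox_lock(m);
    if (m->tail - m->head < MBOX_DEPTH) {
        m->msg[m->tail % MBOX_DEPTH] = value;
        m->tail++;
        ok = 1;
    }
    mbox_unlock(m);
    return ok;
}

/* Returns 1 and writes *out on success, 0 if the mailbox is empty. */
static int mbox_pop(mailbox_sketch_t *m, uint32_t *out)
{
    int ok = 0;
    assert(out != 0);
    mbox_lock(m);
    if (m->tail != m->head) {
        *out = m->msg[m->head % MBOX_DEPTH];
        m->head++;
        ok = 1;
    }
    mbox_unlock(m);
    return ok;
}
```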

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace spin-lock linked-list FIFO with a lock-free slot-based design:
each (matrix_id, row_index) pair has exactly one producer writing to a
pre-assigned slot, so no lock is needed. A fence w,w before setting the
valid flag ensures the consumer sees all payload bytes before the slot
becomes visible. Add gemm_comm/fifo test variant and wire it into the
CMakeLists and gen_golden.py.
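
The publish ordering described above can be sketched as follows. The names (`fifo_slot_sketch_t`, `slot_publish`, `slot_try_read`, `slot_index`) and the slot layout are hypothetical. The C11 release fence is used as a portable stand-in for `fence w,w`; it is somewhat stronger than a write-write fence, but it expresses the same requirement that payload writes become visible before the valid flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_WORDS 4u                 /* hypothetical payload size in words */

typedef struct {
    uint32_t    payload[SLOT_WORDS];
    atomic_uint valid;                /* 0 = empty, 1 = payload is readable */
} fifo_slot_sketch_t;

/* Each (matrix_id, row_index) pair maps to exactly one slot, so two
 * producers never write the same slot and no lock is required. */
static size_t slot_index(uint32_t matrix_id, uint32_t row_index,
                         uint32_t rows_per_matrix)
{
    assert(row_index < rows_per_matrix);
    return (size_t)matrix_id * rows_per_matrix + row_index;
}

static void slot_init(fifo_slot_sketch_t *s)
{
    atomic_init(&s->valid, 0u);
}

/* Producer side: write the payload, fence, then set the valid flag. */
static void slot_publish(fifo_slot_sketch_t *s, const uint32_t *data)
{
    for (uint32_t i = 0; i < SLOT_WORDS; i++) {
        s->payload[i] = data[i];
    }
    atomic_thread_fence(memory_order_release);   /* stand-in for fence w,w */
    atomic_store_explicit(&s->valid, 1u, memory_order_relaxed);
}

/* Consumer side: returns 1 and copies the payload if the slot is valid. */
static int slot_try_read(fifo_slot_sketch_t *s, uint32_t *out)
{
    if (atomic_load_explicit(&s->valid, memory_order_acquire) == 0u) {
        return 0;
    }
    for (uint32_t i = 0; i < SLOT_WORDS; i++) {
        out[i] = s->payload[i];
    }
    return 1;
}
```

The sketch treats slots as single-use; how the real design recycles slots, if it does, is not stated in the commit message.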

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace volatile word-copy fifo_push() calls with fifo_push_dma(), which
transfers the payload via iDMA and then publishes metadata with a fence.
Add fifo_slot_publish() to l1_fifo.h to support this split-phase pattern.

Switch from fixed FIFO_BATCH_ROWS to FIFO_BATCH_FRAC (fraction of the
tile's owned rows per push, default 20%) so batch sizes scale with tile
count. Remove debug printf statements throughout.
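
To make the batch sizing concrete, here is a small sketch of how a per-push batch could be derived from a fraction of the tile's owned rows. It assumes the fraction is expressed as an integer percentage and that the result is rounded down and clamped to at least one row; the actual representation and rounding of FIFO_BATCH_FRAC are not stated in the commit message, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rows sent per push: a percentage of the rows this tile owns,
 * rounded down, but never less than one row. */
static uint32_t fifo_batch_rows(uint32_t owned_rows, uint32_t frac_pct)
{
    uint32_t rows;
    assert(frac_pct >= 1u && frac_pct <= 100u);
    rows = (owned_rows * frac_pct) / 100u;
    return rows == 0u ? 1u : rows;
}

/* Number of pushes needed to send all owned rows (last batch may be short). */
static uint32_t fifo_batch_count(uint32_t owned_rows, uint32_t frac_pct)
{
    uint32_t batch;
    if (owned_rows == 0u) {
        return 0u;
    }
    batch = fifo_batch_rows(owned_rows, frac_pct);
    return (owned_rows + batch - 1u) / batch;
}
```

Under these assumptions a tile owning 16 rows at 20% pushes 3 rows at a time in 6 pushes, while a tile owning only 2 rows falls back to single-row pushes.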

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace wait-all-R2-then-compute approach with data-driven partial
accumulation. R3 contributions are now computed as soon as any (R1, R2)
piece pair is available, using K-dimension decomposition:

  R3 += R1[:, k1:k2] × R2[k1:k2, :]

Key changes:
- Add push_r3_to_gemm4() helper to eliminate duplicated R3→GEMM4 logic
- Add gemm3_partial_accum() helper for one partial product computation
- Replace single r2_rows_received counter with per-row r2_received[] tracking
- Track K-progress per R1 batch with r3_k_done[], avoiding double-counting
- Zero R3 once upfront; RedMulE does Y = X*W + Y (accumulate in-place)
- R1 handler: scan for contiguous R2 groups and accumulate each
- R2 handler: immediately accumulate against all present R1 batches

This reduces latency by overlapping computation with communication.
Correctness verified: each (R1, R2) pair processed exactly once.
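
The K-dimension decomposition can be illustrated with a plain C reference loop. This is a host-side sketch, not the RedMulE path used by gemm3_partial_accum(); the function name `gemm_partial_accum_sketch`, the float element type, and the row-major layout are assumptions. It shows that accumulating disjoint K slices in any order into a zeroed R3 yields the full product, provided each slice is applied exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* R3 (m x n) += R1[:, k1:k2] (m x (k2-k1)) * R2[k1:k2, :] ((k2-k1) x n).
 * All matrices are row-major; R1 is m x k and R2 is k x n in full. */
static void gemm_partial_accum_sketch(float *r3, const float *r1, const float *r2,
                                      size_t m, size_t n, size_t k,
                                      size_t k1, size_t k2)
{
    assert(k1 <= k2 && k2 <= k);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = r3[i * n + j];          /* accumulate in place */
            for (size_t kk = k1; kk < k2; kk++) {
                acc += r1[i * k + kk] * r2[kk * n + j];
            }
            r3[i * n + j] = acc;
        }
    }
}
```

Because each call only adds its own slice, R3 must be zeroed once before the first contribution, and the bookkeeping must ensure that no K range is applied twice.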

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch all CI/Makefile GEMM targets to use bash scripts directly with
TEST_NAME env var, add gemm-fifo job to GitHub Actions, and update
CMakeLists.txt to build all five gemm_comm/ variants.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The gemm_comm tests require gen_golden.py-generated test.h files that
are not present in CI. This flag lets CI opt out without removing the
subdirectory entries from CMakeLists.txt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@diaconuccalin diaconuccalin force-pushed the cd/L1_FIFO_communication branch from 13a8ca1 to dc3da0a Compare May 6, 2026 12:01
diaconuccalin and others added 2 commits May 7, 2026 14:49
Instruments barrier waits, input/output DMA, RedMulE compute, FIFO push,
and consumer spin-wait with PERF_DELTA accumulators; prints a summary line
for one representative tile per GEMM group. Also adds the PERF_DELTA macro
to performance_utils.h, guards memset against loop-distribution miscompilation,
and fixes the gen_golden.py path in via_L2/naive/CMakeLists.txt.
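
For illustration, a delta-accumulating timing macro of this kind could look like the sketch below. The definition of PERF_DELTA in performance_utils.h is not shown in this PR page, so the macro name `PERF_DELTA_SKETCH`, its argument order, and the counter source are all hypothetical; `read_cycles_stub` replaces whatever cycle counter the target provides so the example runs on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for a hardware cycle counter (hypothetical). */
static uint32_t fake_cycle_counter;

static uint32_t read_cycles_stub(void)
{
    return fake_cycle_counter;
}

/* Runs stmt and adds the elapsed cycles to acc. Unsigned subtraction
 * keeps the delta correct across a single counter wrap-around. */
#define PERF_DELTA_SKETCH(acc, stmt)                    \
    do {                                                \
        uint32_t perf_t0_ = read_cycles_stub();         \
        stmt;                                           \
        (acc) += read_cycles_stub() - perf_t0_;         \
    } while (0)
```

One accumulator per category (barrier wait, DMA, compute, FIFO push, spin-wait) can then be summed over the run and printed once at the end.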

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces cyc_preamble to measure GEMM1 setup overhead (tile-group
detection, row-range calculation) separately from compute/DMA phases.
Adds test_with_prints.c as an excluded-from-build debug copy; the
CMakeLists filter keeps the normal build clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>