L1 FIFO Communication #20
Draft
diaconuccalin wants to merge 31 commits into
…erted. Cleaned-up gitignore
…the desired dimensions for the test). Removed the "timesteps" mechanism for data tiling. Added CMake pipeline for running the attention test. Added CI test pipeline for attention test
…e and applied formatting to modified files. Added DMA documentation to idma.h for improved VS Code hinting
…t, instead of _Float16
- make build now writes targets/*/include/addr_map/tile_config.h (gitignored) instead of patching tile_addr_map.h in-place; tile_addr_map.h now #includes it
- Added fp16_to_f64() bit-manipulation helper to attention_utils.h for toolchain-agnostic fp16 printing (no soft-float helpers required)
- New test_gemm: 4-GEMM chain with task-level parallelism across tiles (Phase 1: tile 0/1 run GEMM1/GEMM2 in parallel; Phase 2: tile 2 runs GEMM3; Phase 3: tile 3 runs GEMM4); includes gen_golden.py and make gemm-* targets
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
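The commit describes fp16_to_f64() only as a bit-manipulation helper that avoids soft-float routines. A minimal sketch of that idea, assuming IEEE 754 binary16 input, could look like the following. The name fp16_bits_to_f64 and the body are illustrative and are not taken from attention_utils.h.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch (not the PR's code): widen IEEE 754 binary16 bits
 * to a double using integer operations only. */
static double fp16_bits_to_f64(uint16_t h)
{
    uint64_t sign = (uint64_t)(h >> 15) << 63;
    uint32_t exp  = (h >> 10) & 0x1Fu;   /* 5-bit biased exponent */
    uint64_t man  = h & 0x3FFu;          /* 10-bit mantissa */
    uint64_t bits;
    double d;

    if (exp == 0x1Fu) {
        /* Inf or NaN: all-ones exponent, mantissa carried over */
        bits = sign | (0x7FFull << 52) | (man << 42);
    } else if (exp != 0u) {
        /* Normal number: rebias the exponent from 15 to 1023 */
        bits = sign | ((uint64_t)(exp + (1023u - 15u)) << 52) | (man << 42);
    } else if (man == 0u) {
        /* Signed zero */
        bits = sign;
    } else {
        /* Subnormal: shift until the implicit leading 1 appears */
        int e = -14;
        while ((man & 0x400u) == 0u) {
            man <<= 1;
            e--;
        }
        man &= 0x3FFu;
        bits = sign | ((uint64_t)(e + 1023) << 52) | (man << 42);
    }

    memcpy(&d, &bits, sizeof d);         /* reinterpret without aliasing issues */
    return d;
}
```

Because every binary16 value is exactly representable as a double, the widening is lossless, and the result can be printed with whatever double formatting the toolchain already provides.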
Move GEMM test files under gemm/via_l2/ to prepare for multiple communication variants (L2, L1, etc.), and update references in Makefile and CMakeLists.txt accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Create gemm_utils.h with fp16_to_millis and fp16_to_f64
- GEMM test includes gemm_utils.h directly
- Comment out flatatt/flatatt_no_data_tiling in CMakeLists.txt
- Remove all compiler-conditional sed toggling from Makefile build target
- Comment out attention_utils.h include in tile.h
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restructure the flatatt test to use an embedded test header instead of Python-generated golden data. Remove flatatt_no_data_tiling variant and the flatatt CI workflow. Update attention_utils.h (v1/v2) and eu_isa_utils.h with revised utility functions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch GCC_MULTILIB toolchain to riscv64-unknown-elf from PATH instead of hardcoded $HOME/riscv/bin paths. Use GCC_PULP compiler in GEMM CI. Add configurable seed parameter to Makefile and simplify .gitignore patterns for generated test inputs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… READMEs
- Rename tests/magia/mesh/gemm/ → gemm_comm/, via_l2/ → via_l2_naive/
- Move gen_golden.py to gemm_comm root; update Makefile and usage strings
- Move gemm_utils.h from targets/magia_v2/include/utils/ to gemm_comm/; update include path in CMakeLists and test.c
- Add README.md to gemm_comm/ (with test diagram) and via_l2_naive/ (phase breakdown)
- Update .gitignore pattern and mesh CMakeLists.txt accordingly
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces a new test variant where GEMM4 tiles prefetch M5 into L1 during Phase 1 (overlapping with GEMM1/GEMM2 computation), eliminating the M5 DMA in Phase 3. gen_golden.py now writes test.h to both the naive and interlaced include directories. Adds gemm-interlaced-test and gemm-interlaced-ci Makefile targets. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Moves tests into a shared via_L2/ subdirectory, updates CMakeLists include paths (../ → ../../ to reach gemm_comm/), gen_golden.py output paths, .gitignore glob, and adds the interlaced job to CI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same 4-GEMM chain as via_L2/naive, but intermediate results (R1, R2, R3)
are DMA'd directly between producing and consuming tiles' L1 memories
instead of staging through shared L2 buffers. Only M1-M5 inputs come from
L2; the final output O is written back to L2 for validation.
Scatter logic: each producing tile iterates over potential consuming tiles,
computes the row overlap, and issues idma_memcpy_1d(dir=1) targeting the
remote tile's L1 address derived from get_l1_base(hartid).
Also updates gen_golden.py to write test.h to via_L1/naive/include/,
adds gemm-l1-naive-{run,test,ci} Makefile targets, and adds the
gemm-l1-naive CI job to gemm-ci.yml.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
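The scatter logic in this commit (iterate over consumers, compute the row overlap, issue one transfer per non-empty overlap) can be sketched on the host as below. This is a sketch under stated assumptions: the even row partitioning in tile_rows(), the copy_fn callback standing in for the DMA call, and all names are hypothetical. The real code uses idma_memcpy_1d and get_l1_base, whose signatures are not reproduced here.

```c
#include <stddef.h>

typedef struct { size_t lo, hi; } row_range_t;   /* half-open [lo, hi) */

/* Intersection of two row ranges; an empty result has hi == lo. */
static row_range_t row_overlap(row_range_t a, row_range_t b)
{
    row_range_t r;
    r.lo = a.lo > b.lo ? a.lo : b.lo;
    r.hi = a.hi < b.hi ? a.hi : b.hi;
    if (r.hi < r.lo)
        r.hi = r.lo;
    return r;
}

/* Assumed partitioning: rows split evenly, remainder to the first tiles. */
static row_range_t tile_rows(size_t n_rows, size_t n_tiles, size_t idx)
{
    size_t base = n_rows / n_tiles, rem = n_rows % n_tiles;
    row_range_t r;
    r.lo = idx * base + (idx < rem ? idx : rem);
    r.hi = r.lo + base + (idx < rem ? 1 : 0);
    return r;
}

/* Hypothetical stand-in for the remote-L1 DMA transfer. */
typedef void (*copy_fn)(size_t consumer, size_t first_row, size_t n_rows, void *ctx);

/* Send each consumer the produced rows it owns; returns total rows sent. */
static size_t scatter_rows(row_range_t produced, size_t n_rows,
                           size_t n_consumers, copy_fn copy, void *ctx)
{
    size_t sent = 0;
    for (size_t c = 0; c < n_consumers; c++) {
        row_range_t ov = row_overlap(produced, tile_rows(n_rows, n_consumers, c));
        if (ov.hi > ov.lo) {
            if (copy)
                copy(c, ov.lo, ov.hi - ov.lo, ctx);
            sent += ov.hi - ov.lo;
        }
    }
    return sent;
}
```

With 10 rows, a producer group of 2 tiles and a consumer group of 3 tiles, producer 0 owns rows [0, 5) and sends 4 rows to consumer 0 and 1 row to consumer 1.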
Combines direct L1-to-L1 intermediate result scatter (from via_L1/naive) with interlaced M5 prefetch scheduling (from via_L2/interlaced): GEMM4 tiles prefetch M5 during Phase 1 behind a local barrier, so Phase 3 needs no L2 loads at all. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces l1_fifo.h — a per-tile FIFO mailbox in L1 memory, lock-protected via amoswap.w, supporting cross-tile message passing without touching L2. Includes it in tile.h as a standard utility. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
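For readers unfamiliar with swap-based locking, the following is a minimal host-side sketch of a spin lock built on an atomic exchange. It assumes C11 atomics as a portable stand-in; on RISC-V targets with the A extension, atomic_exchange typically lowers to an amoswap.w instruction. The type and function names are hypothetical and do not come from l1_fifo.h.

```c
#include <stdatomic.h>

typedef struct { atomic_uint locked; } l1_lock_t;   /* 0 = free, 1 = held */

/* Swap in 1; the lock was acquired only if the old value was 0. */
static int l1_lock_try(l1_lock_t *l)
{
    return atomic_exchange_explicit(&l->locked, 1u, memory_order_acquire) == 0u;
}

/* Spin until the swap observes a free lock. */
static void l1_lock_acquire(l1_lock_t *l)
{
    while (!l1_lock_try(l)) {
        /* busy-wait */
    }
}

/* Release with a store so earlier writes become visible first. */
static void l1_lock_release(l1_lock_t *l)
{
    atomic_store_explicit(&l->locked, 0u, memory_order_release);
}

/* Single-threaded walk-through: returns 1 if the lock behaves as expected. */
static int l1_lock_demo(void)
{
    static l1_lock_t lock;              /* zero-initialised, so free */
    int first  = l1_lock_try(&lock);    /* succeeds */
    int second = l1_lock_try(&lock);    /* fails, already held */
    l1_lock_release(&lock);
    l1_lock_acquire(&lock);             /* succeeds immediately */
    l1_lock_release(&lock);
    return first && !second;
}
```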
Replace spin-lock linked-list FIFO with a lock-free slot-based design: each (matrix_id, row_index) pair has exactly one producer writing to a pre-assigned slot, so no lock is needed. A fence w,w before setting the valid flag ensures the consumer sees all payload bytes before the slot becomes visible. Add gemm_comm/fifo test variant and wire it into the CMakeLists and gen_golden.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
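The single-producer slot protocol described here (write payload, order the writes, then set the valid flag) can be sketched with C11 atomics. This is an assumption-laden illustration: the slot layout, SLOT_WORDS, and the function names are hypothetical, and the release store stands in for the explicit fence w,w plus flag write used in the PR.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define SLOT_WORDS 8u                   /* hypothetical payload size */

typedef struct {
    uint32_t    payload[SLOT_WORDS];
    uint16_t    matrix_id;
    uint16_t    row_index;
    atomic_uint valid;                  /* 0 = empty, 1 = published */
} fifo_slot_t;

/* Producer: fill the slot, then publish it with release ordering so the
 * payload writes cannot be observed after the flag. */
static void slot_push(fifo_slot_t *s, uint16_t matrix_id, uint16_t row_index,
                      const uint32_t *src)
{
    memcpy(s->payload, src, sizeof s->payload);
    s->matrix_id = matrix_id;
    s->row_index = row_index;
    atomic_store_explicit(&s->valid, 1u, memory_order_release);
}

/* Consumer: returns 1 and copies the payload if the slot is published. */
static int slot_try_pop(fifo_slot_t *s, uint32_t *dst)
{
    if (atomic_load_explicit(&s->valid, memory_order_acquire) == 0u)
        return 0;
    memcpy(dst, s->payload, sizeof s->payload);
    atomic_store_explicit(&s->valid, 0u, memory_order_release);
    return 1;
}

/* Single-threaded walk-through: returns 1 if push/pop round-trips. */
static int slot_demo(void)
{
    static fifo_slot_t slot;            /* zero-initialised, so empty */
    uint32_t in[SLOT_WORDS], out[SLOT_WORDS];
    for (uint32_t i = 0; i < SLOT_WORDS; i++)
        in[i] = 100u + i;
    if (slot_try_pop(&slot, out))       /* nothing published yet */
        return 0;
    slot_push(&slot, 2, 5, in);
    if (!slot_try_pop(&slot, out))
        return 0;
    return memcmp(in, out, sizeof in) == 0 && slot.row_index == 5;
}
```

No lock is needed because each slot has exactly one writer; the only shared state that both sides modify is the valid flag.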
Replace volatile word-copy fifo_push() calls with fifo_push_dma(), which transfers the payload via iDMA and then publishes metadata with a fence. Add fifo_slot_publish() to l1_fifo.h to support this split-phase pattern. Switch from fixed FIFO_BATCH_ROWS to FIFO_BATCH_FRAC (fraction of the tile's owned rows per push, default 20%) so batch sizes scale with tile count. Remove debug printf statements throughout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
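How a fractional batch size scales with the number of rows a tile owns can be illustrated with a small calculation. The sketch below assumes the fraction is expressed as an integer percentage, rounds down, and never returns a batch smaller than one row; the actual rounding used with FIFO_BATCH_FRAC is not stated in the commit, and the function names are hypothetical.

```c
#include <stddef.h>

/* Rows per push for a tile owning `owned_rows` rows (assumed rounding:
 * truncate, with a floor of one row). */
static size_t fifo_batch_rows(size_t owned_rows, unsigned frac_percent)
{
    size_t batch = owned_rows * frac_percent / 100u;
    return batch == 0 ? 1 : batch;
}

/* Number of pushes needed to send all owned rows; the last push may be
 * shorter than the others. */
static size_t fifo_num_pushes(size_t owned_rows, unsigned frac_percent)
{
    size_t batch = fifo_batch_rows(owned_rows, frac_percent);
    return (owned_rows + batch - 1) / batch;
}
```

Under these assumptions, a tile owning 10 rows at 20% pushes 2 rows at a time in 5 pushes, while a tile owning 3 rows falls back to single-row pushes.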
Replace wait-all-R2-then-compute approach with data-driven partial accumulation. R3 contributions are now computed as soon as any (R1, R2) piece pair is available, using K-dimension decomposition:
R3 += R1[:, k1:k2] × R2[k1:k2, :]
Key changes:
- Add push_r3_to_gemm4() helper to eliminate duplicated R3→GEMM4 logic
- Add gemm3_partial_accum() helper for one partial product computation
- Replace single r2_rows_received counter with per-row r2_received[] tracking
- Track K-progress per R1 batch with r3_k_done[], avoiding double-counting
- Zero R3 once upfront; RedMule does Y = X*W + Y (accumulate in-place)
- R1 handler: scan for contiguous R2 groups and accumulate each
- R2 handler: immediately accumulate against all present R1 batches
This reduces latency by overlapping computation with communication. Correctness verified: each (R1, R2) pair processed exactly once.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
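The K-dimension decomposition relies on the fact that a matrix product is the sum of the partial products over disjoint K-slices, in any order. A plain C reference version of one partial accumulation, with a small check that out-of-order slices reproduce the full product, is sketched below. It is a scalar stand-in written for illustration, not the RedMulE-backed gemm3_partial_accum() from the PR, and the names are hypothetical.

```c
#include <stddef.h>

/* Y (M x N) += X[:, k1:k2] * W[k1:k2, :], all matrices row-major.
 * X is M x K and W is K x N. */
static void partial_accum(float *Y, const float *X, const float *W,
                          size_t M, size_t K, size_t N,
                          size_t k1, size_t k2)
{
    for (size_t i = 0; i < M; i++)
        for (size_t k = k1; k < k2; k++)
            for (size_t j = 0; j < N; j++)
                Y[i * N + j] += X[i * K + k] * W[k * N + j];
}

/* Accumulate the K-slices out of order and compare with the full product.
 * Returns 1 on a match. */
static int partial_accum_demo(void)
{
    const float X[2 * 3] = { 1, 2, 3, 4, 5, 6 };
    const float W[3 * 2] = { 7, 8, 9, 10, 11, 12 };
    const float ref[2 * 2] = { 58, 64, 139, 154 };   /* X * W */
    float Y[2 * 2] = { 0 };                          /* zeroed once upfront */

    partial_accum(Y, X, W, 2, 3, 2, 2, 3);           /* slice k = 2 arrives first */
    partial_accum(Y, X, W, 2, 3, 2, 0, 2);           /* then slice k = 0..1 */

    for (size_t i = 0; i < 4; i++)
        if (Y[i] != ref[i])
            return 0;
    return 1;
}
```

The result is only correct if each K-slice is accumulated exactly once, which is why the commit tracks per-row arrival and per-batch K-progress.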
Switch all CI/Makefile GEMM targets to use bash scripts directly with TEST_NAME env var, add gemm-fifo job to GitHub Actions, and update CMakeLists.txt to build all five gemm_comm/ variants. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The gemm_comm tests require gen_golden.py-generated test.h files that are not present in CI. This flag lets CI opt out without removing the subdirectory entries from CMakeLists.txt. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
13a8ca1 to dc3da0a
Instruments barrier waits, input/output DMA, RedMulE compute, FIFO push, and consumer spin-wait with PERF_DELTA accumulators; prints a summary line for one representative tile per GEMM group. Also adds the PERF_DELTA macro to performance_utils.h, guards memset against loop-distribution miscompilation, and fixes the gen_golden.py path in via_L2/naive/CMakeLists.txt. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
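The accumulator pattern described here (read a cycle counter before and after a region, add the difference to a named total) can be sketched as a macro. The definition below is an assumption for illustration: the real PERF_DELTA signature in performance_utils.h is not shown in this conversation, and read_cycles() and fake_work() are hypothetical host-side stubs replacing the hardware counter.

```c
#include <stdint.h>

/* Hypothetical stand-ins for a hardware cycle counter and timed work. */
static uint32_t fake_cycles;
static uint32_t read_cycles(void)    { return fake_cycles; }
static void     fake_work(uint32_t n) { fake_cycles += n; }

/* Run `stmt` and add the elapsed cycles to `acc`. Unsigned subtraction
 * keeps the delta correct across a single counter wrap-around. */
#define PERF_DELTA_SKETCH(acc, stmt)               \
    do {                                           \
        uint32_t perf_t0_ = read_cycles();         \
        stmt;                                      \
        (acc) += read_cycles() - perf_t0_;         \
    } while (0)

/* Two accumulators, three timed regions; encodes both totals in the
 * return value as cyc_dma * 100 + cyc_compute. */
static uint32_t perf_demo(void)
{
    uint32_t cyc_dma = 0, cyc_compute = 0;
    PERF_DELTA_SKETCH(cyc_dma, fake_work(5));
    PERF_DELTA_SKETCH(cyc_compute, fake_work(11));
    PERF_DELTA_SKETCH(cyc_dma, fake_work(7));
    return cyc_dma * 100u + cyc_compute;
}
```

Keeping one accumulator per category (barrier, DMA, compute, push, spin-wait) lets a single summary line per tile show where the cycles went.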
Introduces cyc_preamble to measure GEMM1 setup overhead (tile-group detection, row-range calculation) separately from compute/DMA phases. Adds test_with_prints.c as an excluded-from-build debug copy; the CMakeLists filter keeps the normal build clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No description provided.