dilithium: stream matrix A to cut ML-DSA-87 peak stack for embedded#90
Open
n13 wants to merge 3 commits into
ML-DSA-87 keygen/sign/verify materialized the full ExpandA matrix ([Polyvecl; K], ~56 KB) on the stack, which overflows the small task stacks on the Keystone3/ForgeBox (Cortex-M) target and freezes the device.

Add `matrix_pointwise_montgomery_streamed`, which computes t = A*v while regenerating each A[i][j] from rho on the fly (peak extra working set is two polynomials, ~2 KB). Accumulation order is identical to the materialized path, so output is bit-for-bit unchanged (verified against the NIST KAT).

- keypair/verify: stream A instead of matrix_expand + pointwise
- SigningContext: store the 32-byte public seed rho instead of the expanded matrix; A is streamed per rejection-sampling attempt

Tooling to catch stack regressions before flashing:

- examples/stack_probe.rs: painted-stack high-water probe (psm dev-dep)
- stack-check.sh: thumbv7em-none-eabihf build + llvm-readobj stack-sizes report with an optional budget check
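The streaming idea in the commit message can be illustrated with a small sketch. This is not the PR's code: `expand_entry` is a hypothetical stand-in for the real ExpandA sampler (it uses a simple integer mixer, not SHAKE-based rejection sampling), and plain modular arithmetic replaces Montgomery reduction. Only the shapes (8 x 7 polynomials of 256 `i32` coefficients, matching ML-DSA-87) and the loop order are meant to mirror the description.

```rust
// Shapes for ML-DSA-87: K rows, L columns, N coefficients per polynomial.
const K: usize = 8;
const L: usize = 7;
const N: usize = 256;
const Q: i64 = 8_380_417;

type Poly = [i32; N];

// Size arithmetic behind the "~56 KB vs ~2 KB" figures.
const POLY_BYTES: usize = core::mem::size_of::<Poly>(); // 1 KiB
const MATRIX_BYTES: usize = core::mem::size_of::<[[Poly; L]; K]>(); // 56 KiB

/// Hypothetical stand-in for the per-entry ExpandA sampler.
/// Deterministic in (rho, i, j), which is all the sketch needs.
fn expand_entry(rho: &[u8; 32], i: usize, j: usize) -> Poly {
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15 ^ (((i as u64) << 8) | (j as u64));
    for &b in rho {
        state = state
            .wrapping_mul(6_364_136_223_846_793_005)
            .wrapping_add(b as u64 + 1);
    }
    let mut p = [0i32; N];
    for c in p.iter_mut() {
        state = state
            .wrapping_mul(6_364_136_223_846_793_005)
            .wrapping_add(1_442_695_040_888_963_407);
        *c = ((state >> 33) % Q as u64) as i32;
    }
    p
}

/// acc += a * b coefficient-wise mod Q (plain reduction, not Montgomery).
fn pointwise_acc(acc: &mut Poly, a: &Poly, b: &Poly) {
    for n in 0..N {
        let prod = (a[n] as i64 * b[n] as i64) % Q;
        acc[n] = ((acc[n] as i64 + prod) % Q) as i32;
    }
}

/// Materialized path: the whole matrix lives on the stack at once.
fn mat_vec_materialized(rho: &[u8; 32], v: &[Poly; L]) -> [Poly; K] {
    let mut a = [[[0i32; N]; L]; K];
    for i in 0..K {
        for j in 0..L {
            a[i][j] = expand_entry(rho, i, j);
        }
    }
    let mut t = [[0i32; N]; K];
    for i in 0..K {
        for j in 0..L {
            pointwise_acc(&mut t[i], &a[i][j], &v[j]);
        }
    }
    t
}

/// Streamed path: regenerate each entry, use it, drop it.
/// Same (i, j) order as above, so the accumulation is identical.
fn mat_vec_streamed(rho: &[u8; 32], v: &[Poly; L]) -> [Poly; K] {
    let mut t = [[0i32; N]; K];
    for i in 0..K {
        for j in 0..L {
            let a_ij = expand_entry(rho, i, j); // only one entry live at a time
            pointwise_acc(&mut t[i], &a_ij, &v[j]);
        }
    }
    t
}
```

Calling both functions with the same `rho` and `v` should give equal results, because the streamed loop visits entries in the same order as the materialized one. The cost is that every call re-runs the expansion, which is the "small recompute" traded for the stack saving.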
Summary
- ML-DSA-87 keygen/sign/verify materialized the full ExpandA matrix (`[Polyvecl; K]`, ~56 KB) on the stack. On the Keystone3 / ForgeBox (Cortex-M, `thumbv7em-none-eabihf`) target this overflows the task stacks and freezes the device when deriving a Quantus address or signing.
- Adds `matrix_pointwise_montgomery_streamed`, which computes `t = A * v` while regenerating each `A[i][j]` from `rho` on the fly. Peak extra working set drops from ~56 KB to ~2 KB (two polynomials). Accumulation order is identical to the materialized path, so output is bit-for-bit unchanged.
- `SigningContext` now stores the 32-byte public seed `rho` instead of the expanded matrix; `A` is streamed per rejection-sampling attempt (small recompute, large stack saving).

Measured impact (peak stack)
[Table: peak-stack measurements; the values did not survive extraction. Rows: sign (thumbv7em frame), verify, keypair, sign (host painted-stack probe), keygen (host), verify (host).]

Pre-flight stack tooling (new)
- `dilithium/examples/stack_probe.rs`: painted-stack high-water probe (`psm` dev-dependency) for host peak-stack numbers.
- `stack-check.sh`: builds dilithium for `thumbv7em-none-eabihf` with `-Z emit-stack-sizes` and reports per-function frames via `llvm-readobj --stack-sizes`, with an optional KB budget check to catch regressions (e.g. accidentally re-materializing the matrix) before flashing.

Correctness
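- Output is bit-for-bit unchanged relative to the materialized path (verified against the NIST KAT in the tests/integration suite).

Test plan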
- `cargo test` (workspace, incl. KAT integration tests) passes
- `cargo run --release --example stack_probe -p qp-rusty-crystals-dilithium`
- `./stack-check.sh` reports expected frames

Notes
HD derivation / address generation continue to use rusty-crystals unchanged. A deeper BRS22 buffer-reuse refactor (to fit small SRAM task stacks directly, without the dedicated PSRAM crypto task on the firmware side) is a possible follow-up.
Note
Medium Risk
Core cryptographic paths changed (matrix multiply implementation) but claim bit-identical results; signing does more recomputation per attempt. Medium risk due to embedded-critical stack fix and algorithm path swap, mitigated by KAT/tests and equivalence intent.
Overview
ML-DSA-87 keygen, sign, and verify no longer allocate the full ExpandA matrix (~56 KB on stack). A new `matrix_pointwise_montgomery_streamed` computes `A·v` by expanding each `A[i][j]` from `rho` on the fly (~2 KB scratch), with the same accumulation order as before so outputs stay bit-identical.

Signing keeps the 32-byte public seed `rho` in `SigningContext` instead of a materialized matrix; each rejection attempt re-streams `A` from `rho`.

Tooling: `stack_probe` (host painted-stack peak measurement via `psm`) and `stack-check.sh` (Cortex-M `thumbv7em` build with `-Z emit-stack-sizes`, optional KB budget) help catch stack regressions before flashing embedded targets.

Reviewed by Cursor Bugbot for commit 24259fa.
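For readers unfamiliar with painted-stack measurement, the sketch below shows the general idea on a simulated region. It is not the PR's `stack_probe.rs` and does not use `psm`: the byte array stands in for a real stack, and `simulate_use` is a hypothetical workload that overwrites bytes downward from the top, as frames would on a descending stack. The sentinel value `0xAA` is also an assumption.

```rust
/// Sentinel byte used to "paint" the region before the workload runs.
const PAINT: u8 = 0xAA;

/// Fill the whole region with the sentinel.
fn paint(region: &mut [u8]) {
    region.fill(PAINT);
}

/// Bytes used, measured from the top (end) of the region.
/// Index 0 is the lowest address; a descending stack overwrites from the end,
/// so the untouched prefix is whatever still holds the sentinel.
fn high_water(region: &[u8]) -> usize {
    let untouched = region.iter().take_while(|&&b| b == PAINT).count();
    region.len() - untouched
}

/// Hypothetical workload: overwrite `depth` bytes down from the top,
/// mimicking a call chain whose frames total `depth` bytes.
/// Panics if `depth` exceeds the region length.
fn simulate_use(region: &mut [u8], depth: usize) {
    let top = region.len();
    region[top - depth..top].fill(0x00);
}
```

After painting a region and running several workloads of different depths, `high_water` reports the deepest one, since shallower calls only rewrite bytes that were already overwritten. One known limitation of the approach: a workload that happens to write the sentinel value at its deepest point would be under-counted.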