dilithium: stream matrix A to cut ML-DSA-87 peak stack for embedded #90

Open
n13 wants to merge 3 commits into master from stack-streaming-matrix

Conversation

n13 commented Jun 18, 2026
Contributor

Summary

  • ML-DSA-87 keygen/sign/verify currently materialize the full ExpandA matrix ([Polyvecl; K], ~56 KB) on the stack. On the Keystone3 / ForgeBox (Cortex-M, thumbv7em-none-eabihf) target this overflows the task stacks and freezes the device when deriving a Quantus address or signing.
  • Add matrix_pointwise_montgomery_streamed, which computes t = A * v while regenerating each A[i][j] from rho on the fly. Peak extra working set drops from ~56 KB to ~2 KB (two polynomials). Accumulation order is identical to the materialized path, so output is bit-for-bit unchanged.
  • SigningContext now stores the 32-byte public seed rho instead of the expanded matrix; A is streamed per rejection-sampling attempt (small recompute, large stack saving).
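
The streaming idea in the bullets above can be sketched in a few lines. This is a toy model and not the crate's code: `expand_entry`, `pointwise_acc`, `matvec_materialized` and `matvec_streamed` are hypothetical names, the dimensions are shrunk, a non-cryptographic LCG stands in for the real XOF-based ExpandA sampler, and plain modular multiplication stands in for Montgomery arithmetic. It only illustrates why keeping the same (i, j) accumulation order yields identical output while holding a single matrix entry at a time.

```rust
// Toy dimensions; ML-DSA-87 uses K = 8, L = 7, N = 256.
const K: usize = 2;
const L: usize = 3;
const N: usize = 4;
const Q: i64 = 8_380_417; // ML-DSA modulus

type Poly = [i64; N];

/// Hypothetical stand-in for the per-entry ExpandA sampler: derives A[i][j]
/// deterministically from rho. Uses an LCG, which is NOT cryptographic.
fn expand_entry(rho: &[u8; 32], i: usize, j: usize) -> Poly {
    let mut state: u64 = ((i as u64) << 8) | j as u64;
    for &b in rho {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(b as u64 + 1);
    }
    let mut p = [0i64; N];
    for c in p.iter_mut() {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        *c = ((state >> 33) % Q as u64) as i64;
    }
    p
}

/// acc += a * v, coefficient-wise mod Q (toy replacement for the
/// Montgomery pointwise multiply-accumulate).
fn pointwise_acc(acc: &mut Poly, a: &Poly, v: &Poly) {
    for n in 0..N {
        acc[n] = (acc[n] + a[n] * v[n]) % Q;
    }
}

/// Materialized path: build all K*L entries first, then multiply.
fn matvec_materialized(rho: &[u8; 32], v: &[Poly; L]) -> [Poly; K] {
    let mut a = [[[0i64; N]; L]; K];
    for i in 0..K {
        for j in 0..L {
            a[i][j] = expand_entry(rho, i, j);
        }
    }
    let mut t = [[0i64; N]; K];
    for i in 0..K {
        for j in 0..L {
            pointwise_acc(&mut t[i], &a[i][j], &v[j]);
        }
    }
    t
}

/// Streamed path: regenerate each A[i][j] right before use, so only one
/// entry is live at a time. The loop order matches the materialized path.
fn matvec_streamed(rho: &[u8; 32], v: &[Poly; L]) -> [Poly; K] {
    let mut t = [[0i64; N]; K];
    for i in 0..K {
        for j in 0..L {
            let a_ij = expand_entry(rho, i, j);
            pointwise_acc(&mut t[i], &a_ij, &v[j]);
        }
    }
    t
}
```

Calling both functions with the same `rho` and `v` should return equal arrays, since the only difference is when each entry is generated, not the order in which products are accumulated.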

Measured impact (peak stack)

thumbv7em frame (one figure per op): sign ~109 KB, verify ~50 KB, keypair ~39 KB

| op | before | after |
|---|---|---|
| sign (host painted-stack probe) | ~309 KB | ~157 KB |
| keygen (host) | ~173 KB | ~69 KB |
| verify (host) | ~188 KB | ~84 KB |

Pre-flight stack tooling (new)

  • dilithium/examples/stack_probe.rs — painted-stack high-water probe (psm dev-dependency) for host peak-stack numbers.
  • stack-check.sh — builds dilithium for thumbv7em-none-eabihf with -Z emit-stack-sizes and reports per-function frames via llvm-readobj --stack-sizes, with an optional KB budget check to catch regressions (e.g. accidentally re-materializing the matrix) before flashing.
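
As an illustration of the budget check idea, here is a minimal sketch in Rust. It is not the contents of stack-check.sh: `parse_frames` and `over_budget` are hypothetical helpers, and the input is assumed to be already reduced to one `<bytes> <function>` pair per line rather than raw `llvm-readobj` output.

```rust
/// Parses lines of the assumed form "<bytes> <function>".
/// Lines that do not match are skipped.
fn parse_frames(report: &str) -> Vec<(&str, u64)> {
    report
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let size = parts.next()?.parse::<u64>().ok()?;
            let name = parts.next()?;
            Some((name, size))
        })
        .collect()
}

/// Returns the functions whose frame exceeds `budget_kb` (1 KB = 1024 bytes),
/// largest first. An empty result means the budget holds.
fn over_budget<'a>(frames: &[(&'a str, u64)], budget_kb: u64) -> Vec<(&'a str, u64)> {
    let limit = budget_kb * 1024;
    let mut hits: Vec<(&'a str, u64)> = frames
        .iter()
        .copied()
        .filter(|&(_, size)| size > limit)
        .collect();
    hits.sort_by(|a, b| b.1.cmp(&a.1));
    hits
}
```

A caller could exit non-zero whenever `over_budget` returns a non-empty list, which is one way to turn a frame report into a pre-flash regression gate.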

Correctness

  • Verified byte-identical keygen/sign/verify against the NIST KAT (tests/ integration suite).

Test plan

  • cargo test (workspace, incl. KAT integration tests) passes
  • cargo run --release --example stack_probe -p qp-rusty-crystals-dilithium
  • ./stack-check.sh reports expected frames
  • On-device: select Quantus, enter PIN, derive address + sign a tx without freeze

Notes

HD derivation / address generation continue to use rusty-crystals unchanged. A deeper BRS22 buffer-reuse refactor (to fit small SRAM task stacks directly, without the dedicated PSRAM crypto task on the firmware side) is a possible follow-up.


Note

Medium Risk
The core cryptographic paths change (the matrix multiply implementation), though the PR claims bit-identical results; signing also does more recomputation per attempt. The risk is medium because this is an embedded-critical stack fix that swaps the algorithm path. It is mitigated by the KAT tests and by the streamed path being designed to be equivalent to the materialized one.

Overview
ML-DSA-87 keygen, sign, and verify no longer allocate the full ExpandA matrix (~56 KB on stack). A new matrix_pointwise_montgomery_streamed computes A·v by expanding each A[i][j] from rho on the fly (~2 KB scratch), with the same accumulation order as before so outputs stay bit-identical.

Signing keeps the 32-byte public seed rho in SigningContext instead of a materialized matrix; each rejection attempt re-streams A from rho.
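
The size and recompute trade-off can be checked with some quick arithmetic. The sketch below assumes `i32` coefficients and uses hypothetical struct layouts (`MaterializedContext`, `SeedContext`) that only model the matrix versus seed storage; the real `SigningContext` has other fields not shown here. K = 8, L = 7 and N = 256 are the ML-DSA-87 parameters.

```rust
use core::mem::size_of;

const N: usize = 256; // coefficients per polynomial
const K: usize = 8; // ML-DSA-87 matrix rows
const L: usize = 7; // ML-DSA-87 matrix columns

type Poly = [i32; N]; // assumption: 4-byte coefficients

/// Hypothetical "before" layout: the context owns the expanded matrix.
#[allow(dead_code)]
struct MaterializedContext {
    matrix: [[Poly; L]; K],
}

/// Hypothetical "after" layout: the context owns only the public seed.
#[allow(dead_code)]
struct SeedContext {
    rho: [u8; 32],
}

/// Scratch for the streamed product as described: two polynomials.
const STREAM_SCRATCH_BYTES: usize = 2 * size_of::<Poly>();

/// Entry expansions when the matrix is built once per signing call.
fn expansions_materialized(_attempts: usize) -> usize {
    K * L
}

/// Entry expansions when A is re-streamed on every rejection attempt.
fn expansions_streamed(attempts: usize) -> usize {
    attempts * K * L
}
```

Under these assumptions the matrix is 8 × 7 × 256 × 4 = 57,344 bytes (56 KiB), the seed is 32 bytes, and the scratch is 2,048 bytes, which lines up with the ~56 KB and ~2 KB figures quoted. The cost is K × L = 56 entry expansions per attempt instead of 56 per signing call.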

Tooling: stack_probe (host painted-stack peak measurement via psm) and stack-check.sh (Cortex-M thumbv7em build with -Z emit-stack-sizes, optional KB budget) help catch stack regressions before flashing embedded targets.

Reviewed by Cursor Bugbot for commit 24259fa.

ML-DSA-87 keygen/sign/verify materialized the full ExpandA matrix
([Polyvecl; K], ~56 KB) on the stack, which overflows the small task
stacks on the Keystone3/ForgeBox (Cortex-M) target and freezes the
device.

Add `matrix_pointwise_montgomery_streamed`, which computes t = A*v while
regenerating each A[i][j] from rho on the fly (peak extra working set is
two polynomials, ~2 KB). Accumulation order is identical to the
materialized path, so output is bit-for-bit unchanged (verified against
the NIST KAT).

- keypair/verify: stream A instead of matrix_expand + pointwise
- SigningContext: store the 32-byte public seed rho instead of the
  expanded matrix; A is streamed per rejection-sampling attempt

Tooling to catch stack regressions before flashing:
- examples/stack_probe.rs: painted-stack high-water probe (psm dev-dep)
- stack-check.sh: thumbv7em-none-eabihf build + llvm-readobj stack-sizes
  report with an optional budget check
Comment thread dilithium/src/polyvec.rs Dismissed
Comment thread dilithium/src/polyvec.rs Dismissed
3 participants