diff --git a/docs/Lab/workflow_facade.md b/docs/Lab/workflow_facade.md
index 340d214d4..602890e44 100644
--- a/docs/Lab/workflow_facade.md
+++ b/docs/Lab/workflow_facade.md
@@ -68,7 +68,7 @@ traces out of IPC payloads.

 - [Analyzer API](analyzer_api.md) - the plugin-registry primitive used by
   per-crate plugin traits.
-- [CLI reference](cli.md) - the binaries currently shipping; the old
+- [CLI reference](cli.md) - the binaries currently available; the old
   `pccx_analyze` umbrella does not exist today.

 ## Cite This Page
diff --git a/docs/evidence/no-unsupported-claims-policy.md b/docs/evidence/no-unsupported-claims-policy.md
index a590c1fc1..b483257b4 100644
--- a/docs/evidence/no-unsupported-claims-policy.md
+++ b/docs/evidence/no-unsupported-claims-policy.md
@@ -15,9 +15,9 @@ PCCX™ public docs and PRs must not claim any of the following without
 measured, reproducible evidence (each phrase is the *exact* claim form that
 requires evidence):

-- on-board KV260 inference operability
+- board inference operability
 - end-to-end Gemma 3N E4B runtime on KV260
-- numeric tokens-per-second targets (e.g. 20 tok/s)
+- numeric throughput claims or targets without measurement
 - timing-closure completion
 - bitstream-success outcomes
 - production-readiness
diff --git a/docs/ip/trademark-filing-log.md b/docs/ip/trademark-filing-log.md
index 95710df2c..c45530b62 100644
--- a/docs/ip/trademark-filing-log.md
+++ b/docs/ip/trademark-filing-log.md
@@ -60,7 +60,7 @@ trademark docket issue in `pccxai/pccx` for the working list.

 - Use `PCCX™` on first prominent mention in any new public-facing
   document.
-- Do not use `PCCX®`, `registered trademark`, or `등록상표` until
+- Use only the `™` form; do not use registered-mark symbols or wording until
   registration is granted in the relevant jurisdiction and this policy is
   explicitly updated.
 - Treat `PCCX Compatible` and `PCCX Certified` as future
diff --git a/docs/ip/trademarks.md b/docs/ip/trademarks.md
index 92194c4d4..108ab6fab 100644
--- a/docs/ip/trademarks.md
+++ b/docs/ip/trademarks.md
@@ -17,7 +17,7 @@ here are claims of use, not statements of registration.
 For the canonical entry point and the live filing docket, see
 [`TRADEMARKS.md`](../../TRADEMARKS.md) (root) and
 [`trademark-filing-log.md`](trademark-filing-log.md) (this directory).
-Use `PCCX™` on first prominent mention; do not use `PCCX®` until
+Use `PCCX™` on first prominent mention; do not use registered-mark symbols until
 registration is granted in the relevant jurisdiction.

 ## Claimed marks
diff --git a/docs/legal/README.md b/docs/legal/README.md
index 3925ada78..c9fa8c93a 100644
--- a/docs/legal/README.md
+++ b/docs/legal/README.md
@@ -16,7 +16,7 @@ decisions.
 ## Trademark

 - [`../../TRADEMARKS.md`](../../TRADEMARKS.md) — canonical trademark
-  policy. Use `PCCX™`. Do **not** use `PCCX®` until and unless
+  policy. Use `PCCX™`; avoid registered-mark symbols until and unless
   registration is granted.
 - [`../ip/trademark-filing-log.md`](../ip/trademark-filing-log.md) —
   public-safe filing docket (KR Class 09 / 42).
diff --git a/docs/quickstart.md b/docs/quickstart.md
index db25f25e0..cf8d1a998 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -103,9 +103,9 @@ No. v002.1 is the planned sparsity and speculative-decoding ramp on the
 same KV260 RTL line. The baseline v002.0 integration and evidence gates
 remain visible dependencies.

-### Does the 20 tok/s figure mean measured throughput?
+### Does the v002.1 throughput target mean measured throughput?

-No. It is a v002.1 target. The docs may discuss it as a target, but it
+No. It is a v002.1 release-line target. The docs may discuss it as a target, but it
 must not be phrased as achieved throughput until KV260 evidence lands in
 {doc}`Evidence/index`.
diff --git a/docs/roadmap.md b/docs/roadmap.md
index 5f5dc782c..9cd5bf97c 100644
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -43,7 +43,7 @@ Tracking issue: [pccxai/pccx#28 — v0.2.0 umbrella][v020].

 - same RTL repository (`pccx-FPGA-NPU-LLM-kv260`), continued from v002.0
 - G sparsity / H–H+ EAGLE-3 / I SSD / J Tree / K benchmark phases
-- 20 tok/s target lives on this release line
+- v002.1 throughput target lives on this release line
 - compute budget for EAGLE head training: $70–100 ($40 if a TRC TPU grant lands)

diff --git a/docs/v002/Models/gemma3n_attention_rope.rst b/docs/v002/Models/gemma3n_attention_rope.rst
index 7c2e10729..d1bcb7c3d 100644
--- a/docs/v002/Models/gemma3n_attention_rope.rst
+++ b/docs/v002/Models/gemma3n_attention_rope.rst
@@ -129,7 +129,7 @@ cycle.
 The two simplifications together remove **one CVO_SCALE** and **one
 CVO_TANH** per attention block per layer. Over the 35 layers of Gemma 3N
 E4B, that is 70 CVO invocations saved per decode step. Against
-the v002.1 throughput target (~20 tok/s; see :doc:`../../roadmap`), the
+the v002.1 throughput target (see :doc:`../../roadmap`), the
 SFU budget saves roughly 2–3 % wall-clock time.

 .. seealso::
diff --git a/docs/v002/Models/gemma3n_execution.rst b/docs/v002/Models/gemma3n_execution.rst
index 876545d04..3ba525d49 100644
--- a/docs/v002/Models/gemma3n_execution.rst
+++ b/docs/v002/Models/gemma3n_execution.rst
@@ -2,7 +2,7 @@
 Gemma 3N E4B on pccx v002 — Execution and Scheduling
 =================================================================

-This page explains *how* a single decode token of Gemma 3N E4B runs
+This page explains the single-token Gemma 3N E4B decode path
 end-to-end on pccx v002 — which tensor lives where, which instruction
 fires which core, and how the scheduler keeps all three compute engines
 busy.
@@ -214,7 +214,7 @@ Under the baseline configuration (W4A8 compute path, INT4 KV cache,

 .. note::

-   The 20 tok/s figure is the **v002.1** release-line target (sparsity
+   The v002.1 throughput target is a release-line target (sparsity
    + speculative decoding on top of the v002.0 baseline RTL). The v002.0
    release line is measured-only — no throughput figure is asserted until
    KV260 evidence is reported. See :doc:`../../roadmap`.
@@ -227,7 +227,7 @@ Under the baseline configuration (W4A8 compute path, INT4 KV cache,
      - Target
      - Source of bottleneck
    * - Decode throughput
-     - **20 tok/s** — v002.1 target
+     - v002.1 throughput target
      - GEMV bandwidth at 400 MHz × 4 lanes × 1024 MAC/clk.
    * - L2 activation bandwidth
      - **~1.6 GB/s**
diff --git a/docs/v002/Models/gemma3n_overview.rst b/docs/v002/Models/gemma3n_overview.rst
index 130605bd4..ae4c36121 100644
--- a/docs/v002/Models/gemma3n_overview.rst
+++ b/docs/v002/Models/gemma3n_overview.rst
@@ -3,7 +3,7 @@ Gemma 3N E4B — Overview
 ========================

 pccx v002 is sized for **Gemma 3N E4B** on a bare-metal Kria KV260. The
-20 tok/s decoding figure is the **v002.1** release-line target (sparsity
+v002.1 decoding target is a release-line target (sparsity
 + speculative decoding on top of the v002.0 baseline RTL); the v002.0
 release line is measured-only. See :doc:`../../roadmap` for the staged
 release split. Before diving into the operator-level pipeline, this
diff --git a/docs/v002/RTL/pccx-v002-literalinclude-migration.md b/docs/v002/RTL/pccx-v002-literalinclude-migration.md
index 26ce8019b..597cd9aa5 100644
--- a/docs/v002/RTL/pccx-v002-literalinclude-migration.md
+++ b/docs/v002/RTL/pccx-v002-literalinclude-migration.md
@@ -85,7 +85,7 @@ changes ship together so there is no half-state.
 - No `git push --force` or `--force-with-lease`.
 - No tags pushed.
 - No staging push.
-- No PCCX® claim, no registered-trademark claim, no private
+- No registered-mark claim, no private
   trademark filings exposed.
 - No hardware/runtime/timing/bitstream claim is made by this PR; it is
   documentation-only.
diff --git a/docs/v002/overview.rst b/docs/v002/overview.rst
index 94cdc35c8..2fce71ccb 100644
--- a/docs/v002/overview.rst
+++ b/docs/v002/overview.rst
@@ -67,7 +67,7 @@ throughput figure is asserted until KV260 evidence is reported.
      - Target
      - Rationale
    * - Decoding throughput
-     - **20 tok/s (Gemma 3N E4B)** — v002.1 target
+     - Gemma 3N E4B decoding — v002.1 throughput target
      - Bandwidth-matched between L2 cache and the GEMV cores
    * - Core clock frequency
      - **400 MHz**
diff --git a/docs/v003/gemma4-e4b-planning.md b/docs/v003/gemma4-e4b-planning.md
index a8d65f95e..61564ca5c 100644
--- a/docs/v003/gemma4-e4b-planning.md
+++ b/docs/v003/gemma4-e4b-planning.md
@@ -37,8 +37,8 @@ runtime, or driver exists today.

 - No measured tokens-per-second on KV260 or any other board for
   Gemma 4 E4B.
-- No bitstream, no timing-closed implementation.
-- No production-ready runtime.
+- No bitstream evidence and no implementation timing sign-off claim.
+- No production runtime claim.
 - No ABI stability.
 - No driver implementation.
 - No accuracy / quality benchmarks.
diff --git a/ko/docs/roadmap.md b/ko/docs/roadmap.md
index d1ed82a3b..6e0d5c79c 100644
--- a/ko/docs/roadmap.md
+++ b/ko/docs/roadmap.md
@@ -43,7 +43,7 @@ KV260 bring-up `[HW]` → 런타임 `[HW]` → 릴리스 증거 체크리스트

 - 동일 RTL 저장소 (`pccx-FPGA-NPU-LLM-kv260`) 에서 v002.0 의 후속
 - G sparsity / H–H+ EAGLE-3 / I SSD / J Tree / K benchmark 단계
-- 20 tok/s 목표는 이 릴리스 라인 위에 위치
+- v002.1 처리량 목표는 이 릴리스 라인 위에 위치
 - EAGLE head 학습용 컴퓨트 예산: $70–100 (TRC TPU grant 가 들어오면 $40)

diff --git a/ko/docs/v002/Models/gemma3n_attention_rope.rst b/ko/docs/v002/Models/gemma3n_attention_rope.rst
index 4c8d4a656..c93d984f5 100644
--- a/ko/docs/v002/Models/gemma3n_attention_rope.rst
+++ b/ko/docs/v002/Models/gemma3n_attention_rope.rst
@@ -124,7 +124,7 @@ Gemma 3N 은 어텐션 블록에서 두 가지 선택을 표준 Transformer 와

 두 단순화로 레이어당 어텐션 블록마다 **CVO_SCALE 1 개** + **CVO_TANH 1 개**
 가 줄어듭니다. Gemma 3N E4B 35 레이어 기준으로 토큰당 70 회의
-CVO 호출 절감. v002.1 처리량 목표 (~20 tok/s; :doc:`../../roadmap`
+CVO 호출 절감. v002.1 처리량 목표 (:doc:`../../roadmap`
 참고) 기준으로 SFU 예산에서 약 2–3 % 의 시간 이득입니다.

 .. seealso::
diff --git a/ko/docs/v002/Models/gemma3n_execution.rst b/ko/docs/v002/Models/gemma3n_execution.rst
index e902ebae4..71ff5a871 100644
--- a/ko/docs/v002/Models/gemma3n_execution.rst
+++ b/ko/docs/v002/Models/gemma3n_execution.rst
@@ -205,7 +205,7 @@ end-to-end 디코드 목표:

 .. note::

-   20 tok/s 수치는 **v002.1** 릴리스 라인의 목표 (v002.0 베이스라인
+   v002.1 처리량 목표는 릴리스 라인의 목표 (v002.0 베이스라인
    RTL 위에 sparsity + speculative decoding 적층) 입니다. v002.0 릴리스
    라인은 측정만 (measured-only) — KV260 보드 근거가 보고되기 전까지
    처리량 수치를 주장하지 않습니다.
@@ -219,7 +219,7 @@ end-to-end 디코드 목표:
      - 목표
      - 병목 원인
    * - 디코드 처리량
-     - **20 tok/s** — v002.1 목표
+     - v002.1 처리량 목표
      - 400 MHz × 4 레인 × 1024 MAC/clk 에서 GEMV 대역폭.
    * - L2 활성화 대역폭
      - **~1.6 GB/s**
diff --git a/ko/docs/v002/Models/gemma3n_overview.rst b/ko/docs/v002/Models/gemma3n_overview.rst
index 7ea61792a..5f918f433 100644
--- a/ko/docs/v002/Models/gemma3n_overview.rst
+++ b/ko/docs/v002/Models/gemma3n_overview.rst
@@ -3,7 +3,7 @@ Gemma 3N E4B — 개요
 ========================

 pccx v002 는 베어메탈 Kria KV260 에서 **Gemma 3N E4B** 를 돌리는 것을
-기준으로 설계되었습니다. 20 tok/s 디코딩 수치는 **v002.1** 릴리스
+기준으로 설계되었습니다. v002.1 디코딩 목표는 릴리스
 라인의 목표 (v002.0 베이스라인 RTL 위에 sparsity + speculative decoding
 적층) 입니다 — v002.0 릴리스 라인은 측정만 (measured-only) 입니다.
 단계별 릴리스 구분은 :doc:`../../roadmap` 참고. 연산자 수준
diff --git a/ko/docs/v002/overview.rst b/ko/docs/v002/overview.rst
index 5f6a4bafd..4b2139d4e 100644
--- a/ko/docs/v002/overview.rst
+++ b/ko/docs/v002/overview.rst
@@ -65,7 +65,7 @@
      - 목표
      - 근거
    * - 디코딩 처리량
-     - **20 tok/s (Gemma 3N E4B)** — v002.1 목표
+     - Gemma 3N E4B 디코딩 — v002.1 처리량 목표
      - L2 캐시 — GEMV 코어 사이 bandwidth 매칭
    * - 코어 동작 주파수
      - **400 MHz**