diff --git a/docs/glossary.md b/docs/glossary.md new file mode 100644 index 000000000..7c6e5501f --- /dev/null +++ b/docs/glossary.md @@ -0,0 +1,575 @@ +# Glossary + +This page defines project terms as they are used in the pccx public +documentation. It is intentionally conservative: planned work, throughput +targets, and board measurements are labelled as such. + +## Project And Release Lines + +pccx +: Parallel Compute Core eXecutor. A hardware-software co-design project for + NPU architectures targeting edge inference workloads. + +v001 +: Archived experimental pccx architecture line. It remains in the docs as + historical context and should not be treated as the active RTL target. + +v002 +: Active KV260 LLM architecture line. In this docs site, `v002` usually means + the public architecture, ISA, driver, RTL-reference, and verification pages + for the current `pccx-FPGA-NPU-LLM-kv260` line. + +v002.0 +: Baseline v002 integration line on KV260. Throughput language for this line is + measured-only until release evidence is published. + +v002.1 +: Planned continuation of v002 on the same RTL repository. The roadmap scopes + sparsity and speculative-decoding work to this line. The 20 tok/s number is a + target for this line, not a reported board result. + +v003.x +: Planned LLM continuation in a separate RTL repository. Public documentation + treats v003 as a future line until its repository and release branches are + stabilized. + +vision-v001 +: Parallel CNN inference track that reuses the KV260 substrate but targets + vision workloads rather than autoregressive LLM decoding. + +pccx-lab +: Companion verification and profiling environment for pccx traces, reports, + and workflow automation. Public claims derived from lab output still need the + release evidence gates described in the roadmap. + +pccx-llm-launcher +: Companion launcher repository for model preparation, runtime contracts, and + KV260-facing orchestration. 
Current public launcher pages describe scaffold, + mock, and contract surfaces unless they cite board evidence. + +## Hardware Target + +KV260 +: Xilinx Kria KV260 Vision AI Starter Kit, based on the Zynq UltraScale+ ZU5EV + device. It is the primary board target for v002 public documentation. + +`kv260` +: Lowercase slug used in repository names, branch names, build directories, or + scripts when a filesystem-safe target identifier is needed. + +Zynq UltraScale+ +: AMD/Xilinx SoC family that combines a Processing System and Programmable + Logic fabric. The KV260 target uses a ZU5EV part. + +PS +: Processing System. The Arm-based host side of the Zynq device. + +PL +: Programmable Logic. The FPGA fabric side where the pccx NPU RTL is + implemented. + +AXI +: Arm AMBA interconnect protocol family used for host, memory, and streaming + interfaces in the design. + +AXI-HP +: High-Performance AXI ports on the PS that give PL masters access to DDR. In + v002 docs these ports carry high-bandwidth weight traffic into the NPU. + +ACP +: Accelerator Coherency Port. In pccx docs, ACP refers to the coherent path + used for activation/result traffic between host memory and the accelerator. + +DSP48E2 +: Xilinx DSP slice available in UltraScale+ devices. pccx v002 uses DSP48E2 + packing for the W4A8 GEMM datapath. + +BRAM +: Block RAM in the FPGA fabric. pccx uses BRAM for smaller local buffers and + per-core storage structures. + +URAM +: UltraRAM in the FPGA fabric. pccx v002 uses URAM for the shared L2 cache and + weight buffering structures described in the architecture docs. + +CDC +: Clock-domain crossing. Used where data moves between the AXI/control clock + domain and the core compute clock domain. + +Vivado block design +: Xilinx Vivado IP-integrator design graph. In the v002.1 docs, a block-design + scaffold is build setup material, not proof that implementation or timing has + completed. + +bitstream +: FPGA configuration artifact produced after synthesis and implementation. 
+ Public pccx docs should call a bitstream deployable only when the matching + evidence page or release checklist links the build, timing, and board + artefacts. + +SD staging +: Packaging step that prepares files for booting or testing the KV260 from SD + media. It is a deploy-preparation step and does not by itself establish a + hardware run. + +## Data Types And Numeric Formats + +W4A8 +: Weight-4, Activation-8 quantization. In pccx v002 this means INT4 weights + multiplied by INT8 activations on the main integer compute path. + +W4A8KV4 +: Shorthand used for an evidence-gated Gemma 3N E4B target configuration: + W4A8 compute with 4-bit KV-cache storage. Treat it as a target configuration + label unless a page cites measured evidence. + +INT4 +: Signed 4-bit integer value, used for quantized weights in the W4A8 path. + +INT8 +: Signed 8-bit integer value, used for quantized activations in the W4A8 path. + +BF16 +: Brain floating point format with an 8-bit exponent and 7-bit mantissa. pccx + docs use BF16 for activation, KV-cache, or SFU paths where integer-only + arithmetic is not the intended representation. + +FP32 +: IEEE single-precision floating point. Public docs mention FP32 only where the + operation needs a higher-precision software or SFU-side representation. + +Precision promotion +: Conversion from the integer compute path to BF16 or FP32 for non-linear or + numerically sensitive operations such as softmax, RMSNorm, GELU, and RoPE. + +Sign recovery +: The correction step used when signed low-bit operands are packed into a + wider multiply datapath. In pccx docs the term is tied to W4A8 DSP packing, + not to model-level accuracy claims. + +Activation quantization +: Policy for converting activation values into the representation consumed by + the integer datapath. The v002.1 decision page names the default policy but + does not claim final model accuracy. + +`e_max` +: Maximum-exponent summary used by the v002.1 activation-scale policy. 
Public + docs describe it as a scale-selection mechanism, not as measured accuracy or + throughput evidence. + +BFP +: Block floating point. In the v002.1 activation policy, BFP refers to a shared + power-of-two activation scale for a block of values. + +symmetric INT8 +: Reviewed activation-scale mode that uses symmetric signed INT8 quantization. + The design-decision page keeps it as a mode under review rather than the + v002.1 default. + +constant-cache scale +: Driver-provided activation-scale table or constant path. It remains a + reviewed mode until the hardware/software interface and tests make it the + chosen default. + +`ACT_SCALE_POLICY` +: Public parameter handle for the v002.1 activation scaling policy. + +`ACT_SCALE_EMAX_BFP` +: Default v002.1 activation-scale mode named by the design-decision page: + `e_max` plus BFP power-of-two scaling. + +## Compute Blocks + +GEMM +: General Matrix-Matrix Multiply. In v002 it is the matrix core used mainly for + prefill and other matrix-heavy work. The architecture docs describe a 32 x 32 + systolic array for the KV260 configuration. + +GEMV +: General Matrix-Vector Multiply. In v002 it is the vector core used for + decode-dominant work where a new token repeatedly multiplies an activation + vector by streamed weights. + +CVO +: Complex Vector Op. ISA opcode family for non-linear vector operations and + reductions that execute on the SFU path. + +SFU +: Special Function Unit. The backend that executes CVO operations such as exp, + sqrt, GELU, sin/cos, reduce-sum, scale, and reciprocal. + +PE +: Processing Element. A compute cell in the systolic array or related datapath. + +Systolic array +: Regular grid of PEs that moves operands through a fixed pattern. In pccx v002 + public docs, this term usually refers to the GEMM array. + +Weight Stationary +: GEMM dataflow where a weight tile is loaded into the array and reused across + many activation steps. 
+ +Weight Streaming +: GEMV dataflow where weights stream through the vector datapath because each + weight is used once for the current token step. + +LUT +: Lookup table. In the FPGA sense, LUTs are logic resources. In the algorithmic + sense, pccx docs also use lookup tables for some dequantization or SFU helper + paths; read the local context. + +CORDIC +: Iterative coordinate-rotation method used for selected transcendental + functions. pccx docs mention CORDIC as part of the SFU implementation path. + +K-split +: Division of the reduction dimension into chunks. v002.1 docs discuss it with + drain cadence and accumulator bounds, not as a completed scheduler claim. + +drain cadence +: Frequency at which partial accumulators are drained from a K-split path. + The current v002.1 default is parameterized rather than hardwired into a + public performance claim. + +`K_DRAIN_LIMIT` +: Public parameter handle for the v002.1 K-split accumulator drain limit. The + documented default is `1024`. + +DSP accounting baseline +: Convention for reporting intended compute-core DSP usage separately from + implementation extras. Actual utilization still comes from synthesis reports. + +`DSP_BASELINE_GEMM` +: GEMM compute-core DSP baseline parameter. The v002.1 decision page sets it + to `1024` for the 32 x 32 PE grid. + +`DSP_BASELINE_GEMV` +: GEMV compute-core DSP baseline parameter. The v002.1 decision page sets it + to `64` for four 16-DSP vector lanes. + +`DSP_BASELINE_ALPHA` +: Accounting bucket for implementation extras outside the GEMM/GEMV baseline. + +## ISA And Runtime Terms + +ISA +: Instruction Set Architecture. pccx v002 uses a custom fixed-width 64-bit ISA + for compute, memory, and CVO instructions. + +VLIW +: Very Long Instruction Word. In pccx docs this describes the fixed-width + instruction format and explicit fields used by the NPU dispatcher. + +opcode +: Operation-code field in an instruction. 
The v002 ISA pages are the source of + truth for opcode values and instruction field layouts. + +GEMM instruction +: v002 ISA compute instruction that dispatches matrix-matrix work to the GEMM + backend. + +GEMV instruction +: v002 ISA compute instruction that dispatches matrix-vector work to the GEMV + backend. + +MEMCPY instruction +: v002 ISA memory movement instruction. See the ISA reference for supported + source and destination paths. + +MEMSET instruction +: v002 ISA instruction used to write shape or constant-table state rather than + to run arithmetic. + +CVO instruction +: v002 ISA instruction that dispatches an SFU function over a vector or + reduction operand. + +HAL +: Hardware Abstraction Layer. The C/C++ driver layer that wraps register, + memory, and instruction-dispatch details for host software. + +Sail +: ISA-specification language used by the pccx formal model. In pccx docs, Sail + models are used to check instruction semantics and field widths against the + intended ISA structure. + +launcher contract +: Data-only interface between the planned KV260 runtime path and launcher + software. A contract page describes shapes and guardrails; it is not board + execution evidence. + +readiness scaffold +: Typed placeholder or adapter surface that makes a future hardware path + reviewable before device access is implemented. + +AXI command/status shapes +: Launcher-side data structures for command and status exchange over the + future KV260 boundary. Shape validation is contract evidence, not a live + MMIO run. + +result streaming +: Runtime path for returning generated tokens or accelerator results. Public + docs should distinguish mock streams, serial test framing, and captured board + streams. + +serial TTY +: Character-device path used by launcher or lab tooling to exchange framed + records with a connected target. Tests that skip without a device are not + board evidence. + +TraceStream +: pccx-lab iterator contract for trace records. 
File replay and serial TTY + sources can share this surface while still having different evidence status. + +`KVFPGA_TTY` +: Environment or configuration path naming the serial device used by the KV260 + trace source. + +newline JSON framing +: Trace framing style where one JSON payload is carried per line between + begin/end markers. + +CRC +: Cyclic redundancy check. In pccx-lab trace framing docs it is used to detect + corrupted payloads; skipped bad frames should not be counted as valid + hardware evidence. + +sequence gap +: Missing trace-frame sequence number reported by the lab pipeline. It is a + diagnostic signal that the captured stream may be incomplete. + +## Memory And Model Terms + +L1 +: Local per-core memory or buffer close to a compute backend. + +L2 +: Shared on-chip cache in the v002 architecture. It is backed by URAM and is + shared by GEMM, GEMV, SFU, and memory-dispatch paths. + +Weight Buffer +: On-chip FIFO/buffer path for model weights arriving from external memory. + GEMM uses it for weight preload/reuse; GEMV uses it for streaming. + +KV cache +: Attention key/value storage retained across autoregressive decoding steps. + pccx docs distinguish KV-cache design targets from measured board capacity + or throughput claims. + +Attention Sink +: KV-cache policy term for retaining the first tokens of a prompt while using a + sliding local window for recent tokens. + +Local Window +: KV-cache policy term for the recent-token region retained during long-context + decoding. + +RoPE +: Rotary Position Embedding. pccx maps RoPE-related sine and cosine work to CVO + operations in the SFU path. + +RMSNorm +: Root Mean Square Layer Normalization. In pccx docs this is one of the + non-linear or reduction-heavy operations associated with the SFU path. + +Softmax +: Normalization used in attention. pccx docs map its exponential, reduction, + reciprocal, and scale steps to CVO/SFU operations. + +GELU +: Gaussian Error Linear Unit activation. 
pccx docs map GELU to the CVO/SFU + path. + +Gemma 3N E4B +: Target LLM family named in the v002 public docs. Claims about token rate or + board execution remain evidence-gated unless the page cites published + verification data. + +GemmaArchSpec +: Launcher-side configuration object for Gemma shape metadata and packed-size + checks. It is a spec-validation surface, not a model execution claim. + +W4 prep +: Launcher-side preparation of signed W4 packed weights and related metadata. + Current docs treat it as a software contract until hardware handoff evidence + lands. + +manifest metadata +: Structured metadata that records prepared weight shapes, scales, packed + sizes, or related handoff fields for the launcher path. + +tokenizer contract +: Offline tokenizer interface used by the launcher scaffold. Placeholder + fixtures do not claim real Gemma tokenizer assets. + +token streaming +: Movement of prompt or generated-token data across a runtime boundary. In the + current software-path docs, serial and mock streaming are scaffold evidence + until board captures are published. + +marker-wrapped chunks +: Token-transport records delimited by explicit markers, sometimes with length + prefixes. They define framing behavior rather than hardware throughput. + +mock orchestration +: End-to-end software path that joins prompt encode, W4 prep, mock command + polling, output receive, and decode without a real board run. + +AltUp +: Gemma-specific multi-stream state item named in v002.1 FAQ material. Its + effect on throughput or memory pressure still needs measured evidence before + public claims. + +LAuReL +: Gemma-specific mechanism named in model and FAQ pages. Public docs may + describe the mapping, but speedup or accuracy claims need evidence. + +PLE +: Per-Layer Embedding mechanism referenced by Gemma model docs. Treat + PLE-related scheduling text as design mapping unless an evidence page links a + measurement. 
+ +grouped-query attention +: Attention variant that shares key/value projections across query groups. + pccx docs discuss it as part of the Gemma mapping and KV-cache traffic + budget. + +cross-layer KV sharing +: Gemma-specific KV reuse pattern that affects cache residency and traffic. + Public docs should keep it separate from measured throughput claims. + +EAGLE-3 +: Speculative-decoding technique named in the v002.1 roadmap scope. In this + repo it is planned work, not a completed v002.0 feature. + +SSD +: Speculative-decoding roadmap item in the v002.1 scope. Expand or redefine + the acronym at the point of use when adding detailed public documentation. + +J Tree +: Roadmap shorthand associated with the v002.1 speculative-decoding stack. + Treat it as planned scope until a design page defines and verifies it. + +G sparsity +: Roadmap lane for v002.1 sparsity work. It should be described as ramp scope + until implementation and evidence pages say more. + +H/H+ +: Roadmap shorthand for EAGLE-3 speculative-decoding phases in the v002.1 + ramp. + +I SSD +: Roadmap shorthand for the SSD phase in the v002.1 ramp. + +K benchmark +: Roadmap shorthand for benchmark/evidence work after the v002.1 mechanism + lanes. Benchmarks become public claims only through the evidence gates. + +## Metrics And Evidence + +tok/s +: Tokens per second. pccx uses this as the primary user-visible decoding + throughput unit. + +TT +: Throughput target. This is planning shorthand for a target token rate, not a + measurement. Public pages should prefer spelling out "throughput target" on + first use. + +measured-only +: Documentation posture for the v002.0 release line: do not quote throughput, + timing closure, or board-run claims until the evidence checklist admits those + measurements. + +bring-up +: Hardware integration phase where the bitstream, board setup, host driver, + and smoke tests are made to run together. Bring-up logs are evidence inputs, + not automatically release claims. 
+ +release evidence +: Checklist-gated artifacts used to decide whether timing, throughput, or + board-execution statements are allowed in public docs. + +evidence inventory +: Public list of measured, reproducible artefacts and pending gates. It is the + place to check whether a value is measured, pending, or only a target. + +claim guard +: Review rule or scan that prevents public docs from turning targets, + scaffolds, mocks, or pending gates into completed hardware claims. + +pre-flight +: Preparatory state for build, launcher, or deploy work before the full command + sequence has been run and evidence has been captured. + +smoke capture +: Small board or tool run used to collect initial logs. It can support bring-up + evidence, but it does not replace release evidence for timing or throughput. + +timing report +: Vivado report used to justify timing wording. A docs page should not claim + timing closure without a linked report or release evidence entry. + +utilization report +: Vivado report used to justify FPGA resource wording such as DSP, LUT, BRAM, + or URAM counts. + +throughput target +: Planned token-rate goal. It must remain distinct from measured throughput in + public wording. + +board run +: Execution against a connected KV260 or other named target board. Mock tests, + type checks, and local software orchestration are not board runs. + +trace replay +: Analysis of an existing `.pccx` trace file through pccx-lab tooling. Replay + can validate analysis paths without proving new hardware execution. + +## Documentation And Release Terms + +spec resolution +: Reader step that separates architecture intent, model mapping, ISA source of + truth, and measured evidence before quoting a claim. + +runbook +: Step-by-step command record for a build, local docs check, deploy, or + hardware procedure. A runbook is procedure evidence only after the commands + and results are captured. 
+ +deploy runbook +: Documentation path for publishing the Sphinx site through GitHub Pages. A + deploy check proves publication, not hardware performance. + +release status +: Label such as draft, pre-release, latest release, or archived release used by + release notes. It should not be overloaded with hardware readiness. + +pre-release +: GitHub Release state for work that is published before being treated as a + final release. + +validation status +: Release-note field that records which checks passed, failed, or were not run. + It should name commands or CI runs where useful. + +known limitations +: Release-note section for caveats, missing evidence, or deferred capability. + +release checklist +: Maintainer checklist for release hygiene. For pccx ISA PDF changes, the + checklist includes rebuilding the PDF from `main.tex`. + +GitHub Pages deploy +: Publication workflow for the documentation site. A passing deploy does not + convert a target, mock, or pending gate into measured evidence. + +contributors acknowledgement +: Public recognition of people who contribute documentation, reviews, bug + reports, diagrams, examples, or related code after maintainers accept the + entry for publication. + +news section +: Placeholder area for future project updates, release announcements, and + community news. It should not carry release claims without the same evidence + gates as the rest of the docs. diff --git a/docs/index.rst b/docs/index.rst index e65147752..bc0285639 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -101,6 +101,8 @@ Working tracks for the next release lines: The :doc:`roadmap` summarises how the three tracks relate, and the ``pccx`` family-tree figure on that page links them visually. +The :doc:`glossary` defines project and v002 architecture terms used +across the public docs. The v001 architecture is archived at :doc:`archive/experimental_v001/index`. @@ -202,3 +204,4 @@ risks, keeping the ecosystem safe for open-source hardware development. 
v003/index vision-v001/index + glossary diff --git a/index.rst b/index.rst index e09d1b39e..8febfeab8 100644 --- a/index.rst +++ b/index.rst @@ -110,6 +110,7 @@ Tooling & Lab docs/index docs/quickstart + docs/glossary docs/Evidence/index docs/roadmap