`GenerationParameters`

Every `Model.generate(...)` call is configured by a `GenerationParameters` value. Each family declares its own defaults via the family `Variant` protocol; callers either use them as-is, mutate individual fields, or construct their own.

The fields

| Field | Default | Honored today | Notes |
| --- | --- | --- | --- |
| `maxTokens: Int` | `256` | ✅ | Hard cap on generated tokens. |
| `stopOnEOS: Bool` | `true` | ✅ | Stop at the model's `eosTokenId`. |
| `extraStopTokens: Set<Int>` | `[]` | ✅ | Additional stop ids beyond EOS. |
| `prefillStepSize: Int` | `1024` | ✅ | Chunk size for `engine.forwardMulti(...)` during prefill. Per-family defaults are tuned (Gemma 4 = 4096, Mamba2 / FalconH1 / Granite4 / Jamba / NemotronH = 256, …); override only if you know the family. |
| `temperature: Float` | `0.6` | ✅ | `0` → greedy (GPU argmax fast path, no logits readback). `> 0` with no filters → GPU `softmax_categorical_sample` kernel, no logits readback. `> 0` with any filter (top-K / top-P / min-P / rep-penalty) → CPU sample (one logits readback per token, ~30% decode-tok/s tax). |
| `topP: Float` | `1.0` | ✅ | Nucleus cutoff. `1.0` = disabled. Forces CPU sample path when set. |
| `topK: Int` | `0` | ✅ | `0` = disabled. Forces CPU sample path when set. |
| `minP: Float` | `0.0` | ✅ | Qwen-style min-P cutoff: keep tokens with prob ≥ `min_p × max_prob`. Forces CPU sample path. |
| `repetitionPenalty: Float` | `1.0` | ✅ | Hugging Face convention: divide the logit by the penalty when the token has been seen and the logit is > 0; multiply when seen and the logit is < 0. `1.0` = disabled. Forces CPU sample path. |
| `presencePenalty: Float` | `0.0` | 🚧 planned | Additive penalty applied once a token id has been emitted at all. Field is on `GenerationParameters` today but the CPU sampler doesn't read it yet; the wiring lands alongside the remaining filtered-sampling GPU kernels. Tracked in `planning/plan.md`. |
| `seed: UInt64?` | `nil` | ✅ | When set, the CPU sample path is reproducible run-to-run (SplitMix64 PRNG seeded by this). Ignored on the greedy path (no RNG draw). |
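
The `repetitionPenalty` and `minP` rules in the table can be written out as plain functions. This is a minimal CPU sketch of the two conventions as described above, not the library's sampler; the function names `applyRepetitionPenalty` and `applyMinP` are hypothetical.

```swift
// Hypothetical helpers for illustration; not part of the documented API.

/// Hugging Face-style repetition penalty: for every token id already seen,
/// divide a positive logit by `penalty` and multiply a negative one.
func applyRepetitionPenalty(_ logits: [Float], seen: Set<Int>, penalty: Float) -> [Float] {
    guard penalty != 1.0 else { return logits }   // 1.0 = disabled
    var out = logits
    for id in seen where out.indices.contains(id) {
        out[id] = out[id] > 0 ? out[id] / penalty : out[id] * penalty
    }
    return out
}

/// Min-P: keep tokens whose probability is at least `minP * maxProb`;
/// everything else is zeroed so it can never be drawn.
func applyMinP(_ probs: [Float], minP: Float) -> [Float] {
    guard minP > 0, let maxProb = probs.max() else { return probs }
    let threshold = minP * maxProb
    return probs.map { $0 >= threshold ? $0 : 0 }
}

// Tokens 0 and 1 were already emitted; token 2 was not.
let penalized = applyRepetitionPenalty([2.0, -1.0, 0.5], seen: [0, 1], penalty: 2.0)
// penalized == [1.0, -2.0, 0.5]

// Threshold is 0.1 * 0.6 = 0.06, so only the 0.02 entry is dropped.
let kept = applyMinP([0.6, 0.3, 0.02, 0.08], minP: 0.1)
// kept == [0.6, 0.3, 0, 0.08]
```

Both rules push a seen or unlikely token further from selection regardless of the sign of its logit, which is why the penalty divides in one case and multiplies in the other.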

`generate` picks the cheapest path that produces correct output for the supplied parameters. Sampling is GPU-resident on the fast paths: `Ops.argmax` and `Ops.softmaxCategoricalSample` each write a single `uint32` (the chosen token id) on-device, so only 4 bytes cross from GPU to CPU per decode token and the logits never leave the GPU. The CPU sample path is the fallback for filters that don't yet have GPU kernels:

| Path | Triggered by | Cost per token |
| --- | --- | --- |
| `greedy-GPU` | `temperature == 0`, no filters | GPU `argmax` + 4-byte readback (fastest). |
| `gpu-categorical` | `temperature > 0`, no filters | Forward + `softmax_categorical_sample` GPU kernel + 4-byte readback. Two cmdbufs today (forward + sample); per-family fusion is a follow-up. |
| `cpu-sample` | Any of `topK > 0`, `topP < 1`, `minP > 0`, `repetitionPenalty != 1` | Forward + full vocab readback (~300 KB at Qwen 3 fp16, trivial on unified memory) + `Sampling.sample(...)` pipeline. |
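
The routing in the table can be read as a small decision function. The sketch below is an illustration under stated assumptions, not the library's dispatch code: `SamplingKnobs`, `SamplingPath`, and `selectPath` are hypothetical names. The table does not say what happens when `temperature == 0` and a filter is also set; the sketch assumes greedy wins in that case, following the `temperature` row of the field table.

```swift
// Stand-in for the sampling-related fields; not the real GenerationParameters.
struct SamplingKnobs {
    var temperature: Float = 0.6
    var topP: Float = 1.0
    var topK: Int = 0
    var minP: Float = 0.0
    var repetitionPenalty: Float = 1.0
}

enum SamplingPath: String {
    case greedyGPU = "greedy-GPU"
    case gpuCategorical = "gpu-categorical"
    case cpuSample = "cpu-sample"
}

func selectPath(_ p: SamplingKnobs) -> SamplingPath {
    // Assumption: temperature 0 always takes the argmax path.
    if p.temperature == 0 { return .greedyGPU }
    let hasFilter = p.topK > 0 || p.topP < 1 || p.minP > 0 || p.repetitionPenalty != 1
    return hasFilter ? .cpuSample : .gpuCategorical
}

// Readback size on the CPU path: one fp16 logit per vocabulary entry.
let vocabSize = 152_000
let bytesPerLogit = 2
let readbackBytes = vocabSize * bytesPerLogit   // 304_000 bytes, about 300 KB

let qwen3Default = SamplingKnobs(temperature: 0.6, topP: 0.95, topK: 20)
print(selectPath(qwen3Default).rawValue)              // cpu-sample
print(selectPath(SamplingKnobs()).rawValue)           // gpu-categorical
print(selectPath(SamplingKnobs(temperature: 0)).rawValue) // greedy-GPU
```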

The GPU categorical kernel (`softmax_categorical_sample`) runs a cooperative 256-thread reduction for max and sum-exp, then a single-threaded inverse-CDF walk. That walk is the bottleneck, at roughly 150 µs for vocab = 152K; a parallel prefix-scan version is the natural follow-up, alongside per-family `forwardSampleCategorical` fusion.
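
For readers unfamiliar with the technique, here is a single-threaded CPU reference of the same three steps (max, sum-exp, inverse-CDF walk), driven by a SplitMix64 generator so that a fixed seed reproduces the same draws. It is a sketch only: the names `SplitMix64`, `nextUnit`, and `sampleCategorical` are hypothetical, and where the real kernel applies temperature or how it derives its random number is an assumption.

```swift
import Foundation

/// SplitMix64: a small, fast 64-bit PRNG with a single word of state.
struct SplitMix64 {
    var state: UInt64
    init(seed: UInt64) { state = seed }

    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }

    /// Uniform Float in [0, 1), built from the top 24 bits.
    mutating func nextUnit() -> Float {
        Float(next() >> 40) / Float(1 << 24)
    }
}

func sampleCategorical(logits: [Float], temperature: Float, rng: inout SplitMix64) -> Int {
    precondition(!logits.isEmpty && temperature > 0)
    // Step 1: max logit, subtracted below for numerical stability.
    let maxLogit = logits.max()!
    // Step 2: unnormalized softmax weights and their sum.
    let weights = logits.map { exp(($0 - maxLogit) / temperature) }
    let total = weights.reduce(0, +)
    // Step 3: inverse-CDF walk. Scaling the target by `total` avoids
    // dividing every weight by the sum.
    let target = rng.nextUnit() * total
    var cumulative: Float = 0
    for (id, weight) in weights.enumerated() {
        cumulative += weight
        if target < cumulative { return id }
    }
    return logits.count - 1   // rounding can leave the target at the tail
}

var rng = SplitMix64(seed: 42)
let token = sampleCategorical(logits: [0.1, 0.2, 0.3, 0.4], temperature: 0.7, rng: &rng)
print(token)
```

Step 3 is inherently sequential in this form, since each cumulative sum depends on the previous one; that is the part a parallel prefix scan would replace.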

GPU top-K / top-P / min-P / rep-penalty kernels are deferred — they need a sort or radix-select, which is a substantial follow-up. Until those land, setting any filter falls back to the CPU sample path.

Family defaults

Each family's `Variant` protocol declares a static `defaultGenerationParameters: GenerationParameters` that captures the values that family ships with. The `Model` instance carries the resolved value as `model.defaultGenerationParameters`.

let model = try await Model.load("mlx-community/Qwen3-4B-4bit")

print(model.defaultGenerationParameters.topP)          // 0.95   (Qwen 3)
print(model.defaultGenerationParameters.topK)          // 20     (Qwen 3)
print(model.defaultGenerationParameters.prefillStepSize) // 1024

Current values (pulled from each family's `defaultGenerationParameters` literal in `Sources/FFAI/Models/`):

| Family | `maxTokens` | `prefillStepSize` | `temperature` | `topP` | `topK` | `repetitionPenalty` |
| --- | --- | --- | --- | --- | --- | --- |
| `LlamaDense` | 256 | 1024 | 0.6 | 1.0 | 0 | 1.0 |
| `MistralDense` | 256 | 1024 | 0.6 | 1.0 | 0 | 1.0 |
| `PhiDense` | 256 | 1024 | 0.6 | 1.0 | 0 | 1.0 |
| `Qwen2Dense` | 256 | 1024 | 0.6 | 1.0 | 0 | 1.0 |
| `Qwen3Dense` | 256 | 1024 | 0.6 | 0.95 | 20 | 1.0 |
| `Qwen35*` (Hybrid / MoE / GDN) | 256 | 1024 | 0.0 | 1.0 | 0 | 1.0 |
| `Gemma2Dense` | 256 | 1024 | 1.0 | 0.95 | 64 | 1.0 |
| `Gemma3Dense` | 256 | 1024 | 1.0 | 0.95 | 64 | 1.0 |
| `Gemma4*` (Dense / E / MoE) | 256 | 4096 | 1.0 | 0.95 | 64 | 1.0 |
| `GPTOSSText` | 256 | 2048 | 0.0 | 1.0 | 0 | 1.0 |
| `LFM2Dense` / `LFM2MoE` | 256 | 1024 | 0.0 | 1.0 | 0 | 1.0 |
| `NemotronHText` | 256 | 256 | 0.0 | 1.0 | 0 | 1.0 |
| `NemotronDiffusionText` | 256 | 1024 | 0.0 | 1.0 | 0 | 1.0 |
| `FalconH1Text` | 256 | 256 | 0.0 | 1.0 | 0 | 1.0 |
| `Granite4Text` | 256 | 256 | 0.0 | 1.0 | 0 | 1.0 |
| `JambaText` | 256 | 256 | 0.0 | 1.0 | 0 | 1.0 |
| `Mamba2Text` | 256 | 256 | 0.0 | 1.0 | 0 | 1.0 |
| `Idefics3` (VL) | 256 | 1024 | 0.0 | 1.0 | 0 | 1.0 |
| `Paligemma` (VL) | 256 | 512 | 0.0 | 1.0 | 0 | 1.0 |
| `SmolVLM2` (VL) | 256 | 1024 | 0.0 | 1.0 | 0 | 1.0 |

`minP` is `0.0` for every family today (a deliberate opt-in); CLI / API callers can pass it explicitly. The GPT-OSS `reasoningLevel` defaults to `.none`.

For any family not listed, read its variant struct in `Sources/FFAI/Models/Text/<F>Text.swift` (or the `Models/<F>.swift` root for VL orchestrators), or run `ffai inspect <repo>` to dump the resolved values. The Llama / Qwen 3 defaults match mlx-swift-lm's per-family `GenerationParameters` baseline and `defaultPrefillStepSize` for the same architectures.

Audio families don't ship `GenerationParameters` (they're not text generators); STT / TTS / VAD configuration lives on their own modality-specific options surface.

When porting a new family, declare its defaults; see `developing/adding-a-model.md` § Step 4: family defaults.
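
As a rough picture of what declaring family defaults looks like, the sketch below uses stand-in types. `SketchParameters`, `SketchVariant`, and `ExampleFamilyText` are hypothetical and only mirror the shape described in this section; the real protocol and struct may carry more requirements, so follow the adding-a-model guide for the actual steps.

```swift
// Stand-in types for illustration; the real declarations live in the library.
struct SketchParameters {
    var maxTokens = 256
    var prefillStepSize = 1024
    var temperature: Float = 0.6
    var topP: Float = 1.0
    var topK = 0
    var repetitionPenalty: Float = 1.0
}

protocol SketchVariant {
    static var defaultGenerationParameters: SketchParameters { get }
}

// Hypothetical new family: greedy by default, small prefill chunks.
// Fields that are not named keep the initializer defaults.
struct ExampleFamilyText: SketchVariant {
    static let defaultGenerationParameters = SketchParameters(
        prefillStepSize: 256,
        temperature: 0.0
    )
}

print(ExampleFamilyText.defaultGenerationParameters.prefillStepSize) // 256
print(ExampleFamilyText.defaultGenerationParameters.maxTokens)       // 256
```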

Three ways to call `generate`

1. Family default

let result = try await model.generate(prompt: "Once upon a time")

`parameters` defaults to `nil`, which falls back to `model.defaultGenerationParameters`.

2. Override one field

The `with(_:)` copy-mutator keeps the family-tuned baseline and edits a single knob:

let result = try await model.generate(
    prompt: "Once upon a time",
    parameters: model.defaultGenerationParameters.with { $0.maxTokens = 64 }
)

This is the recommended call shape — you don't lose the family-tuned sampling values just because you wanted a shorter generation.

3. Custom from scratch

let params = GenerationParameters(
    maxTokens: 1024,
    temperature: 0.0,        // greedy
    topP: 1.0,
    repetitionPenalty: 1.05
)
let result = try await model.generate(prompt: "...", parameters: params)

Any field you don't pass picks the `GenerationParameters.init` default, not the family default. Use `with(_:)` (#2) when you want the family-tuned baseline.
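
The difference between call shapes 2 and 3 is easiest to see side by side. The sketch below uses a stand-in struct, `SketchParameters`, with a `with(_:)` written the usual copy-and-mutate way; it is an assumption about how such a helper could look, not the library's implementation.

```swift
// Stand-in for GenerationParameters with only a few fields.
struct SketchParameters {
    var maxTokens = 256
    var temperature: Float = 0.6
    var topP: Float = 1.0
    var topK = 0

    /// Returns a copy with `mutate` applied; the receiver is left untouched.
    func with(_ mutate: (inout SketchParameters) -> Void) -> SketchParameters {
        var copy = self
        mutate(&copy)
        return copy
    }
}

// Pretend these are the Qwen 3 family defaults from the table above.
let familyDefault = SketchParameters(temperature: 0.6, topP: 0.95, topK: 20)

// Shape 2: the family's topP / topK survive the edit.
let shorter = familyDefault.with { $0.maxTokens = 64 }
print(shorter.maxTokens, shorter.topK)   // 64 20

// Shape 3: unspecified fields fall back to the initializer defaults.
let scratch = SketchParameters(maxTokens: 64)
print(scratch.maxTokens, scratch.topK)   // 64 0
```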

CLI behaviour

Each CLI flag overrides only its corresponding `GenerationParameters` field; every other knob still picks up the family default. Omit all sampling flags to use the family value:

# Family defaults: Qwen 3 ships temperature=0.6 / top-p=0.95 / top-k=20.
# Routes through the CPU sample path because top-p and top-k are active:
ffai --model mlx-community/Qwen3-1.7B-4bit --prompt "Hello"

# Greedy fast path (GPU argmax, 4-byte readback per token, no sampling):
ffai --model mlx-community/Qwen3-1.7B-4bit --prompt "Hello" \
     --temperature 0 --max-tokens 64

# Seeded non-greedy — reproducible run-to-run:
ffai --model mlx-community/Qwen3-1.7B-4bit --prompt "Hello" \
     --temperature 0.7 --top-p 0.9 --seed 42 --max-tokens 64

# Full sampling pipeline:
ffai --model mlx-community/Qwen3-1.7B-4bit --prompt "Hello" \
     --temperature 0.8 --top-k 40 --top-p 0.95 --min-p 0.05 \
     --repetition-penalty 1.05 --seed 12345

Available flags: `--temperature`, `--top-k`, `--top-p`, `--min-p`, `--repetition-penalty`, `--seed`, and `--max-tokens`.
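
One way to picture the "each flag overrides only its own field" rule is a set of optional overrides layered onto the family default. This is a sketch under assumptions: `SketchParameters`, `CLIOverrides`, and `resolve` are hypothetical names, and the actual CLI may parse and merge its flags differently.

```swift
// Stand-in for GenerationParameters with only a few fields.
struct SketchParameters {
    var maxTokens = 256
    var temperature: Float = 0.6
    var topP: Float = 1.0
    var topK = 0
    var seed: UInt64? = nil
}

// One optional per flag; nil means the flag was not passed.
struct CLIOverrides {
    var temperature: Float?
    var topP: Float?
    var topK: Int?
    var maxTokens: Int?
    var seed: UInt64?
}

func resolve(familyDefault: SketchParameters, overrides o: CLIOverrides) -> SketchParameters {
    var p = familyDefault
    if let v = o.temperature { p.temperature = v }
    if let v = o.topP { p.topP = v }
    if let v = o.topK { p.topK = v }
    if let v = o.maxTokens { p.maxTokens = v }
    if let v = o.seed { p.seed = v }
    return p
}

// Equivalent of: --temperature 0 --max-tokens 64 on a Qwen 3 model.
let qwen3 = SketchParameters(temperature: 0.6, topP: 0.95, topK: 20)
let resolved = resolve(familyDefault: qwen3,
                       overrides: CLIOverrides(temperature: 0, maxTokens: 64))
print(resolved.temperature, resolved.topK, resolved.maxTokens)   // 0.0 20 64
```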
