# Implicitly Adaptive Refinement Model — Version V

A lightweight, iterative-refinement language model that learns to "fill in the blanks" starting from a fully masked sequence — now with an internal, learnable refinement gate. Works with any tokenizer that supplies `mask_token_id`, `pad_token_id`, and (optionally) `eos_token_id`.
Instead of left-to-right generation, the model treats text generation as a denoising process:

- Start with every token = `[MASK]`
- Run a small, shared transformer for ≤ K steps
- At each step, only re-predict tokens the model itself deems uncertain
- Freeze tokens once an `[EOS]` is sampled; stop early when < τ tokens change
The training objective is a masked-language-modeling loss with a time-dependent corruption schedule: `mask_rate(t) = 1 − t / K`.
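The linear schedule can be sketched in a few lines. `corrupt` below is a hypothetical helper for illustration, not the model's actual API:

```python
import random

def mask_rate(t: int, K: int) -> float:
    """Linear corruption schedule: 1.0 at t=0 (fully masked), 0.0 at t=K."""
    return 1.0 - t / K

def corrupt(tokens, t, K, mask_token_id, rng=random):
    """Independently replace each token with [MASK] with probability mask_rate(t)."""
    rate = mask_rate(t, K)
    return [mask_token_id if rng.random() < rate else tok for tok in tokens]

print(mask_rate(0, 6), mask_rate(3, 6), mask_rate(6, 6))  # 1.0 0.5 0.0
```

At `t=0` every position is corrupted, so training sees the fully masked start state that sampling begins from.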
In Version V, refinement decisions are made internally:

- When `use_refine_gate=True`, a lightweight gate head predicts a per-token refinement probability.
- Tokens are updated iff `refine_gate > 0.5` — no external entropy threshold needed.
- The gate is trained end-to-end and initialized to refine by default.
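A minimal NumPy sketch of such a gate head, assuming a single linear layer with a sigmoid (the shapes and names here are illustrative, not the real module):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, seq_len = 8, 5

# Hypothetical gate head: one linear layer -> sigmoid, one probability per token.
W = np.zeros(hidden_size)  # learned end-to-end in practice; zero here for illustration
b = 2.0                    # bias-init to +2.0: sigmoid(2.0) > 0.5, so refine by default

def refine_mask(hidden_states):
    """(seq_len, hidden_size) hidden states -> boolean mask of tokens to re-predict."""
    refine_gate = sigmoid(hidden_states @ W + b)
    return refine_gate > 0.5

h = np.random.default_rng(0).normal(size=(seq_len, hidden_size))
print(refine_mask(h))  # all True at init: the gate defaults to refining
```

The `+2.0` bias initialization is what makes "refine by default" concrete: until training pushes a token's gate logit below zero, that token keeps being re-predicted.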
## 🔍 Refinement Trajectory
```text
t=0: [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]
       ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑
t=1: [MASK] [ [5]] [ [5]] [ [5]] [ [5]] [ [6]] [ [5]] [ [5]] [[28]] [ [6]] [ [5]] [ [5]] [ [5]] [ [5]] [ [6]] [ [7]]
       ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑
t=2: [MASK] [ [9]] [ [7]] [ [5]] [ [5]] [[11]] [ [6]] [ [5]] [[11]] [ [9]] [ [8]] [ [5]] [ [7]] [ [7]] [ [9]] [[12]]
       ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑
t=3: [MASK] [ [5]] [ [7]] [ [9]] [ [5]] [[10]] [ [5]] [[14]] [ [8]] [ [8]] [ EOS] [ [5]] [ [5]] [ [5]] [ [6]] [[12]]
       ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑
t=4: [MASK] [ [5]] [ [9]] [ [5]] [[13]] [ [5]] [[13]] [ [5]] [ [8]] [[14]] [ EOS] [ [5]] [ [5]] [ [5]] [ [6]] [[12]]
       ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑      ↑

← change_ratio=0.0% → ✅ Early stop

Final: '<mask> [C] [O] [C] [Ring2] [C] [Ring2] [C] [Branch1] [Branch2] </s>'
```
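A trajectory like the one above is produced by a loop of the following shape. This is a toy sketch: `model_step`, the token ids, and the constants are placeholders, not the real implementation:

```python
MASK, EOS = 0, 1   # toy special-token ids (placeholders)
K = 6              # max_refinement_steps
STOP = 0.02        # stop_threshold

def refine(model_step, seq_len):
    """Start fully masked, re-predict gated tokens each step, freeze positions
    once they sample [EOS], stop early when < STOP of the tokens changed."""
    tokens = [MASK] * seq_len
    frozen = [False] * seq_len
    for t in range(K):
        preds, gate = model_step(tokens, t)  # per-token predictions + refine decisions
        changed = 0
        for i in range(seq_len):
            if frozen[i] or not gate[i]:
                continue
            if preds[i] != tokens[i]:
                tokens[i] = preds[i]
                changed += 1
            if tokens[i] == EOS:
                frozen[i] = True  # [EOS] freezes this position for good
        if changed / seq_len < STOP:
            break  # early stop, as in the change_ratio=0.0% line above
    return tokens

# A trivial "model" that always predicts token 5 converges in two steps.
constant = lambda toks, t: ([5] * len(toks), [True] * len(toks))
print(refine(constant, 4))  # [5, 5, 5, 5]
```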
| Component | Purpose | Key hyper-params |
|---|---|---|
| `TokenEmbedding` | learned input embeddings | `vocab_size`, `hidden_size` |
| `AdaptivePositionalEmbedding` | sinusoidal PE × learned per-position decay | `max_seq_len` |
| `TimeEmbedding` | scalar step → vector (1-layer MLP) | `hidden_size` |
| Self-condition projection | soft previous logits → residual input | optional |
| Transformer blocks | full self-attention (shared across steps) | `num_layers`, `num_heads`, `dropout` |
| Refinement Gate (V-only) | predicts per-token refine/no-refine | sigmoid head, bias-init to +2.0 |
| Teacher (EMA) | exponential moving average for stable uncertainty (used only when gate is off) | `ema_decay` |
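One plausible reading of the `AdaptivePositionalEmbedding` row — a fixed sinusoidal table scaled by a learned per-position decay — can be sketched as follows (the real parameterization may differ; the decay init is an assumption):

```python
import numpy as np

def sinusoidal_pe(max_seq_len, hidden_size):
    """Standard fixed sinusoidal table, shape (max_seq_len, hidden_size)."""
    pos = np.arange(max_seq_len)[:, None]
    i = np.arange(hidden_size)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / hidden_size)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

max_seq_len, hidden_size = 8, 16
pe = sinusoidal_pe(max_seq_len, hidden_size)
decay = np.ones((max_seq_len, 1))  # learned per-position scalar; init to 1.0 here (assumption)
adaptive_pe = pe * decay           # "sinusoidal PE × learned per-position decay"
```

The learned scalar lets the model attenuate positional information at positions where it is unhelpful, while keeping the fixed sinusoidal structure.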
Imadarem V supports two refinement strategies:

| Mode | Trigger | Controlled by |
|---|---|---|
| Uncertainty Threshold | entropy > `min_refine_uncertainty` | `use_refine_gate=False` |
| Internal Gate (default in V) | `refine_gate > 0.5` | `use_refine_gate=True` |

Both respect `[EOS]` freezing and early stopping via `stop_threshold`.
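The two modes reduce to a single dispatch over which per-token signal drives the update mask. A NumPy sketch, with names mirroring the table (the function itself is hypothetical):

```python
import numpy as np

def entropy(probs):
    """Per-token Shannon entropy over the vocab (last) axis."""
    return -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=-1)

def tokens_to_refine(probs, refine_gate, use_refine_gate, min_refine_uncertainty=0.1):
    if use_refine_gate:
        return refine_gate > 0.5                      # internal gate (V default)
    return entropy(probs) > min_refine_uncertainty    # external entropy threshold

probs = np.array([[0.99, 0.005, 0.005, 0.0],   # confident -> low entropy
                  [0.25, 0.25, 0.25, 0.25]])   # uniform   -> high entropy
gate = np.array([0.9, 0.2])

print(tokens_to_refine(probs, gate, use_refine_gate=False))  # refines only the uniform token
print(tokens_to_refine(probs, gate, use_refine_gate=True))   # refines only where gate > 0.5
```

Note that the two modes can disagree, as above: the gate learns its own policy rather than thresholding the model's predictive entropy.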
| Hyper-param | Meaning | Default |
|---|---|---|
| `max_refinement_steps` | hard cap on iterations | 6 |
| `sampling_temperature` | softmax temperature during sampling | 1.2 |
| `min_refine_uncertainty` | entropy threshold (gate mode ignores this) | 0.1 |
| `stop_threshold` | early stop when the fraction of changed tokens falls below this | 0.02 |
| `use_refine_gate` | enable internal learned gate | True |
Required special IDs (auto-detected):

```python
tokenizer.mask_token_id  # must exist
tokenizer.pad_token_id   # fallback: 0
tokenizer.eos_token_id   # fallback: sep_token_id, else None
```

A collision check is performed at model init.
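The fallback and collision logic can be sketched as a standalone helper (hypothetical; in the model itself this runs inside `__init__`):

```python
from types import SimpleNamespace

def resolve_special_ids(tokenizer):
    """Apply the documented fallbacks, then reject any id collision."""
    mask_id = tokenizer.mask_token_id
    if mask_id is None:
        raise ValueError("tokenizer must define mask_token_id")
    pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
    eos_id = tokenizer.eos_token_id
    if eos_id is None:
        eos_id = getattr(tokenizer, "sep_token_id", None)  # fallback: sep, else None
    ids = [i for i in (mask_id, pad_id, eos_id) if i is not None]
    if len(ids) != len(set(ids)):
        raise ValueError(f"special-token id collision: {ids}")
    return mask_id, pad_id, eos_id

tok = SimpleNamespace(mask_token_id=4, pad_token_id=None, eos_token_id=2)
print(resolve_special_ids(tok))  # (4, 0, 2)
```

Failing fast here matters because a `[MASK]`/`[PAD]` collision would silently corrupt both the training corruption step and `[EOS]` freezing at sampling time.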
```python
config = ImplicitRefinementConfig(
    vocab_size=100,
    hidden_size=64,
    num_layers=2,
    max_seq_len=8,
    max_refinement_steps=3,
    stop_threshold=0.05,
    diversity_weight=0.1,
    sampling_temperature=1.0,
    use_refine_gate=True,  # False switches to the entropy-threshold mode
)
model = ImplicitRefinementModel(config, tokenizer=tokenizer)
model.init_teacher()
```

## ✅ Pros
- Non-autoregressive → fully parallel sampling
- Learned refinement policy (no hand-tuned entropy thresholds)
- Early stopping enables variable-length outputs
- EMA teacher stabilizes uncertainty (when gate is off)
- Compatible with any subword or character tokenizer
## ❌ Cons

- Still a work in progress; evaluation is ongoing
- Output length capped by `max_seq_len`
- No explicit mechanism for long-range coverage or input conditioning (e.g., prompts)
- Ranger21 Optimizer:

```bibtex
@article{wright2021ranger21,
  title={Ranger21: a synergistic deep learning optimizer},
  author={Wright, Less and Demeure, Nestor},
  journal={arXiv preprint arXiv:2106.13731},
  year={2021},
}
```

Note: Imadarem V unifies refinement control inside the model, eliminating the need for external meta-policies. The internal gate is lightweight, end-to-end trainable, and simplifies deployment.