Skills:
.agents/skills/cosmos3-setup/SKILL.md·.agents/skills/cosmos3-inference/SKILL.md
A catch-all collection of frequently asked questions, tips, and troubleshooting for the Cosmos3 package. Can't find what you need? Check setup.md for installation issues or inference.md for inference details.
To add a new entry, append it under the most relevant section — or under Miscellaneous if nothing fits.
Q: I get ImportError: cannot import name '_functionalization' from 'torch._C' inside an NGC container
Clear the library path before running anything:
export LD_LIBRARY_PATH=''
This is needed because the NGC PyTorch container ships its own libraries that conflict with the venv-installed versions. See setup.md#pytorch-import-issue.
Make sure you installed the package:
uv sync --all-extras --group=cu130
source .venv/bin/activate
If already installed, try --reinstall to force a clean state.
CUDA 13.0 is recommended. CUDA 12.8 is also supported. The major version must match between your system CUDA and the installed PyTorch wheels. Check with:
nvidia-smi # system CUDA
python -c "import torch; print(torch.version.cuda)" # PyTorch CUDA
Q: I get a CUDA / NVIDIA driver mismatch error (e.g. CUDA error: no kernel image is available for execution on the device, libcudart.so.* cannot open, The NVIDIA driver on your system is too old)
The installed PyTorch CUDA wheels do not match your system's NVIDIA driver. Check the driver's reported CUDA version with nvidia-smi, then delete .venv/ and uv sync against the matching group (cu130-train if the driver supports CUDA 13.x; cu128-train if it supports CUDA 12.8):
nvidia-smi # check the "CUDA Version" field (top right)
rm -rf .venv
uv sync --all-extras --group=cu130-train --reinstall # or --group=cu128-train
source .venv/bin/activate && export LD_LIBRARY_PATH=
Use the inference-only groups (cu130 / cu128) instead if you don't need the training-only dependencies.
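The group choice described above can be sketched as a small helper. This is an illustration only: `pick_uv_group` is a hypothetical name, not part of the package, and the thresholds simply restate the text (13.x maps to cu130, 12.8 or newer 12.x maps to cu128).

```python
def pick_uv_group(driver_cuda_version: str, training: bool = True) -> str:
    """Map the "CUDA Version" shown by nvidia-smi to a uv dependency group.

    Hypothetical helper: the thresholds mirror the FAQ text, nothing more.
    """
    major, _, minor = driver_cuda_version.strip().partition(".")
    version = (int(major), int(minor or 0))
    if version >= (13, 0):
        base = "cu130"
    elif version >= (12, 8):
        base = "cu128"
    else:
        raise ValueError(f"driver CUDA {driver_cuda_version} is older than 12.8")
    # Training groups carry the "-train" suffix; inference-only groups do not.
    return f"{base}-train" if training else base
```

For example, a driver reporting CUDA 12.8 with `training=False` resolves to `cu128`, which is the value to pass to `uv sync --group=...`.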
Checkpoints are downloaded automatically from Hugging Face during inference. You need:
- A Hugging Face token with Read permission
- Accepted NVIDIA Open Model License Agreement
- HF_TOKEN environment variable set, or uvx hf auth login
Control the download location with HF_HOME (default: ~/.cache/huggingface). If downloads fail, the commands are printed to the console — run them manually to debug. See setup.md#downloading-base-checkpoints.
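A quick pre-flight check of these settings might look like the sketch below. `hf_download_settings` is a hypothetical helper. It only inspects environment variables, so a token stored by `uvx hf auth login` would not be detected by it.

```python
import os
from pathlib import Path


def hf_download_settings(env=None):
    """Report whether HF_TOKEN is set and where checkpoints would be cached.

    Hypothetical helper: reads environment variables only and does not
    look for a token saved by an interactive login.
    """
    env = os.environ if env is None else env
    # HF_HOME falls back to the default cache location named in the text.
    hf_home = Path(env.get("HF_HOME") or "~/.cache/huggingface").expanduser()
    return {"token_set": bool(env.get("HF_TOKEN")), "hf_home": hf_home}
```

Calling `hf_download_settings()` with no arguments inspects the current process environment.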
Reinstall uv and the venv from scratch:
curl -LsSf https://astral.sh/uv/install.sh | sh
uv python install --reinstall
rm -rf .venv
uv sync --all-extras --group=cu130 --reinstall
source .venv/bin/activate
Plan for ~150 GiB free before the first run. A successful first-run inference or training workflow typically consumes:
- Hugging Face cache ($HF_HOME, default ~/.cache/huggingface): ~90 GiB for base checkpoints (e.g. Cosmos3-Nano, Wan2.2 VAE), tokenizers, and any dataset snapshots pulled by training recipes.
- uv cache ($UV_CACHE_DIR, default ~/.cache/uv): ~20 GiB for wheels for torch/CUDA dependencies across the install groups (cu130-train, cu128-train, etc.).
- Run outputs ($IMAGINAIRE_OUTPUT_ROOT for training, or your -o output dir for inference): ~30 GiB per run for config snapshots, DCP checkpoints saved every save_freq iterations, callback outputs, and optional wandb files.
Actual sizes scale with the model tier (Cosmos3-Super is larger than Cosmos3-Nano), the dataset, and how many checkpoints you keep. To relocate any of these off the system disk, set the corresponding env var before installation/run (e.g. export HF_HOME=/data/hf, export UV_CACHE_DIR=/data/uv, export IMAGINAIRE_OUTPUT_ROOT=/data/cosmos-runs).
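To see whether the current disks can absorb those budgets, a rough check along these lines may help. It is a sketch under assumptions: `check_disk` and `nearest_existing` are hypothetical helpers, the GiB figures are the approximate numbers quoted above, and the current directory stands in for the run-output location when IMAGINAIRE_OUTPUT_ROOT is unset (the real default is not stated here).

```python
import os
import shutil
from pathlib import Path

GIB = 1024 ** 3
# Approximate first-run budgets from the list above, in GiB.
BUDGET_GIB = {"HF_HOME": 90, "UV_CACHE_DIR": 20, "IMAGINAIRE_OUTPUT_ROOT": 30}
DEFAULTS = {"HF_HOME": "~/.cache/huggingface", "UV_CACHE_DIR": "~/.cache/uv"}


def nearest_existing(path: Path) -> Path:
    """Walk up to the closest directory that exists so it can be inspected."""
    path = path.expanduser().resolve()
    while not path.exists():
        path = path.parent
    return path


def check_disk(env=None, output_default="."):
    """Group the budgets by filesystem and compare against free space."""
    env = os.environ if env is None else env
    per_device = {}
    for var, need in BUDGET_GIB.items():
        # output_default is an assumed stand-in for the run-output directory.
        target = nearest_existing(Path(env.get(var) or DEFAULTS.get(var, output_default)))
        device = os.stat(target).st_dev
        entry = per_device.setdefault(
            device,
            {"vars": [], "need_gib": 0, "free_gib": shutil.disk_usage(target).free / GIB},
        )
        entry["vars"].append(var)
        entry["need_gib"] += need
    return list(per_device.values())
```

Directories that share a filesystem are summed, so a single-disk machine shows one entry needing about 140 GiB, in line with the ~150 GiB guidance.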
Per-modality defaults live in JSON files under cosmos_framework/inference/defaults/<mode>/sample_args.json:
| Mode | Default file |
|---|---|
| text2image | cosmos_framework/inference/defaults/text2image/sample_args.json |
| text2video | cosmos_framework/inference/defaults/text2video/sample_args.json |
| image2video | cosmos_framework/inference/defaults/image2video/sample_args.json |
Action and image/video-to-video modes have parallel files under cosmos_framework/inference/defaults/{image2image,video2video,forward_dynamics,inverse_dynamics,policy}/sample_args.json.
See AGENTS.md for the full config defaults chain.
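The path pattern above is regular enough to build programmatically. The sketch below assumes it is run from the repository root; `defaults_file` and `load_defaults` are hypothetical helpers, and the mode names are the ones listed in this section.

```python
import json
from pathlib import Path

# Mode names as listed in this section.
MODES = (
    "text2image", "text2video", "image2video", "image2image",
    "video2video", "forward_dynamics", "inverse_dynamics", "policy",
)
DEFAULTS_ROOT = Path("cosmos_framework/inference/defaults")


def defaults_file(mode: str) -> Path:
    """Return the built-in defaults path for a mode (relative to the repo root)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    return DEFAULTS_ROOT / mode / "sample_args.json"


def load_defaults(mode: str, repo_root: str = ".") -> dict:
    """Read the defaults JSON for a mode; assumes the file exists on disk."""
    return json.loads((Path(repo_root) / defaults_file(mode)).read_text())
```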
From most temporary to most permanent:
- CLI flag: --shift 5.0 (per-run, applies to all samples)
- Sample argument file: set the field in your input JSON (per-sample)
- Custom defaults file: pass "defaults_file": "my_defaults.json" in your sample argument file (see inference.md#custom-defaults)
- Built-in default: edit cosmos_framework/inference/defaults/<mode>/sample_args.json (permanent change)
Fields set in the sample argument file take precedence over defaults. CLI flags override both.
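The precedence rule can be pictured as a layered dictionary merge. This is a sketch of the stated ordering, not the package's actual resolution code: `resolve_sample_args` is a hypothetical name, and treating a custom defaults file as a full replacement for the built-in defaults is an assumption.

```python
def resolve_sample_args(builtin, custom_defaults=None, sample=None, cli=None):
    """Merge argument layers: defaults, then the sample file, then CLI flags.

    Assumption: a custom defaults file is used instead of the built-in
    defaults rather than layered on top of them.
    """
    defaults = builtin if custom_defaults is None else custom_defaults
    merged = dict(defaults)
    for layer in (sample, cli):
        if layer:
            # Unset (None) values do not override lower layers.
            merged.update({k: v for k, v in layer.items() if v is not None})
    return merged
```

With built-in defaults `{"shift": 10.0, "num_frames": 189}`, a sample file setting `num_frames` to 1 and a CLI `--shift 5.0`, the resolved values are `shift=5.0` and `num_frames=1`.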
shift controls the time-shift in the UniPC diffusion sampler. Higher values produce more detail but can introduce artifacts. Recommended values:
| Model | Recommended shift |
|---|---|
| Cosmos3-Nano (8B) | 10.0 (default) |
| Cosmos3-Super (32B) | 5.0 |
- Add the field to SamplingArgs and SamplingOverrides in cosmos_framework/inference/args.py
- Add its default to each cosmos_framework/inference/defaults/<mode>/sample_args.json
- Wire it through OmniSampleOverrides.build_sample() in cosmos_framework/inference/args.py
It lets you supply a custom JSON file of default values instead of the built-in presets. The format is the same as the files in cosmos_framework/inference/defaults/. Fields in your sample argument file still take precedence over the custom defaults. See inference.md#custom-defaults.
| Model | GPU Memory |
|---|---|
| Cosmos3-Nano (8B) | 32 GB |
| Cosmos3-Super (32B) | 128 GB |
Try these in order:
- Reduce allocator fragmentation, usually the cheapest fix:
  export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
- Increase --dp-shard-size to shard model weights across more GPUs via FSDP. Inference auto-picks a value that fits the model at ~75% device memory (see _get_dp_shard_size in cosmos_framework/inference/args.py); passing a larger explicit value drops per-GPU memory at the cost of more all-gather traffic. Requires multi-GPU.
- Lower --device-memory-utilization (default 0.75). The automatic dp_shard_size formula is ceil(model_memory / device_memory / utilization), so passing e.g. --device-memory-utilization=0.5 forces auto-mode to pick a larger dp_shard_size and leaves more per-GPU headroom for activations / KV cache. Requires multi-GPU.
- Add --offload-guardrail-models to move the text and video guardrail models to CPU. This frees the GPU memory they would otherwise hold for the full run, at the cost of some extra latency when guardrails are invoked.
See inference.md#torch-cuda-out-of-memory-error for the full troubleshooting section.
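The automatic shard-size formula quoted above is easy to try out by hand. The sketch below implements only that formula; the real _get_dp_shard_size may apply further constraints, and the memory figures in the usage note are made-up illustrative values.

```python
import math


def auto_dp_shard_size(model_memory_gib, device_memory_gib, utilization=0.75):
    """ceil(model_memory / device_memory / utilization), as stated in the text."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return math.ceil(model_memory_gib / device_memory_gib / utilization)
```

For a hypothetical 100 GiB model on 80 GiB devices, the default utilization of 0.75 gives a shard size of 2, while lowering it to 0.5 gives 3, which is how the flag buys per-GPU headroom.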
| Preset | What it does | When to use |
|---|---|---|
| latency | Spreads each sample across all GPUs | Interactive / real-time use |
| throughput | One sample per GPU in parallel | Large batch jobs |
Use a text2image input file:
python -m cosmos_framework.scripts.inference -i inputs/omni/t2i.json -o outputs/ --checkpoint-path Cosmos3-Nano
The modality is determined by the input JSON (num_frames=1 for images), not by a separate flag. See inputs/omni/t2i.json for the format.
Depends on resolution:
| Resolution | Max frames |
|---|---|
| 256p | 400 |
| 480p | 300 |
| 720p | 200 |
Default is 189 frames at 24 FPS (~7.9 seconds).
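The frame limits and the duration arithmetic can be checked with a few lines. This is a sketch: the resolution keys and helper names are hypothetical, and only the numbers come from the table above.

```python
# Maximum frame counts per resolution, copied from the table above.
MAX_FRAMES = {"256p": 400, "480p": 300, "720p": 200}


def clip_seconds(num_frames: int, fps: int = 24) -> float:
    """Duration of a clip in seconds at the given frame rate."""
    return num_frames / fps


def check_frames(resolution: str, num_frames: int) -> None:
    """Raise if num_frames exceeds the documented limit for the resolution."""
    limit = MAX_FRAMES[resolution]
    if num_frames > limit:
        raise ValueError(f"{num_frames} frames exceeds the {limit}-frame limit at {resolution}")
```

The default of 189 frames at 24 FPS works out to 189 / 24 = 7.875 seconds, which the text rounds to ~7.9.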
Provide a vision_path pointing to an image (.jpg, .jpeg, .png) or a URL. See inputs/omni/i2v.json for the format.
Install serve dependencies and start the server:
uv pip install -e ".[serve]"
python -m cosmos_framework.inference.ray.serve --parallelism-preset=latency -o outputs/ray_serve --checkpoint-path Cosmos3-Nano
Then submit requests via curl, the submit CLI, or the Gradio UI.
Either pass --no-use-torch-compile on the command line, or delete the TorchInductor cache under /tmp:
rm -rf /tmp/torchinductor_*
torchrun defaults its rendezvous to port 29500. The error means that port is already taken on the node — usually because another torchrun job (yours or someone else's on a shared node) is still using it.
Pass a different free port with --master-port, placed before -m (it is a torchrun argument, not an inference argument):
torchrun --nproc-per-node=8 --master-port=29501 -m cosmos_framework.scripts.inference \
--parallelism-preset=throughput \
-i "inputs/omni/t2i.json" \
-o outputs/omni_t2i \
--checkpoint-path Cosmos3-Super-Text2Image \
--seed=0
Any free port works (e.g. 29501, 29510); give each concurrent job on the same node a distinct port. Alternatively, --rdzv-endpoint=localhost:0 lets torchrun auto-pick a free port.
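If you would rather pick the port yourself, one common approach is to ask the operating system for an unused one. The helper below is a sketch with a hypothetical name; note the port is released before torchrun starts, so another process could in principle grab it in between.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() returns (address, port); the OS filled in the port.
        return sock.getsockname()[1]
```

The returned number can then be passed as the value of --master-port.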
Knobs are in the recipe TOML under [model], [model.parallelism], and [dataloader_train]. Try in order:
- Reduce allocator fragmentation, usually the cheapest fix:
  export PYTORCH_ALLOC_CONF=expandable_segments:True
  (The _super launch shells already export this; see examples/launch_sft_*_super.sh.)
- Enable activation checkpointing in [model.activation_checkpointing]:
  - mode = "full": checkpoint every transformer block (largest memory savings, trades extra recompute for memory).
  - mode = "selective": per-op SAC, MoT only (smaller savings, smaller overhead). Falls back to no checkpointing on the VLM path.
- Raise [model.parallelism].data_parallel_shard_degree to shard weights/optimizer state across more ranks via FSDP. Runtime invariant (from cosmos_framework/utils/vfm/parallelism.py:50-52): data_parallel_replicate_degree × data_parallel_shard_degree == WORLD_SIZE always holds; context_parallel_shard_degree and cfg_parallel_shard_degree are overlay axes that share dp rank slots, not separate mesh dims. Use -1 to let data_parallel_shard_degree auto-fill from the torchrun world size.
- Raise [model.parallelism].context_parallel_shard_degree to split the sequence dimension across ranks. Helpful when activations (not weights) drive the OOM, for example with long videos or high resolution.
- Lower [dataloader_train].max_samples_per_batch to cap samples per micro-batch. None lets the packer's token budget decide; setting an explicit small number trades throughput for headroom.
- Enable LoRA on a Cosmos3-Nano recipe. Nano recipes are full-finetune by default (lora_enabled = false); setting [model].lora_enabled = true trains low-rank adapters instead of the full weights, dropping optimizer-state memory substantially. The _super recipes (e.g. vision_sft_super) are already LoRA-only, so this lever doesn't apply there.
See docs/training.md for the full SFT setup and TOML reference ([model.activation_checkpointing], [model.parallelism], [dataloader_train] sections).
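The data-parallel invariant mentioned above can be sanity-checked before launching a job. The function below is an illustrative sketch, not the code in parallelism.py: the name is hypothetical, and defaulting the replicate degree to 1 is an assumption.

```python
def resolve_dp_degrees(world_size: int, replicate: int = 1, shard: int = -1):
    """Resolve (replicate, shard) so that replicate * shard == world_size.

    A shard degree of -1 is auto-filled from the world size, mirroring
    the behaviour described in the text.
    """
    if shard == -1:
        if world_size % replicate != 0:
            raise ValueError(f"world size {world_size} is not divisible by replicate degree {replicate}")
        shard = world_size // replicate
    if replicate * shard != world_size:
        raise ValueError(f"{replicate} x {shard} != WORLD_SIZE ({world_size})")
    return replicate, shard
```

On an 8-GPU node, the auto-filled case resolves to `(1, 8)`, and a replicate degree of 2 resolves to `(2, 4)`; combinations such as `(2, 2)` are rejected because they do not multiply out to 8.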
Always pass --seed when comparing runs. Without it, a random seed is used each time.
Short prompts produce worse results. Use the built-in prompt upsampler with a vLLM-served Qwen3 model:
python -m cosmos_framework.scripts.upsample_prompts -i "inputs/omni/*.json" -o outputs/upsample_prompts
The inference script automatically skips samples whose output files already exist. If a run is interrupted, re-run the same command to resume.
python -m cosmos_framework.scripts.inference -i "inputs/omni/*.json" -o outputs/ --checkpoint-path Cosmos3-Nano --seed=0
All available flags and their current defaults:
python -m cosmos_framework.scripts.inference --help
This section is a catch-all for tips that don't fit elsewhere. Add new entries freely.
They illustrate how the inference logic works under the hood — examples/inference.py shows the low-level model API and examples/inference_pipeline.py shows the pipeline API. For production use, prefer python -m cosmos_framework.scripts.inference.
cosmos_framework/inference/ray/configs/latency.yaml and cosmos_framework/inference/ray/configs/throughput.yaml. These configure the Ray Serve deployment with different parallelism strategies.