-
With 120GB the original `device_batch_size=32` should work, I think, because it's per device? On a single 4090 (24GB) I'm training with `device_batch_size=4`, and it's using 17 / 24 GB. First I tried with torchrun and …
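For anyone in the same single-GPU situation, the override is just a flag on the training script; a minimal sketch (the exact batch size that fits will depend on your card):

```bash
# single GPU, smaller per-device batch (e.g. 4 on a 24GB 4090)
torchrun --standalone --nproc_per_node=1 -m scripts.base_train -- --depth=20 --device_batch_size=4
```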
-
No, but I'm trying to train on 2 x NVIDIA RTX PRO 6000s with the same batch size as in the script (32); it takes about 20h for the pre-training step to finish (still training).
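For reference, the stock speedrun command already fans out over every visible GPU via `--nproc_per_node=gpu`; pinning it to the two cards explicitly would look roughly like this (a sketch, not the poster's exact invocation):

```bash
# two-GPU data-parallel launch with the default per-device batch size of 32
torchrun --standalone --nproc_per_node=2 -m scripts.base_train -- --depth=20
```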
-
I'm seeing pretty abysmal tok/sec on my DGX Spark. I haven't tuned anything tbh, just upgraded torch to 2.9 and set …
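Before tuning anything, it's worth confirming which wheel actually ended up in the venv, since a CPU or mismatched-CUDA build will also tank tok/sec; a quick check:

```bash
# confirm the torch build, its CUDA version, and that the GB10 is visible
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"
```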
-
So, does anyone have a duration estimate for training nanochat (either the $100 version or the $1000 version) on the DGX Spark?
-
I have managed to get it running on my DGX Spark just now natively (without resorting to a docker container), but it took a couple of hours of trial and error, going through documentation and forums, and consulting with GPT-5 to figure out all the steps necessary... This is the approach that worked for me:

**Clone repo and make modifications**

Get the repo and change into the project directory.

**Update requirements and switch to CUDA 13.0**

I found it necessary to increase the dependency requirements for torch to 2.9.0 and for triton to 3.5.0, and to switch from CUDA 12.8 to 13.0. To do that, you need to change `pyproject.toml` as follows:

```toml
[project]
name = "nanochat"
version = "0.1.0"
description = "the minimal full-stack ChatGPT clone"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
"datasets>=4.0.0",
"fastapi>=0.117.1",
"files-to-prompt>=0.6",
"numpy==1.26.4",
"psutil>=7.1.0",
"regex>=2025.9.1",
"setuptools>=80.9.0",
"tiktoken>=0.11.0",
"tokenizers>=0.22.0",
"torch>=2.9.0",
"triton>=3.5.0",
"uvicorn>=0.36.0",
"wandb>=0.21.3",
]
[build-system]
requires = ["maturin>=1.7,<2.0"]
build-backend = "maturin"
[tool.maturin]
module-name = "rustbpe"
bindings = "pyo3"
python-source = "."
manifest-path = "rustbpe/Cargo.toml"
[dependency-groups]
dev = [
"maturin>=1.9.4",
"pytest>=8.0.0",
]
[tool.pytest.ini_options]
markers = [
"slow: marks tests as slow (deselect with '-m \"not slow\"')",
]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
# target torch to cuda 13.0 or CPU
[tool.uv.sources]
torch = [
{ index = "pytorch-cpu", extra = "cpu" },
{ index = "pytorch-cu130", extra = "gpu" },
]
[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true
[[tool.uv.index]]
name = "pytorch-cu130"
url = "https://download.pytorch.org/whl/cu130"
explicit = true
[project.optional-dependencies]
cpu = [
"torch>=2.9.0",
]
gpu = [
"torch>=2.9.0",
]
[tool.uv]
conflicts = [
[
{ extra = "cpu" },
{ extra = "gpu" },
],
]
```

**Install UV, install repo dependencies, activate venv**

Now you're ready to continue following the installation instructions for nanochat as per #1, in particular the following steps to install UV, install all repo dependencies, and activate the venv:

```bash
# install uv (if not already installed)
command -v uv &> /dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
# create a .venv local virtual environment (if it doesn't exist)
[ -d ".venv" ] || uv venv
# install the repo dependencies
uv sync
# activate venv so that `python` uses the project's venv instead of system python
source .venv/bin/activate
```

**Build and train the tokenizer**

You can continue to follow the instructions to build the tokenizer:

```bash
# Install Rust / Cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
# Build the rustbpe Tokenizer
uv run maturin develop --release --manifest-path rustbpe/Cargo.toml
```

To download the training dataset:

```bash
python -m nanochat.dataset -n 240
```

And to train the tokenizer and evaluate it:

```bash
python -m scripts.tok_train --max_chars=2000000000
python -m scripts.tok_eval
```

If you haven't already done so previously, you should also download the eval bundle at this time:

```bash
curl -L -o eval_bundle.zip https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip
unzip -q eval_bundle.zip
rm eval_bundle.zip
mv eval_bundle "$HOME/.cache/nanochat"
```

**Install CUDA 13.0.2**

The next step in the nanochat instructions would be to run pre-training, but that step will fail, because the default `ptxas` does not support the DGX Spark's GB10 GPU. At this time, you need to go to the NVIDIA Developer website and install CUDA 13.0.2 manually by following the steps here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=arm64-sbsa&Compilation=Native&Distribution=Ubuntu&target_version=24.04&target_type=deb_local

In particular, this was the sequence that worked for me on the DGX Spark:

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/13.0.2/local_installers/cuda-repo-ubuntu2404-13-0-local_13.0.2-580.95.05-1_arm64.deb
sudo dpkg -i cuda-repo-ubuntu2404-13-0-local_13.0.2-580.95.05-1_arm64.deb
sudo cp /var/cuda-repo-ubuntu2404-13-0-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
```

And now you need to tell Triton to use the new `ptxas`:

```bash
# assuming CUDA 13.0 is installed at /usr/local/cuda-13.0
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH}
```

**Run pre-training**

Now you should be able to run pre-training on your DGX Spark with the usual command from the nanochat instructions:

```bash
torchrun --standalone --nproc_per_node=gpu -m scripts.base_train -- --depth=20
```

That's what did the trick for me, and here is the result of nanochat running on my DGX Spark:

[screenshot of the training output]

As you can see in the above output, it still shows a brief warning that the GB10 has more CUDA capability than what PyTorch currently supports, but training now runs correctly on the DGX Spark. Btw, I expect the pre-training to run for a few days, so I actually did all of the above in a …
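Before kicking off a multi-day run, a quick sanity check that Triton is pointed at the CUDA 13.0 `ptxas` and that the cu130 torch wheel sees the GPU can save a lot of grief; a sketch, assuming the paths from the steps above:

```bash
# verify the ptxas override and the torch/CUDA pairing before a multi-day run
echo "$TRITON_PTXAS_PATH" && "$TRITON_PTXAS_PATH" --version | tail -n 1
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```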
-
@afalk42 thanks for sharing your steps! Curious why in the command …
-
The DGX Spark is primarily built for inference workloads because it has low memory bandwidth and high memory capacity. That said, I am going to be optimistic and try running nanochat on my laptop, which has a 3050 Ti Mobile with 4 GB VRAM xD. Let's see, will update.
-
I suggest the non-DGX Spark threads be moved to their own discussions, instead of cluttering this one.
-
Thank you for the detailed steps. I am able to run it on my DGX Spark; now I will wait ~9 days for this to finish. :)

```
step 00207/21400 (0.97%) | loss: 4.315866 | lrm: 1.00 | dt: 39788.09ms | tok/sec: 1,647 | mfu: 4.65 | total time: 130.30m
```
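The ~9-day estimate follows directly from that log line; a rough extrapolation, assuming dt stays near 39.8 s/step for all 21,400 steps:

```bash
# 21,400 steps x ~39.8 s/step ≈ 236 hours ≈ 9.9 days
python -c "print(21400 * 39.8 / 3600 / 24)"
```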
-
Excited to report early numbers from a multinode pretraining run on 2 DGX Sparks linked with a 200 Gbps interconnect. The depth is still 20 and batch size per device is 32, which uses about 90GB of unified memory per device. The reported tokens/sec increases from 1,600 to 6,600. I extrapolate a completion time of …

I expect an additional 4-6x speedup once I have my NVFP4 pretraining pipeline working, which should bring the total pretraining time down to roughly 1 day. The real question will be how much the quality degrades.

For reference, I get about 12,000 tokens/sec on my single H100 server with an extrapolated completion time of 1 day. Andrej reports 1.1 million tokens/sec on 8 x H100s and a completion time of 3 hours.
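For anyone curious what a two-node launch looks like in practice, this is roughly the shape of it; a sketch with a placeholder address, not the poster's actual command (nanochat's pretraining is plain torchrun/DDP underneath):

```bash
# node 0 (MASTER_IP = the first Spark's address on the 200 Gbps link)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
  --master_addr=MASTER_IP --master_port=29500 \
  -m scripts.base_train -- --depth=20

# node 1: identical command, but with --node_rank=1
```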
-
Lurker here, thanks everyone for putting such good information in this thread! I got my DGX Spark yesterday, have been playing with it this morning, and have had some early success with this project (largely due to this thread). I think I got NVFP4 working. It's now cooking about 8x faster (1,650 t/s -> 13,000 t/s). I also changed a couple of other things: learnable RMSNorm, SwiGLU for the activation function, and Transformer Engine fused attention for training.

I used ChatGPT-5 Pro to come up with some optimizations and had it make a "guide for a junior developer" to implement what looked like the biggest-impact optimizations. This file is what I used to modify nanochat to use NVFP4. I didn't quantize the lm_head, but all the other linear layers should be using NVFP4 now. RAM is holding steady at 114 GB with 20 layers and the normal batch size of 32. I am going to give it an hour or two before trying to estimate the total time to do pre-training.
-
Mine finished pre-training a couple of days ago. Moved a few steps forward, but it is erroring out at this step: …

Going to seek help from some coding assistants. Just posting here in case someone has already seen this and solved it. Thanks
-
Since the DGX Spark is based on the Grace Blackwell architecture and natively supports mxfp4/nvfp4, can we leverage this to switch from bfloat16 to mxfp4?
-
Congrats on successful training on the Spark, and thanks for sharing. What's your experience with the thermal performance of the machine during a 9+ day session? I found a few heat-related reports on NVIDIA's DGX Spark forum and wonder how the heat would affect the job.
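For anyone wanting hard numbers on this during a long run, logging temperature and power alongside training is straightforward; a sketch, assuming the Spark's driver exposes the standard nvidia-smi query fields:

```bash
# append temperature, power draw and GPU utilization to a CSV every 30 s
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu \
  --format=csv -l 30 >> thermal_log.csv
```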
-
Mine just finished the mid-point checkpoint after 10 days. What would it take to change the dtype to fp4, and would that reduce training time?
-
Strix Halo 128GB user reporting in. I just got it compiling (no crashes so far), and here are my current stats: … mfu: 1.13 | total time: 73.63m

You can find all the details on the master branch of my fork: https://github.com/LokiMetaSmith/nanochat/tree/master
-
Has anyone tried the $1,000 run (d32)? We've seen that the $100 run (d20) takes about 10 days on a single Spark and 5 days on a pair. I'm curious how long it would take a dual-Spark setup to complete the d32 run. I'm assuming it would take about a month, but that's just my crude estimation. If someone who owns a Spark could just boot it up and get a few steps in, we could extrapolate the total time from there.

I ask this because I'm trying to understand whether the Spark is a feasible device for training small (<2B) LLMs and conducting ablation studies using smaller variants of a given model architecture. I'm optimistic that if the training pipeline were rewritten to train in nvfp4, which the Spark is optimized for, a dual-Spark setup could be a viable way of training small LLMs as a cheap (albeit much slower) alternative to serious clusters.

Also, is anyone aware whether more than 2 Sparks can be connected to each other? Is this natively supported? Could 4 or 8 Sparks have the potential to train reasonably sized models if that were the case?
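To get the few-steps measurement suggested above, something like the following should be enough; a sketch that assumes the d32 run only differs from the d20 command by the `--depth` flag plus whatever `--device_batch_size` fits in the Spark's memory:

```bash
# run a handful of d32 steps, note the reported dt (ms/step), then Ctrl-C;
# total time ≈ dt x the total step count shown in the log
torchrun --standalone --nproc_per_node=gpu -m scripts.base_train -- --depth=32 --device_batch_size=16
```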
-
"The best ChatGPT that $8 can buy?" Running pretraining on the DGX Spark seems to be remarkably economical. For my 9-day pretraining run, the system drew about 120 watts (often less, but rounding up), resulting in roughly $8 of electricity usage at local utility rates.

But to be fair, we need to include other costs to compare against cloud pricing. Even when factoring in hardware depreciation, the numbers stay favorable: assuming a 3–4 year lifespan for the $4,000 Spark, depreciation comes out to about $2.74–$3.65 per day, or $25–$33 for the full run. Altogether, the total cost to train locally was at most about $41. So I guess maybe, "The best ChatGPT that $41 can buy?" 😁
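The $8 figure is easy to sanity-check; a rough calculation, assuming a utility rate of about $0.30/kWh (the actual local rate isn't stated above):

```bash
# 120 W x 24 h x 9 days ≈ 26 kWh; at ~$0.30/kWh that is roughly $8
python -c "kwh = 0.120 * 24 * 9; print(round(kwh, 1), 'kWh ->', round(kwh * 0.30, 2), 'USD')"
```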
-
I'm pretraining nanochat on a DGX Spark (ASUS) and getting ~17.5k tokens/sec (logs below). I tried switching to NVFP4 using https://github.com/alint77/nanogpt-fp8 to improve throughput. Everything looks correct to me, but throughput actually dropped to 13k tokens/sec.
-
I mashed together all of the data from this thread and some other discussions and got an implementation working with the NVIDIA container on the Spark. It is in pre-training now and seems to be working well. Thanks to everyone for putting their notes here; I made sure to reference this thread in my notes: https://github.com/openmarmot/tech-notes/tree/main/nvidia/DGX_Spark/nanochat
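For reference, the container route generally comes down to mounting the nanochat checkout into NVIDIA's PyTorch image and working inside it; a sketch, not the poster's exact setup (the image tag is an assumption, and the host needs the NVIDIA Container Toolkit installed):

```bash
# drop into NVIDIA's PyTorch container with the nanochat checkout mounted
docker run --gpus all -it --rm \
  -v "$PWD":/workspace/nanochat -w /workspace/nanochat \
  nvcr.io/nvidia/pytorch:25.09-py3
```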
-
If anyone is interested, I have been experimenting with quantized training using the low-precision mxfp8/nvfp4 datatypes supported on Blackwell GPU hardware (like the B200+). I tried both NVIDIA Transformer Engine and the torchao framework on a fork of nanochat from a month or so back: https://github.com/gopitk/nanochat/tree/mxfp8TE (mxfp8 and nvfp4). These data types help maintain accuracy while making things more efficient (~20% faster out of the box with Transformer Engine; not yet the case with torchao, probably unless the model is much larger/wider; all my tests were with a B200, not a DGX Spark). Appreciate any feedback.
-
What is the fastest training performance people are seeing on the DGX Spark for d20? And what optimizations (e.g. precision) were used to get that performance? I'm getting ~16.5-17k tok/sec using bfloat16; the only extra optimization I did was to build PyTorch from source so it would support …

Logs from the training run:
-
Pre-training on the DGX Spark is no longer working for me after this commit. Here is the bug I'm running into: …

Anyone else seeing this error? Anyone got a fix?
-
The new Spark has 120GB of available RAM, which makes things harder since the training scripts are currently set up to expect 8xH100, i.e. 640GB.
I turned off parallel GPUs and got an out-of-memory error; presumably I need to reduce batch sizes or similar.
Anyone else trying to get this to work on the new Spark machines?