-
With 120GB the original `device_batch_size=32` should work, I think, because it's per device? On a single 4090 (24GB) I'm training with `device_batch_size=4`, and it's using 17 / 24 GB. First I tried with torchrun and …
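For anyone in the same single-GPU situation, the override is just a flag on the training script; a minimal sketch (the exact batch size that fits will depend on your card):

```bash
# single GPU, smaller per-device batch (e.g. 4 on a 24GB 4090)
torchrun --standalone --nproc_per_node=1 -m scripts.base_train -- --depth=20 --device_batch_size=4
```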
-
No, but I'm trying to train on 2 x NVIDIA RTX PRO 6000s with the same batch size as in the script (32); it takes about 20h for the pre-training step to finish (still training).
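For reference, the stock speedrun command already fans out over every visible GPU via `--nproc_per_node=gpu`; pinning it to the two cards explicitly would look roughly like this (a sketch, not the poster's exact invocation):

```bash
# two-GPU data-parallel launch with the default per-device batch size of 32
torchrun --standalone --nproc_per_node=2 -m scripts.base_train -- --depth=20
```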
-
I'm seeing pretty abysmal tok/sec on my DGX Spark. I haven't tuned anything tbh, just upgraded torch to 2.9 and set …
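Before tuning anything, it's worth confirming which wheel actually ended up in the venv, since a CPU or mismatched-CUDA build will also tank tok/sec; a quick check:

```bash
# confirm the torch build, its CUDA version, and that the GB10 is visible
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"
```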
-
So, does anyone have a duration estimate for training nanochat (either the $100 version or the $1000 version) on the DGX Spark?
-
I have managed to get it running on my DGX Spark just now natively (without resorting to a docker container), but it took a couple of hours of trial and error, going through documentation and forums, and consulting with GPT-5 to figure out all the steps necessary... This is the approach that worked for me:

**Clone repo and make modifications**

Get the repo and change into the project directory.

**Update requirements and switch to CUDA 13.0**

I found it necessary to increase the dependency requirements for torch to 2.9.0 and for triton to 3.5.0, and to switch from CUDA 12.8 to 13.0. To do that, you need to change `pyproject.toml` as follows:

```toml
[project]
name = "nanochat"
version = "0.1.0"
description = "the minimal full-stack ChatGPT clone"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
"datasets>=4.0.0",
"fastapi>=0.117.1",
"files-to-prompt>=0.6",
"numpy==1.26.4",
"psutil>=7.1.0",
"regex>=2025.9.1",
"setuptools>=80.9.0",
"tiktoken>=0.11.0",
"tokenizers>=0.22.0",
"torch>=2.9.0",
"triton>=3.5.0",
"uvicorn>=0.36.0",
"wandb>=0.21.3",
]
[build-system]
requires = ["maturin>=1.7,<2.0"]
build-backend = "maturin"
[tool.maturin]
module-name = "rustbpe"
bindings = "pyo3"
python-source = "."
manifest-path = "rustbpe/Cargo.toml"
[dependency-groups]
dev = [
"maturin>=1.9.4",
"pytest>=8.0.0",
]
[tool.pytest.ini_options]
markers = [
"slow: marks tests as slow (deselect with '-m \"not slow\"')",
]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
# target torch to cuda 13.0 or CPU
[tool.uv.sources]
torch = [
{ index = "pytorch-cpu", extra = "cpu" },
{ index = "pytorch-cu130", extra = "gpu" },
]
[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true
[[tool.uv.index]]
name = "pytorch-cu130"
url = "https://download.pytorch.org/whl/cu130"
explicit = true
[project.optional-dependencies]
cpu = [
"torch>=2.9.0",
]
gpu = [
"torch>=2.9.0",
]
[tool.uv]
conflicts = [
[
{ extra = "cpu" },
{ extra = "gpu" },
],
]
```

**Install UV, install repo dependencies, activate venv**

Now you're ready to continue following the installation instructions for nanochat as per #1, in particular the following steps to install UV, install all repo dependencies, and activate the venv:

```bash
# install uv (if not already installed)
command -v uv &> /dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
# create a .venv local virtual environment (if it doesn't exist)
[ -d ".venv" ] || uv venv
# install the repo dependencies
uv sync
# activate venv so that `python` uses the project's venv instead of system python
source .venv/bin/activate
```

**Build and train the tokenizer**

You can continue to follow the instructions to build the tokenizer:

```bash
# Install Rust / Cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
# Build the rustbpe Tokenizer
uv run maturin develop --release --manifest-path rustbpe/Cargo.toml
```

To download the training dataset:

```bash
python -m nanochat.dataset -n 240
```

And to train the tokenizer and evaluate it:

```bash
python -m scripts.tok_train --max_chars=2000000000
python -m scripts.tok_eval
```

If you haven't already done so previously, you should also download the eval bundle at this time:

```bash
curl -L -o eval_bundle.zip https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip
unzip -q eval_bundle.zip
rm eval_bundle.zip
mv eval_bundle "$HOME/.cache/nanochat"
```

**Install CUDA 13.0.2**

The next step in the nanochat instructions would be to run pre-training, but that step will fail, because the default `ptxas` does not support the DGX Spark's GB10 GPU. At this time, you need to go to the NVIDIA Developer website and install CUDA 13.0.2 manually by following the steps here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=arm64-sbsa&Compilation=Native&Distribution=Ubuntu&target_version=24.04&target_type=deb_local

In particular, this was the sequence that worked for me on the DGX Spark:

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/13.0.2/local_installers/cuda-repo-ubuntu2404-13-0-local_13.0.2-580.95.05-1_arm64.deb
sudo dpkg -i cuda-repo-ubuntu2404-13-0-local_13.0.2-580.95.05-1_arm64.deb
sudo cp /var/cuda-repo-ubuntu2404-13-0-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
```

And now you need to tell Triton to use the new `ptxas`:

```bash
# assuming CUDA 13.0 is installed at /usr/local/cuda-13.0
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH}
```

**Run pre-training**

Now you should be able to run pre-training on your DGX Spark with the usual command from the nanochat instructions:

```bash
torchrun --standalone --nproc_per_node=gpu -m scripts.base_train -- --depth=20
```

That's what did the trick for me, and here is the result of nanochat running on my DGX Spark:

[screenshot of the training output]

As you can see in the above output, it still shows a brief warning that the GB10 has more CUDA capability than what PyTorch currently supports, but training now runs correctly on the DGX Spark. Btw, I expect the pre-training to run for a few days, so I actually did all of the above in a …
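Before kicking off a multi-day run, a quick sanity check that Triton is pointed at the CUDA 13.0 `ptxas` and that the cu130 torch wheel sees the GPU can save a lot of grief; a sketch, assuming the paths from the steps above:

```bash
# verify the ptxas override and the torch/CUDA pairing before a multi-day run
echo "$TRITON_PTXAS_PATH" && "$TRITON_PTXAS_PATH" --version | tail -n 1
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```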
-
@afalk42 thanks for sharing your steps! Curious why in the command …
-
The DGX Spark is primarily built for inference workloads because it has low memory bandwidth and high memory capacity. That said, I am going to be optimistic and try running nanochat on my laptop, which has a 3050 Ti Mobile with 4 GB VRAM xD. Let's see, will update.
-
I suggest the non-DGX Spark threads be moved to their own discussions, instead of cluttering this one.
-
Thank you for the detailed steps. I am able to run it on my DGX Spark; now I will wait ~9 days for this to finish. :)

```
step 00207/21400 (0.97%) | loss: 4.315866 | lrm: 1.00 | dt: 39788.09ms | tok/sec: 1,647 | mfu: 4.65 | total time: 130.30m
```
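The ~9-day estimate follows directly from that log line; a rough extrapolation, assuming dt stays near 39.8 s/step for all 21,400 steps:

```bash
# 21,400 steps x ~39.8 s/step ≈ 236 hours ≈ 9.9 days
python -c "print(21400 * 39.8 / 3600 / 24)"
```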
-
Excited to report early numbers from a multinode pretraining run on 2 DGX Sparks linked with a 200 Gbps interconnect. The depth is still 20 and batch size per device is 32, which uses about 90GB of unified memory per device. The reported tokens/sec increases from 1,600 to 6,600. I extrapolate a completion time of …

I expect an additional 4-6x speedup once I have my NVFP4 pretraining pipeline working, which should bring the total pretraining time down to roughly 1 day. The real question will be how much the quality degrades.

For reference, I get about 12,000 tokens/sec on my single H100 server with an extrapolated completion time of 1 day. Andrej reports 1.1 million tokens/sec on 8 x H100s and a completion time of 3 hours.
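For anyone curious what a two-node launch looks like in practice, this is roughly the shape of it; a sketch with a placeholder address, not the poster's actual command (nanochat's pretraining is plain torchrun/DDP underneath):

```bash
# node 0 (MASTER_IP = the first Spark's address on the 200 Gbps link)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
  --master_addr=MASTER_IP --master_port=29500 \
  -m scripts.base_train -- --depth=20

# node 1: identical command, but with --node_rank=1
```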
-
Lurker here, thanks everyone for putting such good information in this thread! I got my DGX Spark yesterday, have been playing with it this morning, and have had some early success with this project (largely due to this thread). I think I got NVFP4 working. It's now cooking about 8x faster (1,650 t/s -> 13,000 t/s). I also changed a couple of other things: learnable RMSNorm, SwiGLU for the activation function, and Transformer Engine fused attention for training.

I used ChatGPT-5 Pro to come up with some optimizations and had it make a "guide for a junior developer" to implement what looked like the biggest-impact optimizations. This file is what I used to modify nanochat to use NVFP4. I didn't quantize the lm_head, but all the other linear layers should be using NVFP4 now. RAM is holding steady at 114 GB with 20 layers and the normal batch size of 32. I am going to give it an hour or two before trying to estimate the total time to do pre-training.
-
Mine finished pre-training a couple of days ago. Moved a few steps forward, but it is erroring out at this step: …

Going to seek help from some coding assistants. Just posting here in case someone has already seen this and solved it. Thanks
-
Since the DGX Spark is based on the Grace Blackwell architecture and natively supports mxfp4/nvfp4, can we leverage this to switch from bfloat16 to mxfp4?
-
Congrats on successful training on the Spark, and thanks for sharing. What's your experience with the thermal performance of the machine during a 9+ day session? I found a few heat-related reports on NVIDIA's DGX Spark forum and wonder how the heat would affect the job.
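For anyone wanting hard numbers on this during a long run, logging temperature and power alongside training is straightforward; a sketch, assuming the Spark's driver exposes the standard nvidia-smi query fields:

```bash
# append temperature, power draw and GPU utilization to a CSV every 30 s
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu \
  --format=csv -l 30 >> thermal_log.csv
```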
-
Mine just finished the mid-point checkpoint after 10 days. What would it take to change the dtype to fp4, and would that reduce training time?
-
Strix Halo 128GB user reporting in. I just got it compiling (no crashes so far), and here are my current stats: … mfu: 1.13 | total time: 73.63m

You can find all the details on the master branch of my fork: https://github.com/LokiMetaSmith/nanochat/tree/master
-
Has anyone tried the $1,000 run (d32)? We've seen that the $100 run (d20) takes about 10 days on a single Spark and 5 days on a pair. I'm curious how long it would take a dual-Spark setup to complete the d32 run. I'm assuming it would take about a month, but that's just my crude estimation. If someone who owns a Spark could just boot it up and get a few steps in, we could extrapolate the total time from there.

I ask this because I'm trying to understand whether the Spark is a feasible device for training small (<2B) LLMs and conducting ablation studies using smaller variants of a given model architecture. I'm optimistic that if the training pipeline were rewritten to train in nvfp4, which the Spark is optimized for, a dual-Spark setup could be a viable way of training small LLMs as a cheap (albeit much slower) alternative to serious clusters.

Also, is anyone aware whether more than 2 Sparks can be connected to each other? Is this natively supported? Could 4 or 8 Sparks have the potential to train reasonably sized models if that were the case?
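To get the few-steps measurement suggested above, something like the following should be enough; a sketch that assumes the d32 run only differs from the d20 command by the `--depth` flag plus whatever `--device_batch_size` fits in the Spark's memory:

```bash
# run a handful of d32 steps, note the reported dt (ms/step), then Ctrl-C;
# total time ≈ dt x the total step count shown in the log
torchrun --standalone --nproc_per_node=gpu -m scripts.base_train -- --depth=32 --device_batch_size=16
```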
-
"The best ChatGPT that $8 can buy?" Running pretraining on the DGX Spark seems to be remarkably economical. For my 9-day pretraining run, the system drew about 120 watts (often less, but rounding up), resulting in roughly $8 of electricity usage at local utility rates.

But to be fair, we need to include other costs to compare against cloud pricing. Even when factoring in hardware depreciation, the numbers stay favorable: assuming a 3–4 year lifespan for the $4,000 Spark, depreciation comes out to about $2.74–$3.65 per day, or $25–$33 for the full run. Altogether, the total cost to train locally was at most about $41. So I guess maybe, "The best ChatGPT that $41 can buy?" 😁
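The $8 figure is easy to sanity-check; a rough calculation, assuming a utility rate of about $0.30/kWh (the actual local rate isn't stated above):

```bash
# 120 W x 24 h x 9 days ≈ 26 kWh; at ~$0.30/kWh that is roughly $8
python -c "kwh = 0.120 * 24 * 9; print(round(kwh, 1), 'kWh ->', round(kwh * 0.30, 2), 'USD')"
```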
-
I'm pretraining nanochat on a DGX Spark (ASUS) and getting ~17.5k tokens/sec (logs below). I tried switching to NVFP4 using https://github.com/alint77/nanogpt-fp8 to improve throughput. Everything looks correct to me, but throughput actually dropped to 13k tokens/sec.
-
I mashed together all of the data from this thread and some other discussions and got an implementation working with the NVIDIA container on the Spark. It is in pre-training now and seems to be working well. Thanks to everyone for putting their notes here; I made sure to reference this thread in my notes: https://github.com/openmarmot/tech-notes/tree/main/nvidia/DGX_Spark/nanochat
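For reference, the container route generally comes down to mounting the nanochat checkout into NVIDIA's PyTorch image and working inside it; a sketch, not the poster's exact setup (the image tag is an assumption, and the host needs the NVIDIA Container Toolkit installed):

```bash
# drop into NVIDIA's PyTorch container with the nanochat checkout mounted
docker run --gpus all -it --rm \
  -v "$PWD":/workspace/nanochat -w /workspace/nanochat \
  nvcr.io/nvidia/pytorch:25.09-py3
```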
-
If anyone is interested, I have been experimenting with quantized training using the low-precision mxfp8/nvfp4 datatypes supported on Blackwell GPU hardware (like the B200+). I tried both NVIDIA Transformer Engine and the torchao framework on a fork of nanochat from a month or so back: https://github.com/gopitk/nanochat/tree/mxfp8TE (mxfp8 and nvfp4). These data types help maintain accuracy while making things more efficient (~20% faster out of the box with Transformer Engine; not yet the case with torchao, probably unless the model is much larger/wider; all my tests were with a B200, not a DGX Spark). Appreciate any feedback.
-
What is the fastest training performance people are seeing on the DGX Spark for d20? And what optimizations (e.g. precision) were used to get that performance? I'm getting ~16.5-17k tok/sec using bfloat16; the only extra optimization I did was to build PyTorch from source so it would support …

Logs from the training run:
-
Pre-training on the DGX Spark is no longer working for me after this commit. Here is the bug I'm running into: …

Anyone else seeing this error? Anyone got a fix?
-
The new Spark has 120GB of available RAM, which makes things harder since the training scripts are currently set up to expect 8xH100, i.e. 640GB.
I turned off parallel GPUs and got an out-of-memory error; presumably I need to reduce batch sizes or similar.
Anyone else trying to get this to work on the new Spark machines?