
Add TorchNEP: a pure-PyTorch NEP4 training framework #1574

Open
mushroomfire wants to merge 8 commits into brucefan1983:master from mushroomfire:master


@mushroomfire

Contributor

Summary

This PR adds TorchNEP, a pure-PyTorch implementation of the NEP4 training framework, as a self-contained torchnep/ subproject under the GPUMD repository. It produces GPUMD-compatible nep.txt potentials and is fully interoperable with GPUMD (a model trained by TorchNEP loads and runs in GPUMD, and vice versa).

Key features:

  • GPUMD-compatible nep.txt output and bit-for-bit descriptor parity with GPUMD (verified against baked GPUMD references in the test suite)
  • Two-stage training (force-focused → energy-focused), fine-tuning, and model slimming
  • Single-GPU/CPU training plus data-sharded multi-GPU/multi-node training via DDP
  • Optional ZBL repulsion and an ASE calculator interface

Modification

  • Add the torchnep/ Python package (model, descriptors/ops, training, prediction, ASE calculator, neighbor search).
  • Add torchnep/tests/ — a pure-pytest suite (GPUMD parity, descriptors, neighbor lists, parsing, ASE), with baked reference fixtures so it runs on CPU without a GPUMD build.
  • Add a worked example under torchnep/example/PbTe/.
  • Add packaging (pyproject.toml, README.md, and a GPL-3.0-or-later LICENSE) so TorchNEP can be published to PyPI.
  • Add two GitHub Actions workflows (scoped to torchnep/** so ordinary GPUMD PRs are unaffected):
      • torchnep-test.yml: runs the CPU pytest suite, triggered only when TorchNEP .py/pyproject.toml files change.
      • torchnep-publish.yml: builds and publishes to PyPI on a torchnep-v* tag, via PyPI Trusted Publishing.

Others

The CI tests run on CPU only and need no GPUMD binary (parity is checked against committed reference fixtures).
No changes to any existing GPUMD C++/CUDA sources, build system, or workflows; this PR is purely additive.

@Dankomaister

Contributor

This is great!

I tested TorchNEP on some of my more challenging systems and the accuracy is much better than SNES-NEP, not to mention it is significantly faster to train.
A few things that would be nice to have in TorchNEP:

  • Support for validation data (i.e. test.xyz in SNES-NEP), for use with lr_scheduler and for determining nep_best.txt. I have some systems that overfit with SNES-NEP, and it would be nice to detect this in TorchNEP as well.
  • Unique training runs. As far as I can see, the descriptor parameters and NN weights are initialized randomly for each new run, but the same seed (the epoch number) is used when shuffling structures. Thus, each independent run will see the same order of structures, which is not ideal when training an ensemble for active learning.
  • A more robust xyz reader. The code crashes on some of my xyz datasets because they have a space between the energy key and its value, for example energy= 0.2151.
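
To illustrate the point about unique training runs: a minimal sketch of a shuffle whose order depends on both a per-run seed and the epoch. The names `shuffled_indices` and `run_seed` are hypothetical and not taken from TorchNEP; how TorchNEP actually seeds its shuffling is only inferred from the description above.

```python
import random


def shuffled_indices(n_structures: int, epoch: int, run_seed: int) -> list[int]:
    """Return a structure ordering that is reproducible for a given
    (run_seed, epoch) pair but differs between independent runs."""
    # Hypothetical scheme: seed a private generator from both values,
    # instead of from the epoch number alone.
    rng = random.Random(f"{run_seed}:{epoch}")
    order = list(range(n_structures))
    rng.shuffle(order)
    return order


# Same run and epoch: identical order. Different run_seed: a different order.
order_a = shuffled_indices(50, epoch=0, run_seed=1)
order_b = shuffled_indices(50, epoch=0, run_seed=2)
```

With something like this, each ensemble member could be given its own `run_seed` while individual runs stay reproducible.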

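For the xyz reader, a small sketch of a comment-line parser that tolerates whitespace around `=`. This is not TorchNEP's reader; `parse_comment_line` is a hypothetical helper, and it only handles `key=value` pairs with bare or double-quoted values.

```python
import re

# A key, optional whitespace around "=", then a double-quoted or bare value.
_KEY_VALUE = re.compile(r'(\w+)\s*=\s*("[^"]*"|\S+)')


def parse_comment_line(line: str) -> dict[str, str]:
    """Parse key=value pairs from an extended-XYZ comment line,
    accepting forms such as 'energy= 0.2151' and 'energy=0.2151'."""
    return {key: value.strip('"') for key, value in _KEY_VALUE.findall(line)}


line = 'energy= 0.2151 Lattice="1 0 0 0 1 0 0 0 1" pbc="T T T"'
fields = parse_comment_line(line)
energy = float(fields["energy"])
```

One limitation of this sketch: a key with an empty value followed by another key would be misparsed, so a real reader would need stricter handling.
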
/Daniel
