
Conversation

CompRhys commented Oct 4, 2025

See #553

cw-tan (Collaborator) commented Oct 7, 2025

CI is erroring out because the ruff linter is failing, by the way. Anyway, @kavanase, could you please try packaging on a login node that is CPU-only, to see whether this fixes the problem?
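(For anyone without a CPU-only node handy, here's a minimal sketch of emulating one, assuming a standard PyTorch setup; hiding GPUs via `CUDA_VISIBLE_DEVICES` is general CUDA behavior, not something specific to this PR:)

```python
import os

# Hide all GPUs from CUDA before torch initializes, so packaging
# behaves as it would on a CPU-only login node.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

assert not torch.cuda.is_available()  # torch now sees a CPU-only machine
```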

CompRhys (Author) commented Oct 7, 2025

Thanks for the catch. I made some tweaks to the pre-commit hooks to make them consistent with the lint step in CI, because I ran into a black autoformatting loop when I first tried running pre-commit to fix the lint issue I had missed.

cw-tan (Collaborator) commented Oct 7, 2025

Thanks! I completely forgot to update our pre-commit hooks after our migration to ruff.

kavanase (Contributor) commented Oct 8, 2025

> Anyway, @kavanase, could you please try packaging on a login node that is CPU-only, to see whether this fixes the problem?
Sorry for the delay! Got held up with travel.

Here's the NequIP package file from a login node without access to a GPU (too large to upload directly here; 72 MB):
https://drive.google.com/file/d/1CR2uxLanZdorZyhYuVUcKuf9Ku1rMPvR/view?usp=sharing

CompRhys (Author) commented Oct 8, 2025

Thanks! Can these more portable models be uploaded to the endpoint for "nequip.net:mir-group/NequIP-OAM-L:0.1"?

cw-tan (Collaborator) commented Oct 8, 2025

I'm only guessing that packaging on CPU-only devices fixes this problem; we'd need to verify that first before updating the website links, etc.

kavanase (Contributor) commented Oct 8, 2025

There was an issue before with (accelerated) GPU inference when models were packaged on CPU-only devices though, right, @cw-tan? Is that avoided now?

CompRhys (Author) commented Oct 8, 2025

When MACE had similar issues in the past, the solution was to make it the default that the model is always cast to CPU before saving, regardless of device. I don't think you need to be on a machine that specifically has no GPU access. I'm not sure whether the inductor stage changes or complicates any of this.
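(To illustrate that cast-to-CPU-before-saving default, here's a minimal sketch with a generic torch module; `save_portable` is a hypothetical helper, not the actual MACE or NequIP packaging code:)

```python
import torch

def save_portable(model: torch.nn.Module, path: str) -> None:
    # Hypothetical helper: always serialize from CPU so no CUDA device
    # gets baked into the saved artifact, whatever device was used.
    device = next(model.parameters()).device
    model.to("cpu")
    torch.save(model.state_dict(), path)
    model.to(device)  # restore the original device for continued use
```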

cw-tan (Collaborator) commented Oct 8, 2025

> There was an issue before with (accelerated) GPU inference when models were packaged on CPU-only devices though, right, @cw-tan? Is that avoided now?

@kavanase Yes, that's resolved. The problem was more for Allegro; see mir-group/allegro@1b1b230.

> When MACE had similar issues in the past, the solution was to make it the default that the model is always cast to CPU before saving, regardless of device. I don't think you need to be on a machine that specifically has no GPU access. I'm not sure whether the inductor stage changes or complicates any of this.

@CompRhys Good point; it's potentially worth sending all models to CPU before packaging in https://github.com/mir-group/nequip/blob/main/nequip/scripts/package.py. The checkpoint loading is a bit automagical, since we just depend on Lightning:

```python
lightning_module = training_module.load_from_checkpoint(checkpoint_path)
```

so I'm unsure whether it does some magic device handling when loading. Regardless, hopefully a manual `.to("cpu")` in the package script is enough.
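(For concreteness, a sketch of what that might look like around the Lightning load above; `map_location` is a standard `load_from_checkpoint` argument, and the explicit `.to("cpu")` afterwards is belt and braces, not a confirmed fix:)

```python
# Map checkpoint tensors to CPU at load time, then make the device
# explicit before packaging so the exported artifact is portable.
lightning_module = training_module.load_from_checkpoint(
    checkpoint_path, map_location="cpu"
)
lightning_module = lightning_module.to("cpu")
```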

