new Mac/Linux launch script framework: modular, extensible and robust #477
Conversation
Force-pushed from 161cb4c to 45f86a3 (Compare)
Since I had some time left over while waiting on this, I implemented a few more improvements:

Added a Mac/Linux launcher script for custom CLI commands. This introduces `run-cmd.sh`, an intelligent wrapper which automatically runs the desired OneTrainer CLI script with the given command-line arguments. Example usage: `./run-cmd.sh train --config-path <path to your config>`
Added detailed documentation for the new launcher scripts. If someone edits the scripts in the future, they will need to ensure that the documentation stays up to date. But since the new framework is very flexible and robust, and Conda/Venv has a stable API which doesn't change over time, there will probably never be any need to change these scripts in the future.
Force-pushed from c54d180 to f9962de (Compare)
Force-pushed from 1309508 to 148b1b5 (Compare)
Verified on all relevant Bash versions:

- Linux Bash 5.2.26(1), a very recent release for Linux.
- Mac Bash 3.2.57(1), since Apple ships an outdated Bash version due to the GPL licensing change in newer versions.

---

Fixes the following bugs and issues from the old scripts (a sketch of the new approach follows this list):

- There was no error checking whatsoever, so if a command failed, the old scripts just happily continued with the next line. The new framework verifies the result of every executed command, and exits if there's even a single error.
- The Python version was being checked on the host, and failed if the host's Python was wrong, instead of checking inside the Conda/Venv environment, which defeated the entire purpose of having a Conda/Venv in the first place. The check now verifies the actual environment, not the host.
- The Conda environment check was very flawed. It searched for `ot` anywhere in the output of `conda info --envs`, meaning that if the letters "ot" appeared anywhere, it happily assumed that the OneTrainer Conda environment exists. For example, `notcorrect /home/otheruser/foo` would have incorrectly matched the old "ot" check. We now use a strict check instead, to ensure that the exact environment exists.
- The old scripts checked for CUDA by looking for a developer binary, `nvcc`, which doesn't exist with normal NVIDIA CUDA drivers, thereby failing to detect CUDA on all modern systems. The check now looks for either `nvidia-smi` (normal drivers) or `nvcc` (CUDA developer tools) to detect NVIDIA users. We could even have removed `nvcc` entirely, but it doesn't hurt to keep it.
- Conda was not detected at all if Conda's shell startup hook had executed, since the hook shadows `conda` with a shell function rather than a binary, which therefore failed the `command -v conda` check. Detection has now been corrected to find Conda's path regardless of circumstances.
- The old method of launching Conda was absolutely nuts. It created a new sub-shell, sourced the `.bashrc` file to pretend to be an interactive user session, then ran `conda activate` followed by `python`. None of that was correct, and it was extremely fragile (not to mention having zero error checking). The `conda activate` command is ONLY meant for interactive user sessions, NOT for scripting, and its behavior is very unreliable. We now use the correct `conda run` shell scripting command instead!
- The old method for "reinstalling requirements.txt" was incorrect. All it did was `pip --force-reinstall`, which forces pip to reinstall its own old, outdated, cached versions of the packages from disk, even if they were already installed. So all it did was waste a LOT of time, and it still only upgraded requirements whose on-disk versions no longer satisfied "requirements.txt" at all (such as when a minimum version constraint was raised). It never updated deeper dependencies in the chain either: if "PyTorch" depends on "numpy", and "numpy" depends on "otherlibrary" without a version constraint, then "otherlibrary" was never updated, since pip treated it as "the user has that library and the version constraint hasn't become invalid, so keep their old version". All of that has been completely overhauled: we now tell pip to eagerly upgrade every dependency to the latest versions that satisfy "requirements.txt", thereby ensuring that all libraries will be upgraded to the same versions as a fresh reinstall of "requirements.txt". A true upgrade. And it's also much, much faster, since it now only reinstalls libraries that have actually changed!
- The old scripts did not handle the working directory at all, which meant that the user had to manually `cd OneTrainer` before being able to run any of the shell scripts. The working directory is now always set to the project directory, so that all resources can be found.
- The old checks for executable binaries, venv directories, etc., used a mixture of a few modern and mostly very outdated Bash programming methods, and were therefore very fragile. For example, if the `command -v` lookup for a binary returned a path with spaces, the old checks failed to find that binary at all.
- Previous checks for the existence of a venv only looked for the directory, which could easily give false positives. We now check for the venv's `bin/activate` file instead, to be sure that the user's given venv path is truly a venv.
- The old Python version check was very flimsy: it executed two Python commands and checked each version component one by one in unreliable Bash code, then printed two duplicated, subtly different error messages, instead of just checking both bounds at once. This has been completely overhauled with a version-check utility script (compatible with Python 2+), which takes the "minimum Python version" and "too high version" requirements and verifies that the Python interpreter conforms to the desired version range. It supports `MAJOR`, `MAJOR.MINOR` and `MAJOR.MINOR.PATCH` version specifiers, to give developers complete flexibility to specify exactly which Python version OneTrainer needs. The Windows batch scripts should definitely be revised to use the same utility script. Lastly, we now print only a single, unified and improved error message.
- The previous version-check error message recommended the huge, 3+ GB Anaconda, which contains around 2000 pre-installed scientific libraries, when Miniconda is much better. Miniconda is just the official package manager, which installs exactly what you need on demand instead of slowly pre-installing tons of bloat that you don't need. The error message has also been improved to describe how to use `pyenv` to achieve a valid Python Venv environment without needing Anaconda at all.
- The previous `update.sh` script did not update OneTrainer if there were merge conflicts in the repository. It just continued onwards with the "reinstall pip dependencies" step as if nothing was wrong, even though the update hadn't been downloaded at all. We now abort the update process and let the user read the Git error message if there are any problems, so that they can see and manually resolve the merge conflicts in an appropriate way (such as by stashing or branching the local changes). This means that it's finally safe to update OneTrainer when you have local changes in the repository.
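To illustrate the approach behind several of these fixes, here's a minimal sketch of a fail-fast command runner, a strict Conda environment check, and the `conda run` invocation. The function names are illustrative, not the literal contents of `lib.include.sh`:

```bash
#!/usr/bin/env bash
# Fail-fast: exit on any error, unset variable, or pipeline failure.
set -euo pipefail

# Print each command before running it; the script aborts if it fails.
run_checked() {
    echo "[OneTrainer] + $*"
    "$@"
}

# Strict environment check: match the exact environment name in the
# first column, instead of grepping for a substring like "ot".
conda_env_exists() {
    "${CONDA_EXE:-conda}" env list | awk '{print $1}' | grep -qx -- "$1"
}

# Scripted Conda usage: "conda run" instead of sourcing .bashrc and
# calling "conda activate" (which is only meant for interactive shells).
run_in_conda() {
    run_checked "${CONDA_EXE:-conda}" run -n "$OT_CONDA_ENV" --no-capture-output "$@"
}
```

A launcher would then call something like `run_in_conda python scripts/train_ui.py`, matching the `conda run` lines visible in the transcripts below.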
---

New features in the new launch script framework:

- All code is unified into a single library file, `lib.include.sh`, which is intentionally marked as non-executable (since it's only sourced by other scripts). There is no longer any fragile code duplication anywhere.
- All shell scripts are now only a few lines of code, as they import the library to achieve their tasks effortlessly. This also makes it incredibly easy to create additional shell scripts for the other OneTrainer tools, if desired.
- The new library is written from the ground up to use modern best practices and shell functions, as a modular and easily extensible framework for any future project requirements.
- All script output is now clearly prefixed with "[OneTrainer]" to create visible separation between random third-party log output and the lines of text that come from OneTrainer's own shell scripts.
- The commands that we execute are now visibly displayed to the user, so that they can see exactly what the launch scripts are doing. This helps users and developers alike, by producing better action logs.
- The pip handling is improved to always invoke `pip` as a Python module, thus getting rid of the unreliable `pip` binary.
- Before installing any requirements, we now always upgrade `pip` and `setuptools` to the newest versions, which often contain bug fixes. This change ensures the smoothest possible dependency installations.
- Environment variable handling has been completely overhauled, using best practices for variable names: ALL_CAPS naming patterns and a unique `OT_` prefix to avoid the risk of name clashes with system variables.
- All important features of the scripts are now configurable via environment variables (instead of having to edit the script files), all of which have new and improved defaults as well (see the sketch after this list):
  * `OT_CONDA_CMD`: Sets a custom Conda command or an absolute path to the binary (useful when it isn't in the user's `PATH`). If nothing is provided, we detect and use `CONDA_EXE`, a variable set by Conda itself which always points at the user's installed Conda binary.
  * `OT_CONDA_ENV`: Sets the name of the Conda environment. Now defaults to the clear and purposeful "onetrainer", since "ot" was incredibly generic and could clash with people's existing Conda environments.
  * `OT_PYTHON_CMD`: Sets the host's Python executable, which is used for creating the Python Venvs. This setting is mostly useless, since the default `python` is correct for the host in pretty much 100% of all cases, but it doesn't hurt to let people configure it.
  * `OT_PYTHON_VENV`: Sets the name (or absolute/relative path) of the Python Venv, and now defaults to `.venv` (instead of `venv`), which is the standard practice for naming venv directories. Furthermore, the new code fully supports spaces in the path, which is especially useful when the venv is on another disk, such as `OT_PYTHON_VENV="/home/user/My Projects/Envs/onetrainer"`, which is now a completely valid environment path.
  * `OT_PREFER_VENV`: If set to "true" (defaults to "false"), Conda will be ignored even if it exists on the system, and Python Venv will be used instead. This ensures that people who use `pyenv` (to choose which Python version to run on the host) can easily set up their desired Python Venv environments, without having to hack the launch scripts.
  * `OT_CUDA_LOWMEM_MODE`: If set to "true" (defaults to "false"), enables aggressive garbage collection in PyTorch to help with low-memory GPUs. The variable name is now very clear.
  * `OT_PLATFORM_REQUIREMENTS`: Allows the user to override which platform-specific requirements.txt file they want to install. Defaults to "detect", which automatically detects whether you have an AMD or NVIDIA GPU. People with multi-GPU systems can use this setting to force a specific GPU acceleration framework.
  * `OT_SCRIPT_DEBUG`: If set to "true" (defaults to "false"), enables debug printing. Currently there's no debug printing in the scripts, but there's a `print_debug` shell function which uses this variable and only prints to the screen if debugging is enabled. This ensures that debugging can easily be activated by script developers in the future.
- This introduces `run-cmd.sh`, an intelligent wrapper which automatically runs the desired OneTrainer CLI script with the given command-line arguments. It takes care of configuring the correct Python environment and finally ensures that users will have a safe way to run custom commands. Example usage: `./run-cmd.sh train --config-path <path to your config>`
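For illustration, here's a rough sketch of how such `OT_` variables can be given overridable defaults, along with the `print_debug` helper mentioned above (illustrative shapes, not the literal code from `lib.include.sh`; the defaults match the descriptions in the list):

```bash
# Each variable keeps the user's value if already set, otherwise
# falls back to the documented default.
: "${OT_CONDA_ENV:=onetrainer}"
: "${OT_PYTHON_CMD:=python}"
: "${OT_PYTHON_VENV:=.venv}"
: "${OT_PREFER_VENV:=false}"
: "${OT_CUDA_LOWMEM_MODE:=false}"
: "${OT_PLATFORM_REQUIREMENTS:=detect}"
: "${OT_SCRIPT_DEBUG:=false}"

# Only prints when the user has opted into debug output.
print_debug() {
    if [[ "$OT_SCRIPT_DEBUG" == "true" ]]; then
        echo "[OneTrainer] $*"
    fi
}
```

A user override then simply takes precedence, e.g. `env OT_CONDA_ENV="my-env" ./start-ui.sh`, and quoting every expansion keeps paths with spaces working.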
Force-pushed from 148b1b5 to e5c1ab9 (Compare)
Don't mind the force-pushed changes. I've made some small revisions to the documentation for clarity, which can be seen in the "Compare" links above. I always want to ensure that beginners can understand everything, since that cuts down on the number of support tickets. :)
Thanks to Calamdor on Discord for discovering that multi-GPU systems should prefer NVIDIA GPU acceleration by default, since that's always expected to be the stronger GPU. The following fix has now been added:

Prioritize dedicated NVIDIA GPUs in multi-GPU systems: This fixes the issue where people may have integrated AMD graphics in their CPU along with a separate, dedicated NVIDIA GPU. By prioritizing NVIDIA, we ensure that the most likely dedicated GPU will be chosen in that scenario.
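As a sketch of that detection order (the requirements file names and the `rocminfo` check are assumptions for illustration; the real logic lives in the launch script library):

```bash
# NVIDIA is checked first: on systems with integrated AMD graphics plus
# a dedicated NVIDIA card, the dedicated NVIDIA GPU should win.
detect_gpu_requirements() {
    if command -v nvidia-smi > /dev/null 2>&1 || command -v nvcc > /dev/null 2>&1; then
        echo "requirements-cuda.txt"      # assumed filename
    elif command -v rocminfo > /dev/null 2>&1; then
        echo "requirements-rocm.txt"      # assumed filename and AMD check
    else
        echo "requirements-default.txt"   # assumed CPU/fallback filename
    fi
}
```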
Calamdor on Discord has now confirmed the following systems:
This is in addition to my thorough tests where I went through every action multiple times with both environment backends (Conda and Venv):
We just got confirmation from habshi on Discord:
Yeah, the bitsandbytes library needs a rewrite for Macs; it's unrelated to this PR. So that means we have the following extra confirmations:

That's 3 people across 6 operating environments, including Microsoft's WSL (it was great to see that tested). If we've had enough confirmations now, we can go ahead with the merge. :)
I just tried on my Arch derivative, EndeavourOS, and got a failure. I've installed Python 3.10 and 3.11 from the AUR.

Honestly, I don't know if you'd want to worry about this. Anyone foolish enough to use Arch should know what they're doing. Debian Bookworm ships 3.11 and Fedora has 3.10, so it's mostly Arch and the less popular distros. Would checking for python 3.10 in the PATH work with pyenv as well?
@ForgetfulWasAria Hey, thanks for the feedback. That's a user error, and the error message explains what's going on. :)
It means that your default `python` binary is too new. You have these solutions:

1. Install the required Python version via pyenv and set the project directory's Python version with `pyenv local <version>`.
2. Install Miniconda and let the launch scripts manage the environment automatically.
3. Override `OT_PYTHON_CMD` to point at a host Python binary of the correct version (such as `python3.10`). [1]
In any case, you will need to delete the previously created `.venv` first, since it was built with the wrong Python version.

[1] For example, here's what I see when forcing a venv to be created with my host's python3.12 binary:

$ python3.12 -m venv .venv
$ ls -l .venv/bin
total 48
-rw-r--r--. 1 johnny johnny 2040 Sep 28 17:00 activate
-rw-r--r--. 1 johnny johnny 920 Sep 28 17:00 activate.csh
-rw-r--r--. 1 johnny johnny 2199 Sep 28 17:00 activate.fish
-rw-r--r--. 1 johnny johnny 9033 Sep 28 17:00 Activate.ps1
-rwxr-xr-x. 1 johnny johnny 246 Sep 28 17:00 pip
-rwxr-xr-x. 1 johnny johnny 246 Sep 28 17:00 pip3
-rwxr-xr-x. 1 johnny johnny 246 Sep 28 17:00 pip3.12
lrwxrwxrwx. 1 johnny johnny 10 Sep 28 17:00 python -> python3.12
lrwxrwxrwx. 1 johnny johnny 10 Sep 28 17:00 python3 -> python3.12
lrwxrwxrwx. 1 johnny johnny 19 Sep 28 17:00 python3.12 -> /usr/bin/python3.12

As you can see, the third method should work for you too, since it will then link the venv's `python` to the exact host binary you specified. I would always recommend Pyenv though. With Pyenv there's never any risk that the Python package gets removed by the host's system updates. But I suspect that Arch will keep the `python3.10` binary for years, so feel free to use the third method!
Alright, because I was curious, I can confirm that the third method works if you prefer to do that. You only need to specify the `OT_PYTHON_CMD` variable the first time, when the venv is created.

A run where I let it use my system's default `python`:

$ env OT_PREFER_VENV="true" ./start-ui.sh
[OneTrainer] Using Python Venv environment in ".venv"...
[OneTrainer] Creating Python Venv environment in ".venv"...
[OneTrainer] + python -m venv .venv
[OneTrainer] + python scripts/util/version_check.py 3 3.11
Error: Your Python version is too high: 3.12.6 (main, Sep 9 2024, 00:00:00) [GCC 14.2.1 20240801 (Red Hat 14.2.1-1)]. Must be >= 3 and < 3.11.
[OneTrainer] Solutions: Either install the required Python version via pyenv (https://github.com/pyenv/pyenv) and set the project directory's Python version with "pyenv install <version>" followed by "pyenv local <version>", or install Miniconda if you prefer that we automatically manage everything for you (https://docs.anaconda.com/miniconda/). Remember to manually delete any previous Venv or Conda environment which was created with a different Python version.

I remove the Python 3.12 Venv:

$ rm -rf .venv

Now I force it to use `python3.10`:

$ env OT_PREFER_VENV="true" OT_PYTHON_CMD="python3.10" ./start-ui.sh
[OneTrainer] Using Python Venv environment in ".venv"...
[OneTrainer] Creating Python Venv environment in ".venv"...
[OneTrainer] + python3.10 -m venv .venv
[OneTrainer] + python scripts/util/version_check.py 3 3.11
[OneTrainer] Installing requirements in active environment...
[OneTrainer] + python -m pip install --upgrade --upgrade-strategy eager pip setuptools
^CERROR: Operation cancelled by user

Now all subsequent runs will use the Python 3.10 venv, so you no longer need to specify the version. I personally still need to specify `OT_PREFER_VENV="true"`, since I have Conda installed:

$ env OT_PREFER_VENV="true" ./start-ui.sh
[OneTrainer] Using Python Venv environment in ".venv"...
[OneTrainer] + python scripts/util/version_check.py 3 3.11
[OneTrainer] + python scripts/train_ui.py So feel free to use the Important notice: For most people, I would highly recommend Conda (for absolute beginners) and Pyenv (requires slightly more setup but is super reliable). I don't recommend using the host's python binary since you never know if it will stay the same version or get a breaking update, or even get removed by a system update. |
@Nerogar I'll add a small update to the docs to mention the host version override trick in more detail. After that, it's ready for merge.
I understand what's going on. The issue is that "bleeding edge" distros generally don't default to Python 3.10. Arch (and I think Gentoo) have packages that allow installing Python 3.10 with the binary name `python3.10`. Personally, since the various AI packages all need different Python versions, I tend to create the venv manually so that I know it has the needed Python version. So you could, for example:
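A minimal example of that manual approach, assuming the AUR package puts a `python3.10` binary on the `PATH`:

```bash
# Create the venv yourself with the exact Python version you need,
# then let the launch scripts detect and reuse it.
python3.10 -m venv .venv
env OT_PREFER_VENV="true" ./start-ui.sh
```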
And I can also confirm that adding the environment variable does work. Thanks!
It's generally a bad and flaky idea to use the system's Python binaries, so that check will not be added. Conda and Pyenv are the only ways to guarantee a working Python version that won't randomly change, get deleted, or have weird distro-specific tweaks. But I will document the override trick in more detail now and mention that advanced users can use it. Thanks for trying it out! :) I'll also add a written guide section that describes Conda and Pyenv in a bit more detail, rather than only mentioning them in the error message.
Added a Python version setup guide: This makes the setup instructions easier to follow, since they were previously only available as a short error message (which is displayed by the launch scripts whenever the user's Python version is incorrect). The new section can be previewed here.

Implemented Conda upgrade prompts for old Python environments: We now instruct users on how to upgrade their Conda environment to the required Python version, whenever their system contains an old environment.

Here's an example where I forced my Conda environment onto an unsupported Python version:

$ conda create -y -n onetrainer python=3.12
[...]
$ ./start-ui.sh
[OneTrainer] Using Conda environment with name "onetrainer"...
[OneTrainer] + /home/johnny/.local/share/miniconda/bin/conda run -n onetrainer --no-capture-output python scripts/util/version_check.py 3 3.11
Error: Your Python version is too high: 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0]. Must be >= 3 and < 3.11.
ERROR conda.cli.main_run:execute(125): `conda run python scripts/util/version_check.py 3 3.11` failed. (See above for error)
[OneTrainer] Solution: Switch your Conda environment to the required Python version by deleting your old environment, and then run OneTrainer again.
To delete the outdated Conda environment, execute the following command:
"/home/johnny/.local/share/miniconda/bin/conda" remove -y -n "onetrainer" --all
$ "/home/johnny/.local/share/miniconda/bin/conda" remove -y -n "onetrainer" --all
[...]
$ ./start-ui.sh
[OneTrainer] Using Conda environment with name "onetrainer"...
[OneTrainer] Creating Conda environment with name "onetrainer"...
[OneTrainer] + /home/johnny/.local/share/miniconda/bin/conda create -y -n onetrainer python==3.10
[...]

There are no breaking changes. We can go ahead with the merge. :) @Nerogar
@Nerogar Hold one moment. I noticed that Conda is a bit stupid. If you say "Python 3.10" then it installs 3.10.0, meaning it always uses the oldest bugfix release of the requested version.
Always use the latest bugfix releases of Python for Conda environments: Previously, Conda always used the ".0" release of the desired Python version (such as "3.10.0"), which is always full of bugs. We now specify that we want the latest bugfix/patch release of the required Python version. People who use pyenv (instead of Conda) don't need to worry about this change, since pyenv always installs the latest bugfix releases by default.

This is yet again a non-breaking change, so the merge is finally ready! :) @Nerogar
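For illustration, the difference comes down to Conda's version match specs. A sketch of the two styles, assuming Conda's standard behavior where `==` requires an exact version while a trailing `.*` matches the newest compatible release (the exact fix in the PR may be expressed differently):

```bash
# Old: exact match, which resolves to the ancient 3.10.0 release.
conda create -y -n onetrainer python==3.10

# New: fuzzy match, which resolves to the newest 3.10.x bugfix release.
conda create -y -n onetrainer "python=3.10.*"
```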
Force-pushed from dd150ff to edebb64 (Compare)
Force-pushed from edebb64 to 40cb4fa (Compare)
I recently provided a deeper explanation of our eager pip upgrade strategy. It's basically an explanation of why we do it, why it was necessary, and what it solves.

Historically, pip always upgraded to the newest versions that still satisfy the requirements. But since many people install dependencies manually via `pip install`, that eager behavior kept overwriting their manually chosen versions. So around 10 years ago, the decision was made to make pip "lazy by default". Nowadays, it keeps the currently installed package version, or installs outdated, locally cached downloaded packages on fresh installs.

(Other Python package managers don't have this problem at all, by the way. They usually keep a "pinned packages" history of exactly what has been manually installed in the past, and will update everything to the newest versions that satisfy all old, manual installs. But since pip is a basic package manager, they had to make their defaults reasonable for beginners.)

To still allow people to install actual package updates to fix bugs, they also added a custom upgrade strategy: `--upgrade-strategy eager`. That's the strategy we use. You can find it with `pip install --help`. So when users run our `update.sh` script, they get a true upgrade of every dependency.
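Concretely, the eager strategy boils down to running pip along these lines; the first command appears verbatim in the transcripts above, while the `-r requirements.txt` variant is the presumed equivalent for the project requirements:

```bash
# Upgrade the tooling itself first, then eagerly upgrade every direct and
# transitive dependency to the newest versions satisfying requirements.txt.
python -m pip install --upgrade --upgrade-strategy eager pip setuptools
python -m pip install --upgrade --upgrade-strategy eager -r requirements.txt
```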
Closes Pull Requests: #474 and #466 and Issues: #417