new Mac/Linux launch script framework: modular, extensible and robust
Verified on all relevant Bash versions:

- Linux Bash 5.2.26(1), a very recent release.
- Mac Bash 3.2.57(1), since Apple ships an outdated Bash due to the GPL license change in newer versions.

---

Fixes the following bugs and issues from the old scripts:

- There was no error checking whatsoever: if a command failed, the old scripts just happily continued with the next line. The new framework verifies the result of every executed line of code, and exits if there's even a single error.
- The Python version was checked on the host, and failed if the host's Python was wrong, instead of being checked inside the Conda/venv environment, which defeated the entire purpose of having a Conda/venv in the first place. The check now verifies the actual environment, not the host.
- The Conda environment check was deeply flawed. It searched for `ot` anywhere in the output of `conda info --envs`, meaning that if the letters "ot" appeared anywhere, it happily assumed that the OneTrainer Conda environment existed. For example, `notcorrect /home/otheruser/foo` would have incorrectly matched the old "ot" check. We now use a strict check instead, to ensure that the exact environment exists.
- The old scripts checked for CUDA by looking for a developer binary, `nvcc`, which doesn't exist with normal NVIDIA drivers, thereby failing to detect CUDA on most modern systems. The check now looks for either `nvidia-smi` (normal drivers) or `nvcc` (CUDA developer tools) to detect NVIDIA users. We could have removed `nvcc` entirely, but it doesn't hurt to keep it.
- Conda was not detected at all if Conda's shell startup hook had executed, since the hook shadows `conda` as a shell function rather than a binary, which therefore failed the `command -v conda` check. This has been corrected to accurately detect Conda's path regardless of circumstances.
- The old method of launching Conda was absolutely nuts. It created a new sub-shell, sourced `.bashrc` to pretend to be an interactive user session, then ran `conda activate` followed by `python`. None of that was correct, and it was extremely fragile (not to mention having zero error checking). The `conda activate` command is ONLY meant for interactive user sessions, NOT for scripting, and its behavior is very unreliable. We now use the correct `conda run` scripting command instead!
- The old method of "reinstalling requirements.txt" was incorrect. All it did was `pip install --force-reinstall`, which just forces pip to reinstall its own old, outdated, cached versions of the packages from disk, even if they were already installed. So it wasted a LOT of time, and still only upgraded requirements if the on-disk versions no longer satisfied "requirements.txt" at all (such as when a minimum version constraint had been raised). It also never updated deeper dependencies in the chain: if "PyTorch" depends on "numpy", and "numpy" depends on "otherlibrary" without a version constraint, then "otherlibrary" was never updated, since pip saw that the user already had a version satisfying the (unchanged) constraint and kept the old one. All of that has been completely overhauled: we now tell pip to eagerly upgrade every dependency to the latest version that satisfies "requirements.txt", thereby ensuring that all libraries end up at the same versions as a fresh install of "requirements.txt". A true upgrade. It's also much, much faster, since it now only reinstalls libraries that have actually changed!
- The old scripts did not handle the working directory at all, which meant the user had to manually `cd OneTrainer` before being able to run any of the shell scripts. The working directory is now always set to the project directory, so that all resources can be found.
- All of the old checks for executable binaries, venv directories, etc. used a mixture of a few modern and mostly very outdated Bash programming methods, and were therefore very fragile. For example, if the `command -v` lookup for a binary returned a path with spaces, the old checks failed to find that binary at all.
- Previous checks for the existence of a venv only looked for the directory, which could easily give false positives. We now check for the venv's `bin/activate` file instead, to be sure that the given venv path is truly a venv.
- The old Python version check was very flimsy: it executed two Python commands and checked each version component one by one in unreliable Bash code, then printed two duplicated, subtly different error messages, instead of just checking both bounds at once. This has been completely overhauled with a new version check utility script (compatible with Python 2+), which takes the "minimum Python version" and "too high version" requirements and verifies that the Python interpreter falls within the desired range. It supports `MAJOR`, `MAJOR.MINOR` and `MAJOR.MINOR.PATCH` version specifiers, giving developers complete flexibility to specify exactly which Python versions OneTrainer needs. The Windows batch scripts should definitely be revised to use the same utility script. Lastly, we now print a single, unified and improved error message.
- The previous version check error message recommended the huge, 3+ GB Anaconda distribution, which contains around 2000 pre-installed scientific libraries, when Miniconda is much better. Miniconda is just the official package manager, which installs exactly what you need on demand instead of slowly pre-installing tons of bloat that you don't need. The error message has also been improved to describe how to use `pyenv` to achieve a valid Python venv environment without needing to use Anaconda at all.
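The stricter detection described above could be sketched roughly like this (an illustrative sketch, not the actual `lib.include.sh`; the function names are hypothetical):

```shell
#!/usr/bin/env bash
# Illustrative sketch only; helper names are hypothetical.
set -euo pipefail  # exit on any failed command, unset variable, or broken pipe

# Strict Conda environment check: the env name must appear as the first
# whitespace-delimited field of a line of `conda info --envs` output, so a
# stray substring like "notcorrect /home/otheruser/foo" can never match "ot".
conda_env_exists() {
    local env_list="$1" env_name="$2"
    echo "$env_list" | awk '!/^#/ { print $1 }' | grep -x "$env_name" > /dev/null
}

# NVIDIA detection: normal drivers ship `nvidia-smi`; `nvcc` only exists
# with the CUDA developer toolkit, so we accept either.
has_nvidia_gpu() {
    command -v nvidia-smi > /dev/null 2>&1 || command -v nvcc > /dev/null 2>&1
}
```

In the real script the environment list would come from the detected Conda binary's `info --envs` output, and launching would go through `conda run` rather than sourcing `.bashrc` and calling `conda activate`.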
---

New features in the new launch script framework:

- All code is unified into a single library file, `lib.include.sh`, which is intentionally marked as non-executable (since it's only used by other scripts). There is no longer any fragile code duplication anywhere.
- All shell scripts are now only a few lines of code, importing the library to achieve their tasks effortlessly. This also makes it incredibly easy to create additional shell scripts for the other OneTrainer tools, if desired.
- The new library is written from the ground up with modern best practices and shell functions, as a modular and easily extensible framework for any future project requirements.
- All script output is now clearly prefixed with "[OneTrainer]", creating visible separation between random third-party log output and the lines of text that come from OneTrainer's own shell scripts.
- The commands we execute are now visibly displayed to the user, so they can see exactly what the launch scripts are doing. This helps users and developers alike by producing better action logs.
- Pip handling is improved to always invoke `pip` as a Python module (`python -m pip`), getting rid of the unreliable `pip` binary.
- Before installing any requirements, we now always upgrade `pip` and `setuptools` to the newest versions, which often contain bug fixes. This ensures the smoothest possible dependency installations.
- Environment variable handling has been completely overhauled, using best practices for variable names: ALL_CAPS naming patterns and a unique prefix to separate them from other variables. They are all prefixed with `OT_` to avoid the risk of name clashes with system variables.
- All important features of the scripts are now configurable via environment variables (instead of having to edit the script files), all of which have new and improved defaults as well:
  * `OT_CONDA_CMD`: Sets a custom Conda command or an absolute path to the binary (useful when it isn't in the user's `PATH`). If nothing is provided, we detect and use `CONDA_EXE`, a variable set by Conda itself which always points at the user's installed Conda binary.
  * `OT_CONDA_ENV`: Sets the name of the Conda environment. Now defaults to the clear and purposeful "onetrainer", since "ot" was incredibly generic and could clash with people's existing Conda environments.
  * `OT_PYTHON_CMD`: Sets the host's Python executable, which is used for creating the Python venvs. This setting is mostly useless, since the default `python` is correct for the host in pretty much 100% of cases, but it doesn't hurt to let people configure it.
  * `OT_PYTHON_VENV`: Sets the name (or absolute/relative path) of the Python venv, and now defaults to `.venv` (instead of `venv`), which is the standard practice for naming venv directories. Furthermore, the new code fully supports spaces in the path, which is especially useful when the venv is on another disk: `OT_PYTHON_VENV="/home/user/My Projects/Envs/onetrainer"` is now a completely valid environment path.
  * `OT_PREFER_VENV`: If set to "true" (defaults to "false"), Conda is ignored even if it exists on the system, and a Python venv is used instead. This lets people who use `pyenv` (to choose which Python version runs on the host) easily set up their desired venv environments without having to hack the launch scripts.
  * `OT_CUDA_LOWMEM_MODE`: If set to "true" (defaults to "false"), enables aggressive garbage collection in PyTorch to help with low-memory GPUs. The variable name is now very clear.
  * `OT_PLATFORM_REQUIREMENTS`: Lets the user override which platform-specific requirements.txt file to install. Defaults to "detect", which automatically detects whether you have an AMD or NVIDIA GPU, but people with multi-GPU systems can use this setting to force a specific GPU acceleration framework.
  * `OT_SCRIPT_DEBUG`: If set to "true" (defaults to "false"), enables debug printing. There is currently no debug printing in the scripts, but a `print_debug` shell function uses this variable and only prints to the screen if debugging is enabled, so script developers can easily activate debugging in the future.
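As a rough sketch of the configuration pattern (the variable names are from this commit, but the default-handling code below is a hypothetical illustration, not the actual script), Bash's `${VAR:-default}` expansion gives each `OT_` variable its documented default when unset or empty:

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the OT_* configuration defaults.
set -euo pipefail

# ${VAR:-default} falls back to the default when VAR is unset or empty.
OT_CONDA_ENV="${OT_CONDA_ENV:-onetrainer}"
OT_PYTHON_CMD="${OT_PYTHON_CMD:-python}"
OT_PYTHON_VENV="${OT_PYTHON_VENV:-.venv}"
OT_PREFER_VENV="${OT_PREFER_VENV:-false}"
OT_CUDA_LOWMEM_MODE="${OT_CUDA_LOWMEM_MODE:-false}"
OT_PLATFORM_REQUIREMENTS="${OT_PLATFORM_REQUIREMENTS:-detect}"
OT_SCRIPT_DEBUG="${OT_SCRIPT_DEBUG:-false}"

# All script output is prefixed with "[OneTrainer]" for visible separation.
print_info() { echo "[OneTrainer] $*"; }

# print_debug only produces output when OT_SCRIPT_DEBUG is "true".
print_debug() {
    if [[ "$OT_SCRIPT_DEBUG" == "true" ]]; then
        echo "[OneTrainer][debug] $*"
    fi
}
```

With this pattern, a user can override any setting for a single run by prefixing the launch command with variable assignments (e.g. `OT_PREFER_VENV=true`), without editing any script files.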