new Mac/Linux launch script framework: modular, extensible and robust #477

Merged · 9 commits · Sep 29, 2024

Conversation

@Arcitec (Contributor) commented Sep 26, 2024

Verified on all relevant Bash versions:

  • Linux Bash 5.2.26(1), a very recent release for Linux.

  • Mac Bash 3.2.57(1), since Apple uses an outdated Bash version due to GPL licensing changes in newer versions.


Fixes the following bugs and issues from the old scripts:

  • There was no error checking whatsoever, so if a command failed, the old scripts just happily continued with the next line. The new framework verifies the result of every executed line of code, and exits if there's even a single error.

  • The Python version was being checked on the host, and the check failed if the host's Python was wrong, instead of checking inside the Conda/Venv environment, which defeated the entire purpose of having a Conda/Venv in the first place. This check now verifies the actual environment, not the host.

  • The Conda environment check was very flawed. It searched for `ot` anywhere in the output of `conda info --envs`, meaning that if the letters "ot" appeared anywhere, it happily assumed that the OneTrainer Conda environment exists. For example, `notcorrect   /home/otheruser/foo` would have incorrectly matched the old "ot" check. We now use a strict check instead, to ensure that the exact environment exists (see the sketch after this list).

  • The old scripts checked for CUDA by looking for a developer binary, `nvcc`, which doesn't exist in normal NVIDIA CUDA drivers, thereby failing to detect CUDA on all modern systems. It has now been revised to look for either `nvidia-smi` (normal drivers) or `nvcc` (CUDA developer tools) to detect NVIDIA users (a detection sketch follows this list). We could even have removed `nvcc` entirely, but it didn't hurt to keep it.

  • It failed to detect Conda at all if Conda's shell startup hook had executed, since their hook shadows `conda` into becoming a shell function rather than a binary, which therefore failed the `command -v conda` check. That has now been corrected to accurately detect Conda's path regardless of circumstances.

  • The old method of launching Conda was absolutely nuts. It created a new sub-shell, sourced the `.bashrc` file to pretend to be an interactive user session, then ran `conda activate` followed by `python`. None of that was correct, and it was extremely fragile (not to mention having zero error checking). The `conda activate` command is ONLY meant for user sessions, NOT for scripting, and its behavior is very unreliable. We now use the correct `conda run` shell scripting command instead!

  • The old method for "reinstalling requirements.txt" was incorrect. All it did was `pip --force-reinstall`, which just forces pip to reinstall its own old, outdated, cached versions of the packages from disk, and tells it to reinstall them even if they were already installed. So all it did was waste a LOT of time, and it still only upgraded requirements if the on-disk versions no longer satisfied "requirements.txt" at all (such as if the minimum version constraint had been raised). It never updated deeper dependencies in the chain either, so if something like "PyTorch" depends on "numpy", and "numpy" depends on "otherlibrary" without a version constraint, then "otherlibrary" was not updated, since the old library on disk was still treated as "hey, the user has that library and the version constraint hasn't become invalid, so keep their old version". Now, all of that has been completely overhauled: We tell pip to eagerly upgrade every dependency to the latest versions that satisfy "requirements.txt", thereby ensuring that all libraries will be upgraded to the same versions as a fresh reinstall of "requirements.txt". A true upgrade. And it's also much, much faster, since it now only reinstalls libraries that have actually changed!

  • The old scripts did not handle the working directory at all, which meant that the user had to manually `cd OneTrainer` before being able to run any of the shell scripts. This has now been fixed so that the working directory is always the project directory, so that all resources can be found.

  • All of the old checks for executable binaries, venv directories, etc., used a mixture of a few modern and mostly very outdated Bash programming methods, and were therefore very fragile. For example, if the `command -v` lookup for a binary returned a path with spaces, then the old script's checks failed to find that binary at all.

  • Previous checks for the existence of a venv only looked for the directory, which could easily give false positives. We now check for the venv's `bin/activate` file instead, to be sure that the user's given venv path is truly a venv (see the sketch after this list).

  • The old Python version check was very flimsy, executing two Python commands and checking each version component one by one in unreliable Bash code, and then printing two duplicated, subtly different error messages, instead of just checking both at once. This has now been completely overhauled to introduce a version check utility script (compatible with Python 2+), which takes the "minimum Python version" and "too high version" requirements and then verifies that the Python interpreter conforms to the desired version range. It supports `MAJOR`, `MAJOR.MINOR` and `MAJOR.MINOR.PATCH` version specifiers, to give developers complete flexibility to specify exactly which Python version OneTrainer needs. The Windows batch scripts should definitely be revised to use the same utility script. Lastly, we only print a single, unified and improved error message now.

  • The previous version-check error message recommended the huge, 3+ GB Anaconda, which contains around 2000 pre-installed scientific libraries, when Miniconda is much better. Miniconda is just the official package manager, which then installs exactly what you need on-demand instead of slowly pre-installing tons of bloat that you don't need. The error message has also been improved to describe how to use `pyenv` to achieve a valid Python Venv environment without needing to use Anaconda at all.

  • The previous `update.sh` script did not update OneTrainer if there were merge conflicts in the repository. It just continued onwards with the "reinstall pip dependencies" step as if nothing was wrong, even though the update hadn't been downloaded at all. We now abort the update process and let the user read the Git error message if there are any problems, so that they can see and manually resolve the merge conflicts in an appropriate way (such as by stashing or branching the local changes). This means that it's finally safe to update OneTrainer when you have local changes in the repository.
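
For illustration, a strict environment check could look roughly like this; a minimal sketch assuming the `OT_CONDA_ENV` variable described below, not necessarily the exact code in `lib.include.sh`:

```bash
# Match the environment name as a whole word at the start of a line,
# instead of grepping for "ot" anywhere in the output.
OT_CONDA_ENV="${OT_CONDA_ENV:-onetrainer}"
if conda env list | grep -qE "^${OT_CONDA_ENV}[[:space:]]"; then
    echo "[OneTrainer] Conda environment \"${OT_CONDA_ENV}\" exists."
fi
```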

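And a minimal sketch of the stricter venv check (again assuming the `OT_PYTHON_VENV` variable described below; the quoting keeps paths with spaces working on Bash 3.2+):

```bash
OT_PYTHON_VENV="${OT_PYTHON_VENV:-.venv}"
# A directory alone isn't proof of a venv; the bin/activate file is.
if [ -f "${OT_PYTHON_VENV}/bin/activate" ]; then
    echo "[OneTrainer] Using Python Venv environment in \"${OT_PYTHON_VENV}\"..."
fi
```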

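The GPU detection itself could be sketched like this (the function name and requirements file names are placeholders of mine, not the repository's actual names):

```bash
detect_platform_requirements() {
    # NVIDIA is checked first, so that dedicated NVIDIA GPUs win in
    # multi-GPU systems (see the follow-up fix later in this thread).
    if command -v nvidia-smi >/dev/null 2>&1 || command -v nvcc >/dev/null 2>&1; then
        echo "requirements-nvidia.txt"   # placeholder file name
    else
        echo "requirements-amd.txt"      # placeholder file name
    fi
}

req_file="$(detect_platform_requirements)"
```
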
New features in the new launch script framework:

  • All code is unified into a single library file, `lib.include.sh`, which is intentionally marked as non-executable (since it's only used by other scripts). There is no longer any fragile code duplication anywhere.

  • All shell scripts are now only a few lines of code, as they import the library to achieve their tasks effortlessly. This also makes it incredibly easy to create additional shell scripts for the other OneTrainer tools, if desired.

  • The new library is written from the ground up to use modern best-practices and shell functions, as a modular and easily extensible framework for any future project requirements.

  • All script output is now clearly prefixed with "[OneTrainer]" to create visible separation between random third-party log output and the lines of text that come from OneTrainer's shell scripts.

  • The commands that we execute are now visibly displayed to the user, so that they can see exactly what the launch scripts are doing. This helps users and developers alike, by producing better action logs. (A sketch of this print-and-verify pattern follows this feature list.)

  • The pip handling is improved to now always use `pip` as a Python module (`python -m pip`), thus getting rid of the unreliable `pip` binary.

  • Before installing any requirements, we now always upgrade `pip` and `setuptools` to the newest versions, which often contain bug fixes. This change ensures the smoothest possible dependency installations.

  • Environment variable handling has been completely overhauled, using best-practices for variable names, such as always using ALL_CAPS naming patterns and having a unique prefix to separate them from other variables. They are now all prefixed by `OT_` to avoid the risk of name clashes with system variables.

  • All important features of the scripts are now configurable via environment variables (instead of having to edit the script files), all of which have new and improved defaults as well:

    • `OT_CONDA_CMD`: Sets a custom Conda command or an absolute path to the binary (useful when it isn't in the user's `PATH`). If nothing is provided, we detect and use `CONDA_EXE`, which is a variable that's set by Conda itself and always points at the user's installed Conda binary.

    • `OT_CONDA_ENV`: Sets the name of the Conda environment. Now defaults to the clear and purposeful "onetrainer", since "ot" was incredibly generic and could clash with people's existing Conda environments.

    • `OT_PYTHON_CMD`: Sets the host's Python executable, which is used for creating the Python Venvs. This setting is mostly useless, since the default `python` is correct for the host in pretty much 100% of all cases, but hey, it doesn't hurt to let people configure it.

    • `OT_PYTHON_VENV`: Sets the name (or absolute/relative path) of the Python Venv, and now defaults to `.venv` (instead of `venv`), which is the standard practice for naming venv directories. Furthermore, the new code fully supports spaces in the path, which is especially useful when the venv is on another disk, such as `OT_PYTHON_VENV="/home/user/My Projects/Envs/onetrainer"`, which is now a completely valid environment path.

    • `OT_PREFER_VENV`: If set to "true" (defaults to "false"), Conda will be ignored even if it exists on the system, and Python Venv will be used instead. This ensures that people who use `pyenv` (to choose which Python version to run on the host) can now easily set up their desired Python Venv environments, without having to hack the launch scripts.

    • `OT_CUDA_LOWMEM_MODE`: If set to "true" (defaults to "false"), it enables aggressive garbage collection in PyTorch to help with low-memory GPUs. The variable name is now very clear.

    • `OT_PLATFORM_REQUIREMENTS`: Allows the user to override which platform-specific requirements.txt file they want to install. Defaults to "detect", which automatically detects whether you have an AMD or NVIDIA GPU. But people with multi-GPU systems can use this setting to force a specific GPU acceleration framework.

    • `OT_SCRIPT_DEBUG`: If set to "true" (defaults to "false"), it enables debug printing. Currently, there's no debug printing in the scripts, but there's a `print_debug` shell function which uses this variable and only prints to the screen if debugging is enabled. This ensures that debugging can easily be activated by script developers in the future.

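As a rough illustration of the print-and-verify pattern behind the `[OneTrainer] +` lines mentioned above (the function name is hypothetical; the real `lib.include.sh` may structure this differently):

```bash
run_cmd() {
    echo "[OneTrainer] + $*"   # display the exact command before running it
    "$@" || {
        echo "[OneTrainer] Command failed: $*" >&2
        exit 1                 # abort immediately on any error
    }
}

# Example usage:
run_cmd python -m pip install --upgrade --upgrade-strategy eager pip setuptools
```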

Closes Pull Requests: #474 and #466 and Issues: #417

@Arcitec force-pushed the new-launch-framework branch 4 times, most recently from 161cb4c to 45f86a3 (Compare, September 26, 2024 15:44)
@Arcitec (Contributor Author) commented Sep 26, 2024

Since I had some time left over while waiting on this, I implemented a few more improvements:

Added a Mac/Linux launcher script for custom CLI commands.

This introduces `run-cmd.sh`, an intelligent wrapper which automatically runs the desired OneTrainer CLI script with the given command-line arguments. It takes care of configuring the correct Python environment and finally ensures that users will have a safe way to run custom commands.

Example usage:

`./run-cmd.sh train --config-path <path to your config>`

Added detailed documentation for the new launcher scripts.

If someone edits the scripts in the future, they will need to ensure that the documentation stays up-to-date. But since the new framework is very flexible and robust, and Conda/Venv has a stable API which doesn't change over time, there's probably never going to be any need to change these scripts in the future.

@Arcitec force-pushed the new-launch-framework branch from c54d180 to f9962de (Compare, September 26, 2024 16:03)
@Arcitec force-pushed the new-launch-framework branch 2 times, most recently from 1309508 to 148b1b5 (Compare, September 26, 2024 17:48)
@Arcitec force-pushed the new-launch-framework branch from 148b1b5 to e5c1ab9 (Compare, September 26, 2024 18:00)
@Arcitec (Contributor Author) commented Sep 26, 2024

Don't mind the force-pushed changes. I've done some small revisions to the documentation for clarity, which can be seen in the "Compare" links above.

I always want to ensure that beginners can understand everything, since that cuts down on the number of support tickets. :)

@Arcitec (Contributor Author) commented Sep 27, 2024

Thanks to Calamdor on Discord for discovering that multi-GPU systems should prefer NVIDIA GPU acceleration by default, since that's always expected to be the stronger GPU. The following fix has now been added:

Prioritize dedicated NVIDIA GPUs in multi-GPU systems

This fixes the issue where people may have integrated AMD graphics in their CPU along with a separate, dedicated NVIDIA GPU. By prioritizing NVIDIA, we ensure that the most likely dedicated GPU will be chosen in that scenario.

@Arcitec (Contributor Author) commented Sep 27, 2024

Calamdor on Discord has now confirmed the following systems:

  • Microsoft WSL2 (Linux on Windows)
  • PopOS Linux with NVIDIA GPU in a multi-GPU system

This is in addition to my thorough tests where I went through every action multiple times with both environment backends (Conda and Venv):

  • Fedora Workstation 40 (Linux, Bash 5)
  • macOS (Bash 3)

@Arcitec (Contributor Author) commented Sep 28, 2024

We just got confirmation from habshi on Discord:

Sorry, I forgot to say -- I tested this on Linux yesterday, and it worked beautifully. Mac fails for reasons unrelated to this PR -- bitsandbytes import error.

Yeah, the bitsandbytes library needs a rewrite for Macs; it's unrelated to this PR.

So that means we have the following extra confirmations:

  • Linux
  • macOS

That's 3 people across 6 operating environments, including Microsoft's WSL (was great to see that tested). If we've had enough confirmations now, we can go ahead with the merge. :)

@ForgetfulWasAria commented

I just tried on my Arch derivative, EndeavourOS, and got a failure. I've installed Python 3.10 and 3.11 from the AUR.

[aria@endeavouros OneTrainer]$ ./start-ui.sh 
[OneTrainer] Using Python Venv environment in ".venv"...
[OneTrainer] + python scripts/util/version_check.py 3 3.11
Error: Your Python version is too high: 3.12.5 (main, Aug  9 2024, 08:20:41) [GCC 14.2.1 20240805]. Must be >= 3 and < 3.11.
[OneTrainer] Solutions: Either install the required Python version via pyenv (https://github.com/pyenv/pyenv) and set the project directory's Python version with "pyenv install <version>" followed by "pyenv local <version>", or install Miniconda if you prefer that we automatically manage everything for you (https://docs.anaconda.com/miniconda/). Remember to manually delete any previous Venv or Conda environment which was created with a different Python version.
[aria@endeavouros OneTrainer]$ ls /usr/bin/python*
/usr/bin/python      /usr/bin/python3.10-config  /usr/bin/python3.12         /usr/bin/python-argcomplete-check-easy-install-script
/usr/bin/python3     /usr/bin/python3.11         /usr/bin/python3.12-config  /usr/bin/python-config
/usr/bin/python3.10  /usr/bin/python3.11-config  /usr/bin/python3-config

Honestly, I don't know if you'd want to bother with this. Anyone foolish enough to use Arch should know what they're doing. Debian Bookworm ships 3.11 and Fedora has 3.10, so it's mostly Arch and the less popular distros.

Would checking for `python3.10` in PATH work with pyenv as well?

@Arcitec (Contributor Author) commented Sep 28, 2024

@ForgetfulWasAria Hey thanks for the feedback. That's a user error and the error explains what's going on. :)

Error: Your Python version is too high: 3.12.5 (main, Aug 9 2024, 08:20:41) [GCC 14.2.1 20240805]. Must be >= 3 and < 3.11.

[OneTrainer] Solutions: Either install the required Python version via pyenv (https://github.com/pyenv/pyenv) and set the project directory's Python version with "pyenv install <version>" followed by "pyenv local <version>", or install Miniconda if you prefer that we automatically manage everything for you (https://docs.anaconda.com/miniconda/). Remember to manually delete any previous Venv or Conda environment which was created with a different Python version.

It means that your default python command is running Python 3.12.5.

You have these solutions:

  • Install Miniconda. The scripts will then detect Conda on your system and use a Conda environment instead.
  • Install Pyenv and run the suggested commands to tell the OneTrainer directory to use Python 3.10. This makes Python 3.10 permanent for OneTrainer regardless of what else is on the host system. It works by installing hooks that automatically redirect `python` and `pip` to the appropriate binary for the current directory you're running in, so it lets you set per-directory Python versions. It's the recommended way to manage multiple Python versions on Linux, because it gives total control and a reliable Python version, with each being nicely isolated from the host's packages, and there's never any risk that the host's system updates would uninstall that Python version. Another nice aspect is that you only need to run the `pyenv local 3.10` command once in the OneTrainer directory, and it will then permanently use that version for all commands (the exact commands are sketched after this list).
  • Alternatively, you can try this: `env OT_PYTHON_CMD="python3.10" ./start-ui.sh`, which will tell it to use the system's `python3.10` binary (which is visible in the list you showed me) when creating the Python Venv. In fact, I would be very interested to see your test result with that trick! After the `.venv` has been created with 3.10, I suspect that the commands will all run without needing that variable, since the `.venv` itself will be linked to `python3.10`. [1] (Edit: This trick works perfectly. See the next message.)
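
To make the pyenv route concrete (filling in 3.10 as the example version, per the discussion above):

```bash
pyenv install 3.10   # build and install the latest CPython 3.10.x
pyenv local 3.10     # pin the OneTrainer directory to that version
rm -rf .venv         # delete the venv that was created with the wrong version
./start-ui.sh        # the venv is recreated with Python 3.10
```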

In either case, you will need to run `rm -rfv .venv` to delete the old Venv that got created with Python 3.12. :)


[1] For example, here's what I see when forcing a venv to be created with my host's `python3.12` binary:

$ python3.12 -m venv .venv

$ ls -l .venv/bin 
total 48
-rw-r--r--. 1 johnny johnny 2040 Sep 28 17:00 activate
-rw-r--r--. 1 johnny johnny  920 Sep 28 17:00 activate.csh
-rw-r--r--. 1 johnny johnny 2199 Sep 28 17:00 activate.fish
-rw-r--r--. 1 johnny johnny 9033 Sep 28 17:00 Activate.ps1
-rwxr-xr-x. 1 johnny johnny  246 Sep 28 17:00 pip
-rwxr-xr-x. 1 johnny johnny  246 Sep 28 17:00 pip3
-rwxr-xr-x. 1 johnny johnny  246 Sep 28 17:00 pip3.12
lrwxrwxrwx. 1 johnny johnny   10 Sep 28 17:00 python -> python3.12
lrwxrwxrwx. 1 johnny johnny   10 Sep 28 17:00 python3 -> python3.12
lrwxrwxrwx. 1 johnny johnny   19 Sep 28 17:00 python3.12 -> /usr/bin/python3.12

As you can see, the third method should work for you too, since it will then link the venv's python binary to python3.10 (in your case).

I would always recommend Pyenv though. With Pyenv there's never any risk that the Python package gets removed by the host's system updates. But I suspect that Arch will keep the `python3.10` binary for years, so feel free to use the third method!

@Arcitec (Contributor Author) commented Sep 28, 2024

Alright, because I was curious, I can confirm that the third method works if you prefer to do that.

You only need to specify the OT_PYTHON_CMD on the first run. After that, the Python Venv is created with that Python version, and all subsequent commands will run with that version.

A run where I let it use my system's default python (3.12) and got the version error:

$ env OT_PREFER_VENV="true" ./start-ui.sh
[OneTrainer] Using Python Venv environment in ".venv"...
[OneTrainer] Creating Python Venv environment in ".venv"...
[OneTrainer] + python -m venv .venv
[OneTrainer] + python scripts/util/version_check.py 3 3.11
Error: Your Python version is too high: 3.12.6 (main, Sep  9 2024, 00:00:00) [GCC 14.2.1 20240801 (Red Hat 14.2.1-1)]. Must be >= 3 and < 3.11.
[OneTrainer] Solutions: Either install the required Python version via pyenv (https://github.com/pyenv/pyenv) and set the project directory's Python version with "pyenv install <version>" followed by "pyenv local <version>", or install Miniconda if you prefer that we automatically manage everything for you (https://docs.anaconda.com/miniconda/). Remember to manually delete any previous Venv or Conda environment which was created with a different Python version.

I remove the Python 3.12 Venv:

$ rm -rf .venv

Now I force it to use `python3.10` on my host when creating the Venv (it doesn't matter which command you run; it can be `install.sh` or any of the others; what matters is that you specify `OT_PYTHON_CMD` on the first run):

$ env OT_PREFER_VENV="true" OT_PYTHON_CMD="python3.10" ./start-ui.sh
[OneTrainer] Using Python Venv environment in ".venv"...
[OneTrainer] Creating Python Venv environment in ".venv"...
[OneTrainer] + python3.10 -m venv .venv
[OneTrainer] + python scripts/util/version_check.py 3 3.11
[OneTrainer] Installing requirements in active environment...
[OneTrainer] + python -m pip install --upgrade --upgrade-strategy eager pip setuptools
^CERROR: Operation cancelled by user

Now all subsequent runs will use the Python 3.10 venv, so you no longer need to specify the version. I personally still need to specify `OT_PREFER_VENV="true"` since I have Conda on the system, and the scripts always prefer Conda if it exists, since that's more noob-friendly:

$ env OT_PREFER_VENV="true" ./start-ui.sh 
[OneTrainer] Using Python Venv environment in ".venv"...
[OneTrainer] + python scripts/util/version_check.py 3 3.11
[OneTrainer] + python scripts/train_ui.py

So feel free to use the `OT_PYTHON_CMD="python3.10"` technique on your first run, and you'll be done! :) This tells it to use the host's Python 3.10 binary forever.

Important notice: For most people, I would highly recommend Conda (for absolute beginners) and Pyenv (requires slightly more setup but is super reliable). I don't recommend using the host's python binary since you never know if it will stay the same version or get a breaking update, or even get removed by a system update.

@Arcitec (Contributor Author) commented Sep 28, 2024

@Nerogar I'll add a small update to the docs to mention the host version override trick in more detail. After that, it's ready for merge.

@ForgetfulWasAria commented

I understand what's going on. The issue is that "bleeding edge" distros generally don't default to Python 3.10. Arch (and I think Gentoo) have packages that allow installing Python 3.10 with the binary name `python3.10`, and it might be worthwhile to add a check for that specific name along with `python3.11` as valid versions.

Personally, since the various AI packages all need different python versions, I tend to create the venv manually so that I know it has the needed python version.

So you could:

  • Add a check for the `python3.10` and `python3.11` binaries.
  • OR: Just assume that anyone using Arch/Gentoo should know how to install older Python versions.
    Both of these are reasonable, and the second is certainly less work :)

And I can also confirm that adding the environment variable does work.

Thanks!

@Arcitec (Contributor Author) commented Sep 28, 2024

it might be worthwhile to add a check for that specific name along with python3.11 as valid versions.

It's generally a bad and flaky idea to use the system's Python binaries, so that check will not be added. Conda and Pyenv are the only ways to guarantee a working Python version that won't randomly change or get deleted or have weird distro-specific tweaks.

But I will document the override trick in more detail now and mention that advanced users can use it. Thanks for trying it out! :)

I'll also add a written guide section that describes Conda and Pyenv in a bit more detail, rather than only mentioning them in the error message.

@Arcitec (Contributor Author) commented Sep 28, 2024

Added a Python version setup guide

This makes the setup instructions easier to follow, since they were previously only available as a short error message (which is displayed by the launch scripts whenever the user's Python version is incorrect).

The new section can be previewed here.

Implemented Conda upgrade prompts for old Python environments

We now instruct the users about how to upgrade their Conda environment to the required Python version, whenever their system contains an old environment.

Here's an example where I forced my onetrainer environment to use an incompatible Python version, to demonstrate the improved instructions in that scenario:

$ conda create -y -n onetrainer python=3.12
[...]


$ ./start-ui.sh 
[OneTrainer] Using Conda environment with name "onetrainer"...
[OneTrainer] + /home/johnny/.local/share/miniconda/bin/conda run -n onetrainer --no-capture-output python scripts/util/version_check.py 3 3.11
Error: Your Python version is too high: 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0]. Must be >= 3 and < 3.11.
ERROR conda.cli.main_run:execute(125): `conda run python scripts/util/version_check.py 3 3.11` failed. (See above for error)
[OneTrainer] Solution: Switch your Conda environment to the required Python version by deleting your old environment, and then run OneTrainer again.

To delete the outdated Conda environment, execute the following command:
"/home/johnny/.local/share/miniconda/bin/conda" remove -y -n "onetrainer" --all


$ "/home/johnny/.local/share/miniconda/bin/conda" remove -y -n "onetrainer" --all
[...]


$ ./start-ui.sh                                                                  
[OneTrainer] Using Conda environment with name "onetrainer"...
[OneTrainer] Creating Conda environment with name "onetrainer"...
[OneTrainer] + /home/johnny/.local/share/miniconda/bin/conda create -y -n onetrainer python==3.10
[...]

There are no breaking changes. We can go ahead with the merge. :) @Nerogar

@Arcitec (Contributor Author) commented Sep 28, 2024

@Nerogar Hold one moment. I noticed that Conda is a bit stupid. If you say "Python 3.10" then it installs 3.10.0, meaning it always uses the oldest .0 releases which are full of bugs. I will improve the Python version installer for Conda to always use the latest bugfix release of Python.

@Arcitec (Contributor Author) commented Sep 28, 2024

Always use the latest bugfix releases of Python for Conda environments

Previously, Conda always used the ".0" release of the desired Python version (such as "3.10.0"), which is always full of bugs.

We now specify that we want the latest bugfix/patch release of the required Python version.

People who use pyenv (instead of Conda) don't need to worry about this change, since pyenv always installs the latest bugfix releases by default.
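
For context, this is the difference between conda's exact and fuzzy version specifiers; the exact specifier the fix uses isn't shown in this thread, so treat these lines as illustrative:

```bash
conda create -y -n onetrainer python==3.10   # exact match: resolves to 3.10.0
conda create -y -n onetrainer python=3.10    # fuzzy match: latest available 3.10.x
```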


This is yet again a non-breaking change, so the merge is finally ready! :) @Nerogar

@Arcitec force-pushed the new-launch-framework branch from dd150ff to edebb64 (Compare, September 28, 2024 18:00)
@Arcitec (Contributor Author) commented Jan 17, 2025

I recently provided a deeper explanation of `pip install --upgrade --upgrade-strategy eager` in another discussion, and am reposting it here to make it easy to find in the future, since this is the pull request that brought in that feature, and is the only search result for "eager" in our tickets/PRs.

It's basically an explanation of why we do it, why it was necessary, and what it solves.


Historically, pip always upgraded to the newest versions that still satisfy the requirements.

But since many people install dependencies manually via `pip install numpy==1.0; pip install diffusers` etc., this led to users breaking their environments, because later installs of other packages would upgrade things that had been manually installed "as intentionally lower versions" earlier (so if diffusers in this example needed `numpy>=1.0`, the user's numpy would get upgraded when installing diffusers, even though they had previously requested exactly numpy 1.0).

So around 10 years ago, the decision was made to make pip "lazy by default". Nowadays, it keeps the currently installed package version, or installs outdated locally cached downloaded packages on fresh `pip install` calls, as long as what it already has locally satisfies the general requirement version range. This means that `pip install` by default doesn't upgrade anything, in most cases. Not even during fresh installs (if there's an older cached version of a package). In fact, the pip installer is so lazy that if the user manually installed `numpy==1.0` into their project, and a dependency needs `numpy>=1.2`, and they locally have 1.2.1 in the pip package download cache, but the internet has 1.3.0, then it will install the locally cached (outdated) 1.2.1 even during "fresh" installs.

(Other Python package managers don't have this problem at all, by the way. They usually keep a "pinned packages" history of exactly what has been manually installed in the past, and will update everything to the newest versions that satisfies all old, manual installs. But since pip is a basic package manager, they had to make their defaults reasonable for beginners.)

To still allow people to install actual package updates to fix bugs, they also added a custom `--upgrade-strategy eager` flag, for when you want to actually upgrade the dependency tree to the best possible state and know what you're doing. The correct way to use that flag is to provide it and a list of all dependencies and requirements.txt files at the same time, in a single command, so that the resolver can check the whole graph and get the best versions of everything, while always ensuring that every package has been satisfied. (The only thing that should be done separately is upgrades of pip's own core package tools: `pip` and `setuptools` itself, which should always be upgraded to their newest versions first, before all other commands, so that their bugfixes apply to the rest of the installs.)

That's the strategy we use. You can find it with `install --upgrade --upgrade-strategy eager` in `lib.include.sh`. And the same fix has been brought over to the Windows `update.bat` too.

So when users run our `./update.sh`, they get the freshest, best possible, most-fixed versions of all dependencies, while still ensuring that the whole dependency graph is happy. It's a true update, and gets the same versions as if a user did a totally fresh install without any local pip package cache or environment. Which means that when bugs are fixed in dependencies of our dependencies, we get those fixes when we update. :) It's also really important for us as developers (and for our users), since it means that we all use the same versions after updating, so that we don't all have different bugs in our OneTrainer installs.
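
Concretely, that strategy is the two-step pattern already visible in this thread's logs (the single requirements file below is a simplification; the real scripts also handle the platform-specific requirements file):

```bash
# 1) Upgrade pip's own tooling first, so its bugfixes apply to everything below.
python -m pip install --upgrade --upgrade-strategy eager pip setuptools

# 2) Upgrade the whole dependency tree in one resolver pass, eagerly taking the
#    newest versions that still satisfy requirements.txt.
python -m pip install --upgrade --upgrade-strategy eager -r requirements.txt
```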
