new Mac/Linux launch script framework: modular, extensible and robust
Verified on all relevant Bash versions:

- Linux: Bash 5.2.26(1), a very recent release.

- Mac: Bash 3.2.57(1), since Apple ships an outdated Bash due to the GPL license change in newer versions.

---

Fixes the following bugs and issues from the old scripts:

- There was no error checking whatsoever, so if a command failed, the old scripts just happily continued with the next line. The new framework verifies the result of every executed line of code, and exits if there's even a single error.

- Python version was being checked on the host and failed if the host's Python was wrong, instead of checking inside the Conda/Venv environment, which defeated the entire purpose of having a Conda/Venv in the first place. This check now verifies the actual environment, not the host.

- The Conda environment check was very flawed. It searched for `ot` anywhere in the output of `conda info --envs`, meaning that if the letters "ot" appeared anywhere, it happily assumed that the OneTrainer Conda environment exists. For example, `notcorrect   /home/otheruser/foo` would have incorrectly matched the old "ot" check. We now use a strict check instead, to ensure that the exact environment exists.

- The old scripts checked for CUDA by looking for a developer binary, `nvcc`, which doesn't exist in normal NVIDIA CUDA drivers, thereby failing to detect CUDA on all modern systems. It has now been revised to look for either `nvidia-smi` (normal drivers) or `nvcc` (CUDA developer tools) to detect NVIDIA users. We could even have removed `nvcc` entirely, but it didn't hurt to keep it.

- Conda was not detected at all if Conda's shell startup hook had already executed, since that hook shadows `conda` with a shell function rather than a binary, which made the `command -v conda` check fail. That has now been corrected to accurately detect Conda's path regardless of circumstances.

- The old method of launching Conda was absolutely nuts. It created a new sub-shell, sourced the `.bashrc` file to pretend to be an interactive user session, then ran `conda activate` followed by `python`. None of that was correct, and was extremely fragile (not to mention having zero error checking). The `conda activate` command is ONLY meant for user sessions, NOT for scripting, and its behavior is very unreliable. We now use the correct `conda run` shell scripting command instead!

- The old method for "reinstalling requirements.txt" was incorrect. All it did was run `pip install --force-reinstall`, which forces pip to reinstall its own old, outdated, cached versions of the packages from disk, even if they were already installed. So it wasted a LOT of time, and still only upgraded requirements if the on-disk versions no longer satisfied "requirements.txt" at all (such as when a minimum version constraint had been raised). It also never updated deeper dependencies in the chain: if "PyTorch" depends on "numpy", and "numpy" depends on "otherlibrary" without a version constraint, then "otherlibrary" was never updated, since pip treated the old on-disk library as "the user has that library and the version constraint hasn't become invalid, so keep their old version". Now all of that has been completely overhauled: We tell pip to eagerly upgrade every dependency to the latest versions that satisfy "requirements.txt", thereby ensuring that all libraries will be upgraded to the same versions as a fresh reinstall of "requirements.txt". A true upgrade. And it's also much, much faster, since it now only reinstalls libraries that have actually changed!

- The old scripts did not handle the working directory at all, which meant that the user had to manually `cd OneTrainer` before being able to run any of the shell scripts. This has now been fixed so that the working directory is always the project directory, so that all resources can be found.

- All of the old checks for executable binaries, venv directories, etc. used a mixture of a few modern and mostly very outdated Bash techniques, and were therefore very fragile. For example, if the `command -v` lookup for a binary returned a path with spaces, the old scripts failed to find that binary at all.

- Previous checks for the existence of a venv only looked for the directory, which could easily give false positives. We now check for the venv's `bin/activate` file instead, to be sure that the user's given venv path is truly a venv.

- The old Python version check was very flimsy, executing two Python commands and checking each version component one by one in unreliable Bash code, and then printing two duplicated, subtly different error messages, instead of just checking both at once. This has now been completely overhauled to introduce a version check utility script (compatible with Python 2+), which takes the "minimum Python version" and "too high version" requirements and then verifies that the Python interpreter conforms to the desired version range. It supports `MAJOR`, `MAJOR.MINOR` and `MAJOR.MINOR.PATCH` version specifiers, to give developers complete flexibility to specify exactly which Python version OneTrainer needs. The Windows batch scripts should definitely be revised to use the same utility script. Lastly, we only print a single, unified and improved error message now.

- The previous version-check error message recommended the huge, 3+ GB Anaconda distribution, which contains around 2000 pre-installed scientific libraries, when Miniconda is much better: Miniconda is just the official package manager, which installs exactly what you need on-demand instead of slowly pre-installing tons of bloat that you don't need. The error message has also been improved to describe how to use `pyenv` to achieve a valid Python Venv environment without needing to use Anaconda at all.
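The new version-check utility itself is not among the diffs shown below (only 3 of the 6 changed files appear on this page), so here is a hedged sketch of how a `scripts/util/version_check.py` along those lines could work. All names and implementation details are illustrative assumptions, not the actual committed file; it only follows the behavior described above (minimum and too-high bounds, `MAJOR`/`MAJOR.MINOR`/`MAJOR.MINOR.PATCH` specifiers, Python 2+ compatible):

```python
# Illustrative sketch only; the real scripts/util/version_check.py may differ.
# Kept compatible with Python 2+, per the commit message.
import sys


def parse_version(spec):
    # Accepts "3", "3.1" or "3.1.5" and returns a tuple of ints, e.g. (3, 1).
    return tuple(int(part) for part in spec.split("."))


def in_range(current, minimum, too_high):
    # Compare only as many components as each specifier provides, so that a
    # minimum of "3" accepts any 3.x, while a limit of "3.11" rejects 3.11+.
    if tuple(current[:len(minimum)]) < minimum:
        return False
    if tuple(current[:len(too_high)]) >= too_high:
        return False
    return True


if __name__ == "__main__" and len(sys.argv) == 3:
    minimum = parse_version(sys.argv[1])
    too_high = parse_version(sys.argv[2])
    if not in_range(tuple(sys.version_info[:3]), minimum, too_high):
        sys.stderr.write(
            "Error: Python %s does not satisfy >= %s and < %s.\n"
            % (".".join(str(c) for c in sys.version_info[:3]),
               sys.argv[1], sys.argv[2])
        )
        sys.exit(1)
```

Exiting non-zero on failure is what lets the shell library's `set -e`-guarded caller abort the launch with a single unified error message.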

---

New features in the new launch script framework:

- All code is unified into a single library file, `lib.include.sh`, which is intentionally marked as non-executable (since it's only used by other scripts). There is no longer any fragile code duplication anywhere.

- All shell scripts are now only a few lines of code, as they import the library to achieve their tasks effortlessly. This also makes it incredibly easy to create additional shell scripts for the other OneTrainer tools, if desired.

- The new library is written from the ground up to use modern best-practices and shell functions, as a modular and easily extensible framework for any future project requirements.

- All script output is now clearly prefixed with "[OneTrainer]" to create visible separation between random third-party log output and the lines of text that come from OneTrainer's shell scripts.

- The commands that we execute are now visibly displayed to the user, so that they can see exactly what the launch scripts are doing. This helps users and developers alike, by producing better action logs.

- The pip handling is improved to always invoke pip as a Python module (`python -m pip`), getting rid of the unreliable standalone `pip` binary.

- Before installing any requirements, we now always upgrade `pip` and `setuptools` to the newest versions, which often contain bug fixes. This change ensures the smoothest possible dependency installations.

- Environment variable handling has been completely overhauled, using best-practices for variable names, such as always using ALL_CAPS naming patterns and having a unique prefix to separate them from other variables. They are now all prefixed by `OT_` to avoid the risk of name clashes with system variables.

- All important features of the scripts are now configurable via environment variables (instead of having to edit the script files), all of which have new and improved defaults as well:

  * `OT_CONDA_CMD`: Sets a custom Conda command or an absolute path to the binary (useful when it isn't in the user's `PATH`). If nothing is provided, we detect and use `CONDA_EXE`, which is a variable that's set by Conda itself and always points at the user's installed Conda binary.

  * `OT_CONDA_ENV`: Sets the name of the Conda environment. Now defaults to the clear and purposeful "onetrainer", since "ot" was incredibly generic and could clash with people's existing Conda environments.

  * `OT_PYTHON_CMD`: Sets the host's Python executable, which is used for creating the Python Venvs. This setting is mostly useless, since the default `python` is correct for the host in pretty much 100% of all cases, but hey, it doesn't hurt to let people configure it.

  * `OT_PYTHON_VENV`: Sets the name (or absolute/relative path) of the Python Venv, and now defaults to `.venv` (instead of `venv`), which is standard practice for naming venv directories. Furthermore, the new code fully supports spaces in the path, which is especially useful when the venv is on another disk, such as `OT_PYTHON_VENV="/home/user/My Projects/Envs/onetrainer"`, which is now a completely valid environment path.

  * `OT_PREFER_VENV`: If set to "true" (defaults to "false"), Conda will be ignored even if it exists on the system, and Python Venv will be used instead. This ensures that people who use `pyenv` (to choose which Python version to run on the host) can now easily set up their desired Python Venv environments, without having to hack the launch scripts.

  * `OT_CUDA_LOWMEM_MODE`: If set to "true" (defaults to "false"), it enables aggressive garbage collection in PyTorch to help with low-memory GPUs. The variable name is now very clear.

  * `OT_PLATFORM_REQUIREMENTS`: Allows the user to override which platform-specific requirements.txt file they want to install. Defaults to "detect", which automatically detects whether you have an AMD or NVIDIA GPU. But people with multi-GPU systems can use this setting to force a specific GPU acceleration framework.

  * `OT_SCRIPT_DEBUG`: If set to "true" (defaults to "false"), it enables debug printing. Currently, there's no debug printing in the scripts, but there's a `print_debug` shell function which uses this variable and only prints to the screen if debugging is enabled. This ensures that debugging can easily be activated by script developers in the future.
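All of these overridable defaults rely on Bash's `${VAR:-default}` parameter expansion, as seen in `lib.include.sh` below. A minimal standalone sketch of the pattern (the variable and values here just mirror the `OT_CONDA_ENV` default):

```shell
#!/usr/bin/env bash
# Demonstrates the "environment override with default" pattern used for the
# OT_ variables: the user's environment value wins, otherwise the default applies.
unset OT_CONDA_ENV
echo "${OT_CONDA_ENV:-onetrainer}"   # unset -> prints the default: onetrainer

OT_CONDA_ENV="my-custom-env"
echo "${OT_CONDA_ENV:-onetrainer}"   # set -> prints the override: my-custom-env
```

In practice this means users configure the scripts entirely from the command line, e.g. `OT_PREFER_VENV=true ./install.sh`, without editing any files.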
Arcitec committed Sep 26, 2024
1 parent e60e583 commit 45f86a3
Showing 6 changed files with 365 additions and 178 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,4 +1,6 @@
 .idea
+.python-version
+/.venv*
 /venv*
 /debug*
 /workspace*
68 changes: 4 additions & 64 deletions install.sh
@@ -1,67 +1,7 @@
-#!/bin/bash
+#!/usr/bin/env bash
 
-# Let user specify python and venv directly.
-if [[ -z "${python_cmd}" ]]; then
-    python_cmd="python"
-fi
-if [[ -z "${python_venv}" ]]; then
-    python_venv=venv
-fi
+set -e
 
-#change the environment name for conda to use
-conda_env=ot
-#change the environment name for python to use (only needed if Anaconda3 or miniconda is not installed)
+source "${BASH_SOURCE[0]%/*}/lib.include.sh"
 
-if [ -e /dev/kfd ]; then
-    PLATFORM_REQS=requirements-rocm.txt
-elif [ -x "$(command -v nvcc)" ]; then
-    PLATFORM_REQS=requirements-cuda.txt
-else
-    PLATFORM_REQS=requirements-default.txt
-fi
-
-if ! [ -x "$(command -v ${python_cmd})" ]; then
-    echo 'error: python not installed or found!'
-elif [ -x "$(command -v ${python_cmd})" ]; then
-    major=$(${python_cmd} -c 'import platform; major, minor, patch = platform.python_version_tuple(); print(major)')
-    minor=$(${python_cmd} -c 'import platform; major, minor, patch = platform.python_version_tuple(); print(minor)')
-
-    #check major version of python
-    if [[ "$major" -eq "3" ]];
-    then
-        #check minor version of python
-        if [[ "$minor" -le "10" ]];
-        then
-            if ! [ -x "$(command -v conda)" ]; then
-                echo 'conda not found; python version correct; use native python'
-                if ! [ -d $python_venv ]; then
-                    ${python_cmd} -m venv $python_venv
-                fi
-                source $python_venv/bin/activate
-                if [[ -z "$VIRTUAL_ENV" ]]; then
-                    echo "warning: No VIRTUAL_ENV set. exiting."
-                else
-                    ${python_cmd} -m pip install -r requirements-global.txt -r $PLATFORM_REQS
-                fi
-            elif [ -x "$(command -v conda)" ]; then
-                #check for venv
-                if conda info --envs | grep -q ${conda_env};
-                then
-                    bash --init-file <(echo ". \"$HOME/.bashrc\"; conda activate $conda_env; ${python_cmd} -m pip install -r requirements-global.txt -r $PLATFORM_REQS")
-                else
-                    conda create -y -n $conda_env python==3.10;
-                    bash --init-file <(echo ". \"$HOME/.bashrc\"; conda activate $conda_env; ${python_cmd} -m pip install -r requirements-global.txt -r $PLATFORM_REQS")
-                fi
-            fi
-        else
-            echo 'error: wrong python version installed:'$major'.'$minor
-            echo 'OneTrainer requires the use of python 3.10, please refer to the anaconda project to setup a virtual environment with that version. https://anaconda.org/anaconda/python'
-        fi
-    else
-        echo 'error: wrong python version installed:'$major'.'$minor
-        echo 'OneTrainer requires the use of python 3.10, either install python3 on your system or refer to the anaconda project to setup a virtual environment with that version. https://anaconda.org/anaconda/python'
-    fi
-fi
-
-#create workdirs
-#TODO
+prepare_runtime_environment
303 changes: 303 additions & 0 deletions lib.include.sh
@@ -0,0 +1,303 @@
#!/usr/bin/env bash

set -e

# Detect absolute path to the directory where "lib.include.sh" resides.
export SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"

# Guard against including the library multiple times.
readonly SCRIPT_DIR

# Ensure that all scripts change their working dir to the root of the project.
cd -- "${SCRIPT_DIR}"

# User-configurable environment variables.
# IMPORTANT: Don't modify the code below! Pass these variables via the environment!
# NOTE: The "OT_PYTHON_VENV" is always created relative to "SCRIPT_DIR" unless
# an absolute ("/home/foo/venv") or relative-traversal ("../venv") path is given.
# NOTE: The Conda detection prioritizes the user-provided value, otherwise the
# value of "$CONDA_EXE" (the env variable set by Conda's shell startup script),
# or lastly the "conda" binary (from PATH) as final fallback. We MUST use this
# order, otherwise we will fail to detect Conda if its startup script has executed,
# since their script shadows "conda" as a shell-function instead of a binary!
export OT_CONDA_CMD="${OT_CONDA_CMD:-${CONDA_EXE:-conda}}"
export OT_CONDA_ENV="${OT_CONDA_ENV:-onetrainer}"
export OT_PYTHON_CMD="${OT_PYTHON_CMD:-python}"
export OT_PYTHON_VENV="${OT_PYTHON_VENV:-.venv}"
export OT_PREFER_VENV="${OT_PREFER_VENV:-false}"
export OT_CUDA_LOWMEM_MODE="${OT_CUDA_LOWMEM_MODE:-false}"
export OT_PLATFORM_REQUIREMENTS="${OT_PLATFORM_REQUIREMENTS:-detect}"
export OT_SCRIPT_DEBUG="${OT_SCRIPT_DEBUG:-false}"

# Internal environment variables.
# NOTE: Version check supports "3", "3.1" and "3.1.5" specifier formats.
export OT_PYTHON_VERSION_MINIMUM="3"
export OT_PYTHON_VERSION_TOO_HIGH="3.11"
export OT_CONDA_USE_PYTHON_VERSION="3.10"
export OT_MUST_INSTALL_REQUIREMENTS="false"

# Force PyTorch to use fallbacks on Mac systems.
if [[ "$(uname)" == "Darwin" ]]; then
    export PYTORCH_ENABLE_MPS_FALLBACK="1"
fi

# Change PyTorch memory allocation to reduce CUDA out-of-memory situations.
if [[ "${OT_CUDA_LOWMEM_MODE}" == "true" ]]; then
    export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.6,max_split_size_mb:128"
fi

# Utility functions.
function print {
    printf "[OneTrainer] %s\n" "$*"
}

function print_error {
    printf "Error: %s\n" "$*" >&2
}

function print_debug {
    if [[ "${OT_SCRIPT_DEBUG}" == "true" ]]; then
        print "$*"
    fi
}

function regex_escape {
    sed 's/[][\.|$(){}?+*^]/\\&/g' <<<"$*"
}

# Checks if a command exists and is executable.
function can_exec {
    if [[ -z "$1" ]]; then
        print_error "can_exec requires 1 argument."
        return 1
    fi

    if local full_path="$(command -v "$1" 2>/dev/null)"; then
        if [[ ! -z "${full_path}" ]] && [[ -x "${full_path}" ]]; then
            return 0
        fi
    fi

    return 1
}

# Python command wrappers.
function run_python {
    print "+ ${OT_PYTHON_CMD} $*"
    "${OT_PYTHON_CMD}" "$@"
}

function run_pip {
    run_python -m pip "$@"
}

function run_venv {
    run_python -m venv "$@"
}

function has_python {
    can_exec "${OT_PYTHON_CMD}"
}

function has_python_venv {
    [[ -f "${OT_PYTHON_VENV}/bin/activate" ]]
}

function create_python_venv {
    print "Creating Python Venv environment in \"${OT_PYTHON_VENV}\"..."
    run_venv "${OT_PYTHON_VENV}"
    export OT_MUST_INSTALL_REQUIREMENTS="true"
}

function ensure_python_venv_exists {
    if ! has_python_venv; then
        create_python_venv
    fi
}

function activate_python_venv {
    # NOTE: This rewrites PATH to make all subsequent Python commands prefer
    # to use the venv's binaries instead. You should only execute this once!
    source "${OT_PYTHON_VENV}/bin/activate"

    # NOTE: Sanity check just to ensure that the activate-script was real.
    if [[ -z "${VIRTUAL_ENV}" ]]; then
        print_error "Something went wrong when activating the Python Venv in \"${OT_PYTHON_VENV}\"."
        exit 1
    fi

    # We must now force the Python binary name back to normal, since the venv's
    # own, internal Python binary is ALWAYS named "python".
    export OT_PYTHON_CMD="python"
}

# Conda command wrappers.
function run_conda {
    print "+ ${OT_CONDA_CMD} $*"
    "${OT_CONDA_CMD}" "$@"
}

__HAS_CONDA__CACHE=""
function has_conda {
    # We cache the result of this check to speed up further "has_conda" calls.
    if [[ -z "${__HAS_CONDA__CACHE}" ]]; then
        if can_exec "${OT_CONDA_CMD}"; then
            __HAS_CONDA__CACHE="true"
        else
            __HAS_CONDA__CACHE="false"
        fi
    fi

    [[ "${__HAS_CONDA__CACHE}" == "true" ]]
}

function has_conda_env {
    # NOTE: We perform a strict, case-sensitive check for the exact env name.
    run_conda info --envs | grep -q -- "^$(regex_escape "${OT_CONDA_ENV}")\b"
}

function create_conda_env {
    print "Creating Conda environment with name \"${OT_CONDA_ENV}\"..."
    run_conda create -y -n "${OT_CONDA_ENV}" "python==${OT_CONDA_USE_PYTHON_VERSION}"
    export OT_MUST_INSTALL_REQUIREMENTS="true"
}

function ensure_conda_env_exists {
    if ! has_conda_env; then
        create_conda_env
    fi
}

function run_in_conda_env {
    # NOTE: The "--no-capture-output" flag is necessary to print live to stdout/stderr.
    run_conda run -n "${OT_CONDA_ENV}" --no-capture-output "$@"
}

function run_python_in_conda_env {
    # NOTE: Python is ALWAYS called "python" inside Conda's environment.
    run_in_conda_env python "$@"
}

function run_pip_in_conda_env {
    run_python_in_conda_env -m pip "$@"
}

# Checks if the user hasn't requested Venv instead, and if Conda exists.
function should_use_conda {
    # NOTE: This check is intentionally not cached, to allow changing preference
    # during runtime. Furthermore, "has_conda" uses caching for speed already.
    [[ "${OT_PREFER_VENV}" != "true" ]] && has_conda
}

# Helpers which automatically run Python and Pip in either Conda or Venv/Host,
# depending on what's available on the system or user-preference overrides.
function activate_chosen_env {
    if should_use_conda; then
        print "Using Conda environment with name \"${OT_CONDA_ENV}\"..."
        ensure_conda_env_exists
    else
        print "Using Python Venv environment in \"${OT_PYTHON_VENV}\"..."
        ensure_python_venv_exists
        activate_python_venv
    fi
}

function run_python_in_active_env {
    if should_use_conda; then
        run_python_in_conda_env "$@"
    else
        run_python "$@"
    fi
}

function run_pip_in_active_env {
    if should_use_conda; then
        run_pip_in_conda_env "$@"
    else
        run_pip "$@"
    fi
}

# Determines which requirements.txt file we need to install.
function get_platform_requirements_path {
    # NOTE: The user can override our platform detection via the environment.
    local platform_reqs="${OT_PLATFORM_REQUIREMENTS}"
    if [[ "${platform_reqs}" == "detect" ]]; then
        if [[ -e "/dev/kfd" ]]; then
            platform_reqs="requirements-rocm.txt"
        elif can_exec nvidia-smi || can_exec nvcc; then
            # NOTE: Modern NVIDIA drivers don't have "nvcc" anymore.
            platform_reqs="requirements-cuda.txt"
        else
            platform_reqs="requirements-default.txt"
        fi
    fi

    if [[ -z "${platform_reqs}" ]] || [[ ! -f "${platform_reqs}" ]]; then
        print_error "Requirements file \"${platform_reqs}\" does not exist."
        return 1
    fi

    echo "${platform_reqs}"
}

# Installs the Global and Platform requirements into the active environment.
function install_requirements_in_active_env {
    # Ensure that we have the latest Python tools, and install the dependencies.
    # NOTE: The "eager" upgrade strategy is necessary for upgrading dependencies
    # when running in existing environments. It ensures that all libraries will
    # be upgraded to the same versions as a fresh reinstall of requirements.txt.
    print "Installing requirements in active environment..."
    run_pip_in_active_env install --upgrade pip setuptools
    run_pip_in_active_env install --upgrade --upgrade-strategy eager -r requirements-global.txt -r "$(get_platform_requirements_path)"
    export OT_MUST_INSTALL_REQUIREMENTS="false"
}

function install_requirements_in_active_env_if_necessary {
    if [[ "${OT_MUST_INSTALL_REQUIREMENTS}" != "false" ]]; then
        install_requirements_in_active_env
    fi
}

# Educates the user about the correct methods for installing Python or Conda.
function show_runtime_solutions {
    print "Solutions: Either install the required Python version via pyenv (https://github.com/pyenv/pyenv) and set the project directory's Python version with \"pyenv install <version>\" followed by \"pyenv local <version>\", or install Miniconda if you prefer that we automatically manage everything for you (https://docs.anaconda.com/miniconda/)."
}

# Ensures that Python or Conda exists on the host and can be executed.
function exit_if_no_runtime {
    # NOTE: If "should_use_conda" is true, we have a usable Conda.
    if ! should_use_conda && ! has_python; then
        print_error "Python command \"${OT_PYTHON_CMD}\" does not exist on your system."
        show_runtime_solutions
        exit 1
    fi
}

# Verifies that Python version is ">= minimum and < too high" in Conda/Venv/Host.
function exit_if_active_env_wrong_python_version {
    if ! run_python_in_active_env "scripts/util/version_check.py" "${OT_PYTHON_VERSION_MINIMUM}" "${OT_PYTHON_VERSION_TOO_HIGH}"; then
        show_runtime_solutions
        exit 1
    fi
}

# Performs the most important startup sanity checks and environment preparation.
function prepare_runtime_environment {
    # Ensure that the chosen Conda or Python runtime exists.
    exit_if_no_runtime

    # Create and activate the chosen environment.
    activate_chosen_env

    # Protect against outdated Python environments created with older versions.
    exit_if_active_env_wrong_python_version

    # If this is an upgrade, always ensure that we have the latest dependencies,
    # otherwise only install requirements if the environment was newly created.
    if [[ "$1" == "upgrade" ]]; then
        install_requirements_in_active_env
    else
        install_requirements_in_active_env_if_necessary
    fi
}
