[CICD] Add MetaX image build workflow and Fix containers #1211

Open
BrianPei wants to merge 18 commits into flagos-ai:main from BrianPei:PR-0528-MetaxBuild

@BrianPei
Contributor

Description

Add MetaX image build support and stabilize the related CI/test workflow by introducing MetaX-specific Dockerfiles, install scripts, and requirements, while fixing workflow input handling, Conda base environment behavior, requirements processing, and training log parsing.

Type of change

  • Infra/Build change (changes to CI/CD workflows or build scripts)
  • Bug fix
  • Code refactoring
  • New feature (non-breaking change which adds functionality)
  • Documentation change
  • Breaking change

Changes

  • Added MetaX build workflow and MetaX-specific Dockerfiles.
  • Added MetaX install scripts and requirements for image-based dependency setup.
  • Fixed workflow input handling and image fallback logic for test execution.
  • Fixed Conda base environment and requirements include handling in installer utilities.
  • Improved training and benchmark log parsing for MetaX functional tests.
  • Fixed CUDA and MetaX default container volumes.

Checklist

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in CI workflow setup steps
  • My changes generate no new warnings
  • I have tested my feature on the MetaX platform

Copilot AI review requested due to automatic review settings May 28, 2026 05:18
@BrianPei BrianPei requested a review from aoyulong as a code owner May 28, 2026 05:18
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds MetaX platform installation assets (scripts, requirements, Dockerfiles) and CI workflows, plus improves log metric parsing to handle ANSI/non-pipe formats.

Changes:

  • Add MetaX install scripts, environment defaults, and requirements for base/inference/train tasks.
  • Introduce MetaX Dockerfiles and a dedicated GitHub Actions workflow to build/test images.
  • Make metric extraction more robust by stripping ANSI escapes and parsing both pipe and non-pipe log formats.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 11 comments.

File | Description
tools/install/utils/retry_utils.sh | Creates filtered requirements temp files alongside the original requirements file to preserve relative -r paths.
tools/install/utils/pkg_utils.sh | Avoids treating the conda base environment as a named env path.
tools/install/metax/install_train.sh | Adds MetaX train installer for requirements + source dependency build (TransformerEngine-FL).
tools/install/metax/install_inference.sh | Adds MetaX inference installer for requirements.
tools/install/metax/install_base.sh | Adds MetaX base installer for requirements.
tools/install/metax/env.sh | Defines MetaX environment variables and PATH/LD_LIBRARY_PATH defaults.
tools/install/install_system.sh | Normalizes FLAGSCALE_ENV_NAME=base to empty to reuse conda base install location.
tools/install/install.sh | Allows overriding key paths via environment (conda/deps/downloads/venv).
tests/unit_tests/runner/test_check_results_parser.py | Adds unit tests for ANSI and non-pipe metric parsing.
tests/test_utils/runners/parse_benchmark_output.py | Improves parsing by stripping ANSI and using regex-based metric extraction.
tests/test_utils/runners/check_results.py | Improves parsing similarly and applies small formatting fixes.
requirements/metax/train.txt | Adds MetaX train requirements referencing MetaX base.
requirements/metax/inference.txt | Adds MetaX inference requirements referencing MetaX base.
requirements/metax/base.txt | Adds MetaX platform base requirements referencing common requirements.
docker/metax/Dockerfile.train | Adds MetaX train image build with install tooling invoked in image build.
docker/metax/Dockerfile.inference | Adds MetaX inference image build.
docker/metax/Dockerfile.all | Adds MetaX all-in-one task image build.
.github/workflows/build_image_metax.yml | Adds workflow to build/push MetaX images and run MetaX tests.
.github/workflows/build_image_cuda.yml | Updates runner labels/volume defaults to new infrastructure paths.
.github/workflows/all_tests_metax.yml | Ignores MetaX docker/workflow changes to prevent duplicate triggering.
.github/workflows/all_tests_common.yml | Makes checkout work when not in pull_request context by adding fallbacks.
.github/configs/metax.yml | Updates MetaX hardware config (C550), adds tar directory, updates volumes and quoting.
.github/configs/cuda.yml | Normalizes quoting/style for CUDA CI config values.
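
The retry_utils.sh entry above hinges on one detail: pip resolves a relative `-r` include against the directory of the requirements file that contains it, so a filtered copy written to a system temp directory would no longer find its includes. The following is a minimal Python sketch of that idea, not the shell implementation in this PR. The function name `write_filtered_requirements`, the `skip_prefixes` parameter, and the `.filtered-` file prefix are hypothetical and exist only for illustration.

```python
import os
import tempfile


def write_filtered_requirements(req_path, skip_prefixes):
    """Write a filtered copy of req_path next to the original and return its path.

    Hypothetical helper: lines whose text starts with any of skip_prefixes
    are dropped, everything else (including "-r" includes) is kept verbatim.
    """
    req_dir = os.path.dirname(os.path.abspath(req_path))
    prefixes = tuple(p.lower() for p in skip_prefixes)

    with open(req_path, encoding="utf-8") as src:
        kept = [
            line for line in src
            if not line.strip().lower().startswith(prefixes)
        ]

    # dir=req_dir places the copy beside the original file, so a relative
    # include such as "-r ./base.txt" still resolves from the same directory.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", prefix=".filtered-", dir=req_dir,
        delete=False, encoding="utf-8",
    ) as dst:
        dst.writelines(kept)
        return dst.name
```

The caller is responsible for deleting the returned file once the install step has finished, since `delete=False` leaves it on disk.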
Comments suppressed due to low confidence (1)

tests/test_utils/runners/parse_benchmark_output.py:1

  • The metric parsing logic was significantly changed (ANSI stripping + regex search across any line), but the added unit tests only cover tests/test_utils/runners/check_results.py. Add unit tests for tests/test_utils/runners/parse_benchmark_output.py as well (ANSI + pipe/non-pipe examples) to ensure benchmark parsing stays correct.
"""Parse host_0_localhost.output log file and produce benchmark.json.

Comment on lines +44 to +45
run_cmd -d $DEBUG bash -c "cd '$FLAGSCALE_DEPS/TransformerEngine-FL' && \
TE_FL_SKIP_CUDA=1 $pip_cmd install --root-user-action=ignore --no-build-isolation ." || return 1
Comment on lines +136 to +149
build:
  name: Build ${{ matrix.task }}
  needs: prepare
  runs-on: ${{ fromJson(inputs.runs_on || '["flagscale-metax-c550-gpu2-8c-256g"]') }}
  strategy:
    fail-fast: false
    matrix: ${{ fromJson(needs.prepare.outputs.matrix) }}
  outputs:
    train_tag: ${{ steps.export.outputs.train_tag }}
    inference_tag: ${{ steps.export.outputs.inference_tag }}
    all_tag: ${{ steps.export.outputs.all_tag }}
    train_tar: ${{ steps.export.outputs.train_tar }}
    inference_tar: ${{ steps.export.outputs.inference_tar }}
    all_tar: ${{ steps.export.outputs.all_tar }}
Comment on lines +226 to +231
- name: Export image tag for config update
  id: export
  run: |
    TASK="${{ matrix.task }}"
    echo "${TASK}_tag=${{ steps.meta.outputs.build_tag }}" >> $GITHUB_OUTPUT
    echo "${TASK}_tar=${{ steps.save_tar.outputs.tar_path }}" >> $GITHUB_OUTPUT
Comment on lines +121 to +134
- name: Set build parameters
  id: params
  run: |
    EVENT="${{ github.event_name }}"

    if [ "$EVENT" = "pull_request" ]; then
      echo "target=dev" >> $GITHUB_OUTPUT
      echo "push=true" >> $GITHUB_OUTPUT
      echo "no_cache=false" >> $GITHUB_OUTPUT
    else
      echo "target=${{ inputs.target || 'dev' }}" >> $GITHUB_OUTPUT
      echo "push=${{ inputs.push }}" >> $GITHUB_OUTPUT
      echo "no_cache=${{ inputs.no_cache }}" >> $GITHUB_OUTPUT
    fi
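
The branching in this step can be modelled in a few lines of Python to make its edge cases easier to see. This is only a sketch of the logic as shown in the excerpt: `resolve_build_params` is a hypothetical name, and `inputs` is assumed to be a dict of already-rendered string values, with a missing key standing in for an input the triggering event does not provide (which an expression renders as an empty string).

```python
def resolve_build_params(event_name, inputs):
    """Return the target/push/no_cache values the step would write to GITHUB_OUTPUT.

    Hypothetical model of the shell step: values are strings, as they would
    be after expression rendering in the workflow.
    """
    if event_name == "pull_request":
        # Pull requests ignore inputs entirely and use fixed values.
        return {"target": "dev", "push": "true", "no_cache": "false"}

    return {
        # Only target has an explicit fallback ('dev') in the excerpt.
        "target": inputs.get("target") or "dev",
        # push and no_cache have no fallback, so a missing input stays empty.
        "push": inputs.get("push", ""),
        "no_cache": inputs.get("no_cache", ""),
    }
```

Under these assumptions, an event that is neither `pull_request` nor one that supplies inputs yields empty `push` and `no_cache` values, so any later condition comparing them to `'true'` or `'false'` would match neither.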
Comment on lines +246 to +257
load_images:
  name: Load and push images
  needs: ['build', 'summary']
  runs-on: ${{ fromJson(inputs.runs_on || '["flagscale-metax-c550-gpu2-8c-256g"]') }}
  steps:
    - name: Load train image from tar and push
      run: |
        TAR="${{ needs.build.outputs.train_tar || needs.build.outputs.all_tar }}"
        TAG="${{ needs.build.outputs.train_tag || needs.build.outputs.all_tag }}"
        if [ -f "$TAR" ]; then
          sudo docker load -i "$TAR"
          sudo docker push "$TAG"
Comment on lines +360 to +361
'["/mnt/airs-business/cicd/docker_data:/home/gitlab-runner/data",
"/mnt/airs-business/cicd/docker_tokenizers:/home/gitlab-runner/tokenizers"]' }}
Comment on lines 49 to +53
  for line in lines:
      if "iteration" not in line:
          continue

-     parts = line.split("|")
-     for part in parts:
-         part = part.strip()
-         for key in metric_keys:
-             if part.startswith(key.rstrip(":")):
-                 match = re.search(r":\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)", part)
-                 if match:
-                     try:
-                         results[key].append(float(match.group(1)))
-                     except ValueError:
-                         continue
+     for key in metric_keys:
+         value = _extract_metric_value(line, key)
+         if value is not None:
+             results[key].append(value)
Comment on lines +10 to 29
ANSI_ESCAPE_RE = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")


def _extract_metric_value(line, key):
    """Extract a metric value from a log line, tolerating formatting variations."""
    cleaned_line = ANSI_ESCAPE_RE.sub("", line)
    pattern = re.compile(
        rf"{re.escape(key.rstrip(':'))}\s*:\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)",
        re.IGNORECASE,
    )
    match = pattern.search(cleaned_line)
    if not match:
        return None

    try:
        return float(match.group(1))
    except ValueError:
        return None
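
To see how this helper behaves on the log shapes the review mentions, here is a self-contained sketch that restates the same regex approach and runs it over an ANSI-coloured pipe-separated line and a plain non-pipe line. The names `extract_metric_value` and `collect_metrics` and the sample log lines are made up for illustration; real training logs may be formatted differently.

```python
import re

ANSI_ESCAPE_RE = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")
NUMBER = r"([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)"


def extract_metric_value(line, key):
    """Return the float following `key:` in line, or None if it is absent."""
    cleaned = ANSI_ESCAPE_RE.sub("", line)  # drop colour codes first
    pattern = rf"{re.escape(key.rstrip(':'))}\s*:\s*{NUMBER}"
    match = re.search(pattern, cleaned, re.IGNORECASE)
    return float(match.group(1)) if match else None


def collect_metrics(lines, metric_keys):
    """Gather per-key value lists from every line that mentions 'iteration'."""
    results = {key: [] for key in metric_keys}
    for line in lines:
        if "iteration" not in line:
            continue
        for key in metric_keys:
            value = extract_metric_value(line, key)
            if value is not None:
                results[key].append(value)
    return results


# Illustrative log lines (not taken from a real run).
sample_lines = [
    "\x1b[32m iteration 10/100 | lm loss: \x1b[1m2.345E+00\x1b[0m | grad norm: 1.5 |",
    "iteration 11/100 lm loss: 2.1 grad norm: 1.25",
    "saving checkpoint, lm loss: 9.9",  # no 'iteration', so it is skipped
]
metrics = collect_metrics(sample_lines, ["lm loss:", "grad norm:"])
```

One consequence of searching anywhere in the line is that a short key such as `loss:` also matches inside `lm loss:`, whereas the earlier `startswith` check on each pipe-separated part did not. Whether that matters depends on the metric keys actually in use.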


FLAGSCALE_DEPS="${FLAGSCALE_DEPS:-$FLAGSCALE_HOME/deps}"
REQ_FILE="$PROJECT_ROOT/requirements/metax/train.txt"

SRC_DEPS_LIST="transformer-engine"
Comment on lines +37 to +42
install_transformer_engine() {
    should_build_package "transformer-engine" || return 0
    set_step "Installing TransformerEngine-FL"
    mkdir -p "$FLAGSCALE_DEPS"
    retry_git_clone -d $DEBUG --depth 1 \
        "https://github.com/flagos-ai/TransformerEngine-FL.git" "$FLAGSCALE_DEPS/TransformerEngine-FL" "$RETRY_COUNT" || return 1