[CICD] Add MetaX image build workflow and Fix containers#1211
Open
BrianPei wants to merge 18 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds MetaX platform installation assets (scripts, requirements, Dockerfiles) and CI workflows, plus improves log metric parsing to handle ANSI/non-pipe formats.
Changes:
- Add MetaX install scripts, environment defaults, and requirements for base/inference/train tasks.
- Introduce MetaX Dockerfiles and a dedicated GitHub Actions workflow to build/test images.
- Make metric extraction more robust by stripping ANSI escapes and parsing both pipe and non-pipe log formats.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tools/install/utils/retry_utils.sh | Creates filtered requirements temp files alongside the original requirements file to preserve relative -r paths. |
| tools/install/utils/pkg_utils.sh | Avoids treating the conda base environment as a named env path. |
| tools/install/metax/install_train.sh | Adds MetaX train installer for requirements + source dependency build (TransformerEngine-FL). |
| tools/install/metax/install_inference.sh | Adds MetaX inference installer for requirements. |
| tools/install/metax/install_base.sh | Adds MetaX base installer for requirements. |
| tools/install/metax/env.sh | Defines MetaX environment variables and PATH/LD_LIBRARY_PATH defaults. |
| tools/install/install_system.sh | Normalizes FLAGSCALE_ENV_NAME=base to empty to reuse conda base install location. |
| tools/install/install.sh | Allows overriding key paths via environment (conda/deps/downloads/venv). |
| tests/unit_tests/runner/test_check_results_parser.py | Adds unit tests for ANSI and non-pipe metric parsing. |
| tests/test_utils/runners/parse_benchmark_output.py | Improves parsing by stripping ANSI and using regex-based metric extraction. |
| tests/test_utils/runners/check_results.py | Improves parsing similarly and applies small formatting fixes. |
| requirements/metax/train.txt | Adds MetaX train requirements referencing MetaX base. |
| requirements/metax/inference.txt | Adds MetaX inference requirements referencing MetaX base. |
| requirements/metax/base.txt | Adds MetaX platform base requirements referencing common requirements. |
| docker/metax/Dockerfile.train | Adds MetaX train image build with install tooling invoked in image build. |
| docker/metax/Dockerfile.inference | Adds MetaX inference image build. |
| docker/metax/Dockerfile.all | Adds MetaX all-in-one task image build. |
| .github/workflows/build_image_metax.yml | Adds workflow to build/push MetaX images and run MetaX tests. |
| .github/workflows/build_image_cuda.yml | Updates runner labels/volume defaults to new infrastructure paths. |
| .github/workflows/all_tests_metax.yml | Ignores MetaX docker/workflow changes to prevent duplicate triggering. |
| .github/workflows/all_tests_common.yml | Makes checkout work when not in pull_request context by adding fallbacks. |
| .github/configs/metax.yml | Updates MetaX hardware config (C550), adds tar directory, updates volumes and quoting. |
| .github/configs/cuda.yml | Normalizes quoting/style for CUDA CI config values. |
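
The `retry_utils.sh` row above relies on pip resolving nested `-r` includes relative to the requirements file that contains them, so a filtered copy only keeps working if it sits in the same directory as the original. The PR does this in shell; the sketch below illustrates the same idea in Python. The helper name `write_filtered_requirements` and the prefix-based filter are assumptions made for illustration, not the logic in `retry_utils.sh`.

```python
import os
import tempfile


def write_filtered_requirements(req_file, skip_prefixes):
    """Write a filtered copy of req_file into the same directory.

    Hypothetical helper: the name and the crude prefix filter are
    illustrative assumptions, not taken from the PR.
    """
    req_dir = os.path.dirname(os.path.abspath(req_file))
    # Keep the temp file next to the original so relative includes such as
    # "-r ./base.txt" still resolve from the same directory.
    fd, tmp_path = tempfile.mkstemp(prefix=".filtered-", suffix=".txt", dir=req_dir)
    with os.fdopen(fd, "w") as out, open(req_file) as src:
        for line in src:
            stripped = line.strip().lower()
            # Note: a prefix match on "torch" would also drop "torchvision".
            if any(stripped.startswith(p.lower()) for p in skip_prefixes):
                continue
            out.write(line)
    return tmp_path
```

The returned path can be passed to `pip install -r`; the caller is expected to delete the temporary file afterwards.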
Comments suppressed due to low confidence (1)
tests/test_utils/runners/parse_benchmark_output.py:1
- The metric parsing logic was significantly changed (ANSI stripping + regex search across any line), but the added unit tests only cover tests/test_utils/runners/check_results.py. Add unit tests for tests/test_utils/runners/parse_benchmark_output.py as well (ANSI + pipe/non-pipe examples) to ensure benchmark parsing stays correct.
"""Parse host_0_localhost.output log file and produce benchmark.json.
Comment on lines +44 to +45

    run_cmd -d $DEBUG bash -c "cd '$FLAGSCALE_DEPS/TransformerEngine-FL' && \
        TE_FL_SKIP_CUDA=1 $pip_cmd install --root-user-action=ignore --no-build-isolation ." || return 1
Comment on lines +136 to +149

    build:
      name: Build ${{ matrix.task }}
      needs: prepare
      runs-on: ${{ fromJson(inputs.runs_on || '["flagscale-metax-c550-gpu2-8c-256g"]') }}
      strategy:
        fail-fast: false
        matrix: ${{ fromJson(needs.prepare.outputs.matrix) }}
      outputs:
        train_tag: ${{ steps.export.outputs.train_tag }}
        inference_tag: ${{ steps.export.outputs.inference_tag }}
        all_tag: ${{ steps.export.outputs.all_tag }}
        train_tar: ${{ steps.export.outputs.train_tar }}
        inference_tar: ${{ steps.export.outputs.inference_tar }}
        all_tar: ${{ steps.export.outputs.all_tar }}
Comment on lines +226 to +231

    - name: Export image tag for config update
      id: export
      run: |
        TASK="${{ matrix.task }}"
        echo "${TASK}_tag=${{ steps.meta.outputs.build_tag }}" >> $GITHUB_OUTPUT
        echo "${TASK}_tar=${{ steps.save_tar.outputs.tar_path }}" >> $GITHUB_OUTPUT
Comment on lines +121 to +134

    - name: Set build parameters
      id: params
      run: |
        EVENT="${{ github.event_name }}"

        if [ "$EVENT" = "pull_request" ]; then
          echo "target=dev" >> $GITHUB_OUTPUT
          echo "push=true" >> $GITHUB_OUTPUT
          echo "no_cache=false" >> $GITHUB_OUTPUT
        else
          echo "target=${{ inputs.target || 'dev' }}" >> $GITHUB_OUTPUT
          echo "push=${{ inputs.push }}" >> $GITHUB_OUTPUT
          echo "no_cache=${{ inputs.no_cache }}" >> $GITHUB_OUTPUT
        fi
Comment on lines +246 to +257

    load_images:
      name: Load and push images
      needs: ['build', 'summary']
      runs-on: ${{ fromJson(inputs.runs_on || '["flagscale-metax-c550-gpu2-8c-256g"]') }}
      steps:
        - name: Load train image from tar and push
          run: |
            TAR="${{ needs.build.outputs.train_tar || needs.build.outputs.all_tar }}"
            TAG="${{ needs.build.outputs.train_tag || needs.build.outputs.all_tag }}"
            if [ -f "$TAR" ]; then
              sudo docker load -i "$TAR"
              sudo docker push "$TAG"
Comment on lines +360 to +361

    '["/mnt/airs-business/cicd/docker_data:/home/gitlab-runner/data",
    "/mnt/airs-business/cicd/docker_tokenizers:/home/gitlab-runner/tokenizers"]' }}
Comment on lines 49 to +53

     for line in lines:
    -    if "iteration" not in line:
    -        continue
    -
    -    parts = line.split("|")
    -    for part in parts:
    -        part = part.strip()
    -        for key in metric_keys:
    -            if part.startswith(key.rstrip(":")):
    -                match = re.search(r":\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)", part)
    -                if match:
    -                    try:
    -                        results[key].append(float(match.group(1)))
    -                    except ValueError:
    -                        continue
    +    for key in metric_keys:
    +        value = _extract_metric_value(line, key)
    +        if value is not None:
    +            results[key].append(value)
Comment on lines +10 to 29

    ANSI_ESCAPE_RE = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")


    def _extract_metric_value(line, key):
        """Extract a metric value from a log line, tolerating formatting variations."""
        cleaned_line = ANSI_ESCAPE_RE.sub("", line)
        pattern = re.compile(
            rf"{re.escape(key.rstrip(':'))}\s*:\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)",
            re.IGNORECASE,
        )
        match = pattern.search(cleaned_line)
        if not match:
            return None

        try:
            return float(match.group(1))
        except ValueError:
            return None
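
Because the helper uses `pattern.search` on the whole cleaned line, it helps to see which inputs the number pattern accepts and how a short key behaves. The sketch below mirrors the matching rule from the diff in a local function named `extract` (a name chosen here, not part of the PR) and runs it on invented inputs picked to probe the pattern rather than to mimic real logs.

```python
import re

ANSI_ESCAPE_RE = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")
NUMBER_RE = r"([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)"


def extract(line, key):
    """Mirror of the matching rule in _extract_metric_value from the diff."""
    cleaned = ANSI_ESCAPE_RE.sub("", line)
    pattern = re.compile(
        rf"{re.escape(key.rstrip(':'))}\s*:\s*{NUMBER_RE}", re.IGNORECASE
    )
    match = pattern.search(cleaned)
    return float(match.group(1)) if match else None


# Invented inputs (assumptions for illustration only).
samples = [
    ("lm loss: 1.052E+01", "lm loss:"),         # scientific notation
    ("LM Loss : 3", "lm loss:"),                # case and spacing differences
    ("lm loss: nan", "lm loss:"),               # non-numeric value
    ("lm loss: .5", "lm loss:"),                # no leading digit
    ("val loss: 9.0 | lm loss: 2.0", "loss:"),  # short key, first hit wins
]

for line, key in samples:
    print(f"{key!r:12} in {line!r:32} -> {extract(line, key)}")
```

With these inputs, `nan` and `.5` yield `None`, and the short key `loss:` returns the value of the first metric on the line whose name ends in `loss`. Whether either case matters depends on the metric keys and log formats actually in use.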

    FLAGSCALE_DEPS="${FLAGSCALE_DEPS:-$FLAGSCALE_HOME/deps}"
    REQ_FILE="$PROJECT_ROOT/requirements/metax/train.txt"

    SRC_DEPS_LIST="transformer-engine"
Comment on lines +37 to +42

    install_transformer_engine() {
        should_build_package "transformer-engine" || return 0
        set_step "Installing TransformerEngine-FL"
        mkdir -p "$FLAGSCALE_DEPS"
        retry_git_clone -d $DEBUG --depth 1 \
            "https://github.com/flagos-ai/TransformerEngine-FL.git" "$FLAGSCALE_DEPS/TransformerEngine-FL" "$RETRY_COUNT" || return 1
Description
Add MetaX image build support and stabilize the related CI/test workflow by introducing MetaX-specific Dockerfiles, install scripts, and requirements, while fixing workflow input handling, Conda base environment behavior, requirements processing, and training log parsing.
Type of change
Changes
Checklist