[CICD] Expand unit test coverage on CUDA #1210

Open: BrianPei wants to merge 10 commits into flagos-ai:main from BrianPei:PR-0528

@BrianPei
Contributor

Description

Adds broad unit tests for the runner, elastic, platforms, transformations, and smaller utility modules. Also stabilizes the CUDA and MetaX CI setup by fixing runner labels, mounted volumes, and coverage configuration.

Type of change

  • Infra/Build change (changes to CI/CD workflows or build scripts)
  • Bug fix
  • Code refactoring
  • New feature (non-breaking change which adds functionality)
  • Documentation change
  • Breaking change

Changes

  • Added extensive unit tests for runner, runner elastic, platforms, transformations, agent tool matching, and other small modules.
  • Improved runner and elastic test coverage for diagnostic, health check, monitor, launcher, and helper paths.
  • Refined unit test coverage collection to focus on the FlagScale source tree and exclude non-target directories.
  • Fixed CUDA workflow defaults for runner labels, mounted volumes, and reusable workflow input handling.
  • Updated MetaX CI configuration for runner labels, shared dataset/tokenizer mounts, and image tar storage paths.
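
The coverage change above (collect only from the FlagScale source tree, omit everything else) can be sketched as a small generator for a coverage.py config file. This is an illustrative sketch only: the actual logic lives in `tests/test_utils/runners/run_unit_tests.sh`, and the `omit` patterns and the `build_coveragerc` helper below are assumptions, not the values used in this PR.

```python
import configparser
import io


def build_coveragerc(source="flagscale", omit=("tests/*", "third_party/*", "build/*")):
    """Return .coveragerc text that measures one source tree only.

    The omit patterns are placeholders chosen for illustration.
    """
    # coverage.py reads list options as multi-line values, one entry per line.
    omit_value = "\n" + "\n".join(omit)

    parser = configparser.ConfigParser()
    parser["run"] = {"source": source, "omit": omit_value}
    parser["report"] = {"omit": omit_value}

    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_coveragerc())
```

Writing the same `omit` list into both `[run]` and `[report]` keeps excluded paths out of the measurement and out of the final report.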

Checklist

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in CI workflow setup steps
  • My changes generate no new warnings
  • I have tested my changes with the relevant unit test and CI-related flows

Copilot AI review requested due to automatic review settings May 28, 2026 05:00
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR expands unit test coverage across transformations, runner utilities, elastic monitoring/diagnostics, and platform detection. It also updates the CI and coverage configuration so that reporting targets the package code.

Changes:

  • Added new unit tests for transformations (selectors/hooks/state scoping/log IO), runner utilities/launchers/factory/base, elastic monitoring/diagnostics/gpu health check, and platform management.
  • Updated coverage configuration generation to focus on flagscale/ and omit non-target paths.
  • Adjusted CI workflow defaults for CUDA image builds and platform config for MetaX.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tests/unit_tests/transformations/test_transformation.py | Adds selector/transformation registry/config tests. |
| tests/unit_tests/transformations/test_state_store.py | Extends StateStore coverage (init args/reset/BaseState). |
| tests/unit_tests/transformations/test_state_scope_transform.py | Adds state-scope warning + cleanup callback tests. |
| tests/unit_tests/transformations/test_log_io_transformation.py | Adds LogIO hook/transformation tests. |
| tests/unit_tests/transformations/test_hook.py | Extends hook registry behavior/order/stateful propagation tests. |
| tests/unit_tests/test_small_modules.py | Adds tests for several small utility modules and integration helpers. |
| tests/unit_tests/test_agent_tool_matcher.py | Adds ToolMatcher scoring/degradation/cache/init-path tests. |
| tests/unit_tests/runner/test_runner_utils.py | Adds runner utils tests (args/envs/cmd update/log discovery/SSH helpers). |
| tests/unit_tests/runner/test_runner_train_inference_helpers.py | Adds train/inference runner helper tests (config updates + arg generation). |
| tests/unit_tests/runner/test_runner_factory.py | Adds RunnerFactory registry tests with stubbed imports. |
| tests/unit_tests/runner/test_runner_base.py | Adds Runner behavior tests (backend selection + launcher delegation). |
| tests/unit_tests/runner/test_launcher_ssh.py | Adds extensive SSH launcher tests (scripts, health checks, profiling, status). |
| tests/unit_tests/runner/elastic/test_simulated_fault.py | Adds simulated fault loop/CLI tests. |
| tests/unit_tests/runner/elastic/test_monitor_service.py | Adds MonitorService threading, hang detection, diagnostics, pid checks tests. |
| tests/unit_tests/runner/elastic/test_monitor_launcher.py | Adds monitor launcher/runner tests (pid wait, stop conditions, errors). |
| tests/unit_tests/runner/elastic/test_log_collector.py | Extends log collector tests (file resolution, offsets, local/remote paths). |
| tests/unit_tests/runner/elastic/test_gpu_health_check.py | Greatly expands GPU health check branch/path coverage (with torch stubs). |
| tests/unit_tests/runner/elastic/test_diagnostic.py | Extends diagnostic report tests (offsets/helpers/incremental output/errors). |
| tests/unit_tests/platforms/test_platforms.py | Adds platform registration/selection tests with optional torch stubbing. |
| tests/test_utils/runners/run_unit_tests.sh | Refines coverage config to focus on flagscale/ and omit non-target paths. |
| flagscale/train/megatron/training/arguments_fs.py | Formatting fixes + adds Engram CLI argument group. |
| .github/workflows/build_image_cuda.yml | Updates default runner labels/volumes and uses JSON-parsed runs-on. |
| .github/configs/metax.yml | Updates MetaX config paths and adds shared tar dir setting. |
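
Several of the test files listed above are described as using torch stubs or stubbed imports. A minimal sketch of that technique follows, assuming a hypothetical `detect_accelerator` helper; it is not code from this PR, and the real tests may stub torch differently.

```python
import sys
import types
from unittest import mock


def detect_accelerator():
    """Hypothetical helper: report 'cuda' when torch sees a GPU, else 'cpu'."""
    import torch  # looked up in sys.modules first, so a stub is picked up

    return "cuda" if torch.cuda.is_available() else "cpu"


def make_torch_stub(cuda_available):
    """Build a fake torch module exposing only torch.cuda.is_available()."""
    stub = types.ModuleType("torch")
    stub.cuda = types.SimpleNamespace(is_available=lambda: cuda_available)
    return stub


def test_detects_cuda_with_stub():
    # patch.dict restores sys.modules on exit, so a real torch is untouched.
    with mock.patch.dict(sys.modules, {"torch": make_torch_stub(True)}):
        assert detect_accelerator() == "cuda"


def test_falls_back_to_cpu_with_stub():
    with mock.patch.dict(sys.modules, {"torch": make_torch_stub(False)}):
        assert detect_accelerator() == "cpu"
```

Stubbing through `sys.modules` lets such tests run on CI machines without a GPU or without torch installed, whereas `pytest.importorskip("torch")` skips the module entirely when torch is missing.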

  name: Build ${{ matrix.task }}
  needs: prepare
- runs-on: [self-hosted, Linux, X64, nvidia-0, gpus-8]
+ runs-on: ${{ fromJson(inputs.runs_on || '["flagscale-nvidia-a100-gpu2-32c-128g"]') }}
  name: Load and push images
  needs: ['build', 'summary']
- runs-on: [self-hosted, Linux, X64, nvidia-0, gpus-8]
+ runs-on: ${{ fromJson(inputs.runs_on || '["flagscale-nvidia-a100-gpu2-32c-128g"]') }}
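
The `fromJson(inputs.runs_on || '[...]')` expression falls back to a default JSON array when the workflow input is empty, then parses the result into a label list. The Python sketch below imitates that behavior to make the fallback explicit. It is an analogy rather than how GitHub Actions evaluates expressions internally, and `resolve_runs_on` is a hypothetical name.

```python
import json

# Default label list copied from the workflow line above.
DEFAULT_RUNS_ON = '["flagscale-nvidia-a100-gpu2-32c-128g"]'


def resolve_runs_on(runs_on_input):
    """Return runner labels, using the default when the input is empty."""
    # Like `||` in a workflow expression, `or` treats "" and None as falsy.
    raw = runs_on_input or DEFAULT_RUNS_ON
    labels = json.loads(raw)
    # runs-on accepts a single label or a list; normalize to a list here.
    if isinstance(labels, str):
        labels = [labels]
    return labels


if __name__ == "__main__":
    print(resolve_runs_on(""))
    print(resolve_runs_on('["self-hosted", "Linux", "X64"]'))
```

Because the input is parsed as JSON, a caller has to pass a JSON-encoded string such as `'["self-hosted", "Linux"]'`; a bare label like `self-hosted` is not valid JSON and would fail to parse.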
Comment on lines +58 to +59
["/mnt/airs-business/cicd/baai_datasets:/home/gitlab-runner/data",
"/mnt/airs-business/cicd/baai_tokenizers:/home/gitlab-runner/tokenizers"]
Comment on lines +360 to +361
'["/mnt/airs-business/cicd/docker_data:/home/gitlab-runner/data",
"/mnt/airs-business/cicd/docker_tokenizers:/home/gitlab-runner/tokenizers"]' }}
import pytest
from omegaconf import OmegaConf

torch = pytest.importorskip("torch")
warn.assert_called_once()


def test_add_decive_extra_config_merges_matching_device_and_preserves_other_dicts():
}


def test_add_decive_extra_config_without_device_returns_full_config():
Comment on lines +60 to +61
self.assertEqual(store._state_by_scope, {})
self.assertIsNone(store._active_scope)

2 participants