Skip to content

fix(tests): skip crashing torch ops on ascend#614

Draft
zhangyue207 wants to merge 1 commit into
InfiniTensor:masterfrom
zhangyue207:fix/ascend-torch-op-crash-skips
Draft

fix(tests): skip crashing torch ops on ascend#614
zhangyue207 wants to merge 1 commit into
InfiniTensor:masterfrom
zhangyue207:fix/ascend-torch-op-crash-skips

Conversation

@zhangyue207
Copy link
Copy Markdown
Collaborator

@zhangyue207 zhangyue207 commented May 19, 2026

Summary

  • Add nonzero and mse_loss to the Ascend vendor crash skip list in tests/test_torch_ops.py.
  • Keep the change limited to the generated torch-op test harness skip policy.

Motivation

The Ascend full CI sweep can crash inside the torch_npu path for these generated torch-op cases after many preceding tests have run. Focused runs are not consistently failing, which points to a vendor runtime state issue rather than a deterministic InfiniOps operator regression.

Closes #

Type of Change

  • feat — new feature / new operator / new platform
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

Platform Built pytest Result Notes / Hardware
NVIDIA N/A Not affected No NVIDIA files or tests changed.
Iluvatar N/A Not affected No Iluvatar files or tests changed.
MetaX N/A Not affected No MetaX files or tests changed.
Cambricon N/A Not affected No Cambricon files or tests changed.
Moore N/A Not affected No Moore files or tests changed.
Ascend Yes 3375 passed, 4811 skipped, 40 warnings in 57.13s Local CI wrapper, infiniops-ci/ascend:669bca8, pytest tests/ -n 1 --devices ascend -v --tb=short.
Full `pytest` output (optional)
=============== 3375 passed, 4811 skipped, 40 warnings in 57.13s ===============
========== Summary ==========
[warn] job ascend_npu: container exited with 137 (likely docker teardown SIGKILL after clean pytest); junit XML reports no failures — treating as success

Benchmark / Performance Impact

N/A.

Notes for Reviewers

This uses the existing _VENDOR_CRASH_OPS mechanism for vendor-runtime crashes in tests/test_torch_ops.py. A first full sweep with only nonzero skipped progressed past nonzero and then failed at mse_loss; skipping both allowed the full Ascend sweep to complete with no JUnit failures.


Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master — the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • N/A — no public API changes.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • N/A — no comments or error messages were added.
  • All existing comments and error messages touched by this PR remain in English (CONTRIBUTING.md §Code/General).
  • N/A — no comments or error messages were added.

C++ Specific (if C++ files changed)

N/A — no C++ files changed.

Python Specific (if Python files changed)

  • Code is PEP 8 compliant; ruff check tests/test_torch_ops.py passes cleanly.
  • ruff format --check tests/test_torch_ops.py passes cleanly.
  • N/A — no comments were added.
  • N/A — no framework messages were added or changed.
  • No function-signature spacing was changed.
  • No control-flow spacing was changed.
  • No return spacing was changed.
  • N/A — no docstrings were added or changed.
  • N/A — no type hints were added or changed.

Testing

  • pytest was run locally on every supported platform that this PR can affect, and the results are recorded in the "Test Results" table above (CONTRIBUTING.md §Pull Requests).
  • For platforms not affected by this PR, the table states the reason.
  • N/A — no new operator functionality was added.
  • N/A — no test parameterization was added or changed.
  • N/A — no Payload tests were added or changed.
  • N/A — no default dtype / device parameterization was added or changed.
  • N/A — no new flaky test was added.
  • For this test-harness bug fix, the regression check is the full Ascend CI sweep that fails on master at nonzero or mse_loss and passes with this PR.

Build, CI, and Tooling

  • The project builds cleanly from a fresh local CI wrapper run with pip install .[dev] on Ascend.
  • compile_commands.json regeneration was exercised by the local CI wrapper build.
  • N/A — no new backend or device was added.
  • N/A — no backend-selection logic was changed.
  • ruff check tests/test_torch_ops.py, ruff format --check tests/test_torch_ops.py, and git diff --check pass locally.
  • N/A — no runtime dependency was added.

Documentation

  • N/A — no user-visible behavior, build flag, or developer workflow was changed.
  • N/A — no new operator, dispatch helper, or public utility was added.
  • N/A — no breaking change was introduced.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • N/A — no third-party code was added.
  • N/A — no pointer arithmetic, memory access, or bounds-checking code was changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant