[2026 Spring][T1-1-1] mygitljf #1166

Open
mygitljf wants to merge 2 commits into InfiniTensor:main from mygitljf:2026-spring-mygitljf-T1-1-1

@mygitljf

Platform Compatibility

Tested on NVIDIA A100, MetaX C500, and Iluvatar MR-V100.

On Iluvatar (CoreX), the bitcast store lowering path is not available for fp16 / bf16, so copysign and nextafter fall back to torch.<op> for those dtypes on that platform. The fallback is selected via ntops.torch.utils._is_corex_compat_device. NVIDIA and MetaX still take the original ntops kernel path.
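
The fallback rule above amounts to a small dispatch decision. The sketch below is illustrative only: `choose_backend` and its string dtype names are hypothetical, and the boolean `is_corex` stands in for whatever `ntops.torch.utils._is_corex_compat_device` reports, since its real signature is not shown in this PR.

```python
# Hypothetical sketch of the platform fallback rule; not the actual ntops code.
HALF_DTYPES = frozenset({"float16", "bfloat16"})
FALLBACK_OPS = frozenset({"copysign", "nextafter"})


def choose_backend(op: str, dtype: str, is_corex: bool) -> str:
    """Return "torch" when the ntops kernel path is unavailable, else "ntops"."""
    if is_corex and op in FALLBACK_OPS and dtype in HALF_DTYPES:
        return "torch"
    return "ntops"


if __name__ == "__main__":
    for platform, is_corex in (("NVIDIA", False), ("Iluvatar", True)):
        for dtype in ("float16", "float32"):
            print(platform, dtype, choose_backend("copysign", dtype, is_corex))
```

Keeping the decision in one predicate means only the two affected ops and the two half-precision dtypes leave the ntops path; everything else is unchanged on every platform.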

Test Commands

ntops correctness

python -m pytest \
  tests/test_rad2deg.py tests/test_copysign.py tests/test_lcm.py \
  tests/test_nextafter.py tests/test_lgamma.py -q

ntops performance

NTOPS_RUN_PERF=1 python -m pytest tests/test_<op>_perf.py -s

Performance tests run only when NTOPS_RUN_PERF=1 is set. Each measurement uses warmup=50 and rep=200 and reports the median of 3 runs.
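
For readers unfamiliar with this measurement scheme, here is a minimal host-side sketch of "warmup, then timed repetitions, then median of 3". It is an assumption about the shape of the procedure, not the PR's harness: `bench_median` is a hypothetical helper, it interprets "median of 3" as the median of three per-trial averages, and it uses a wall-clock timer, whereas a real GPU benchmark would need device-side timing or explicit synchronization.

```python
import statistics
import time


def bench_median(fn, warmup=50, rep=200, trials=3):
    """Median over `trials` of the mean per-call time, in microseconds."""
    per_trial_us = []
    for _ in range(trials):
        # Warm-up calls are executed but not timed.
        for _ in range(warmup):
            fn()
        start = time.perf_counter()
        for _ in range(rep):
            fn()
        elapsed = time.perf_counter() - start
        per_trial_us.append(elapsed / rep * 1e6)
    return statistics.median(per_trial_us)


if __name__ == "__main__":
    print(f"{bench_median(lambda: sum(range(100))):.3f} us per call")
```

Taking the median rather than the mean of the trials makes the reported figure less sensitive to a single noisy trial.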

InfiniCore correctness + performance (switch the device flag per platform)

# NVIDIA
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --nvidia
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --nvidia \
  --bench device --num_prerun 100 --num_iterations 1000
# MetaX C500
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --metax
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --metax \
  --bench device --num_prerun 50 --num_iterations 1000
# Iluvatar MR-V100
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --iluvatar
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --iluvatar \
  --bench device --num_prerun 20 --num_iterations 100

Test Results (100% pass on all three platforms)

Correctness

| Platform | Device flag | ntops pytest | InfiniCore run.py |
|---|---|---|---|
| NVIDIA A100 80GB PCIe | --nvidia | 44 passed, 4 skipped | 172 / 172 (100.0%) |
| MetaX C500 | --metax | 44 passed, 4 skipped | 172 / 172 (100.0%) |
| Iluvatar MR-V100 | --iluvatar | 44 passed, 4 skipped | 172 / 172 (100.0%) |

The 4 skipped cases are lcm on bool dtype (torch.lcm itself does not support bool).

Performance

ntops

| Operator | Cases | ntops sum (µs) | torch sum (µs) | Speedup |
|---|---|---|---|---|
| rad2deg | 18 | 190.10 | 190.59 | 1.003x |
| copysign | 18 | 248.22 | 235.79 | 0.950x |
| lcm | 24 | 1018.18 | 1177.45 | 1.156x |
| nextafter | 18 | 254.00 | 255.25 | 1.005x |
| lgamma | 18 | 291.97 | 289.02 | 0.990x |
Figure: rad2deg, copysign, lcm, nextafter, lgamma
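
The Speedup column in the ntops table is consistent with dividing the torch sum by the ntops sum, so values above 1 mean ntops was faster. The short check below recomputes it from the figures in the table; the `rows` dictionary and `speedup` helper are just illustrative names.

```python
# (ntops sum in µs, torch sum in µs), copied from the ntops performance table.
rows = {
    "rad2deg": (190.10, 190.59),
    "copysign": (248.22, 235.79),
    "lcm": (1018.18, 1177.45),
    "nextafter": (254.00, 255.25),
    "lgamma": (291.97, 289.02),
}


def speedup(ntops_us: float, torch_us: float) -> float:
    """Speedup of ntops relative to torch: greater than 1 means ntops is faster."""
    return torch_us / ntops_us


if __name__ == "__main__":
    for op, (ntops_us, torch_us) in rows.items():
        print(f"{op:10s} {speedup(ntops_us, torch_us):.3f}x")
```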

InfiniCore

NVIDIA A100:

MetaX C500:

Total tests run: 5, Passed: 5, Success rate: 100.0%
[Device] PyTorch:    9370.536 ms
[Device] InfiniCore: 10860.860 ms
Device Speedup (PyTorch/InfiniCore): 0.863x
Figure: InfiniCore summary in the MetaX environment

Iluvatar MR-V100:

Total tests run: 5, Passed: 5, Success rate: 100.0%
[Device] PyTorch:      1164.031 ms
[Device] InfiniCore:   2439.888 ms
Device Speedup (PyTorch/InfiniCore): 0.477x
Figure: InfiniCore test summary in the Iluvatar environment

mygitljf added 2 commits May 18, 2026 12:46
…n dispatch and tests

Wires the five T1-1-1 operators through infinicore.ntops.torch on CUDA:

- python/infinicore/ops/{rad2deg,copysign,lcm,nextafter,lgamma}.py: thin
  dispatchers calling infinicore.ntops.torch.<op>.
- python/infinicore/__init__.py: re-export the five ops.
- test/infinicore/ops/{rad2deg,copysign,lcm,nextafter,lgamma}.py: framework
  tests covering OUT_OF_PLACE and INPLACE(out=c) on float16/bfloat16/float32
  (lcm: int8/int16/int32/int64). nextafter, copysign, lcm, lgamma run
  bit-exact against torch.

Verified on NVIDIA A100 80GB PCIe with --nvidia (172/172 passed).
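
As an aside on what "bit-exact" means in the commit message above: ordinary `==` treats `0.0` and `-0.0` as equal and never matches NaN, which would hide exactly the sign and edge-case behavior that copysign and nextafter are meant to get right. The sketch below compares float32 bit patterns in plain Python; `float32_bits` and `bit_exact` are hypothetical helpers for illustration, not part of the PR's test framework.

```python
import math
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bit_exact(a: float, b: float) -> bool:
    """True only if a and b have identical float32 bit patterns."""
    return float32_bits(a) == float32_bits(b)


if __name__ == "__main__":
    print(0.0 == -0.0)                               # value equality ignores the sign of zero
    print(bit_exact(0.0, -0.0))                      # bit patterns differ in the sign bit
    print(bit_exact(math.copysign(0.0, -1.0), -0.0)) # copysign produces negative zero
```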
…ache_for_benchmark

Some Triton driver backends (e.g. MetaX MACA's MacaDriver) do not
implement Triton benchmark's `get_empty_cache_for_benchmark` /
`clear_cache` helpers. Calling them eagerly aborts the run before any
op is ever dispatched.

Probe the driver with `getattr` + `callable` and only install the
cache-clear hook when both helpers exist. Backends that expose them
(e.g. NVIDIA's CudaDriver) keep the original behavior; backends that
do not simply skip cache clearing - correctness and device-event
timing are unaffected.

Verification (MetaX C500, --metax):
  InfiniCore run.py --bench device --num_prerun 50 --num_iterations 1000
  for rad2deg copysign lcm nextafter lgamma:
    Total tests run: 5, Passed: 5
    [Device] PyTorch:    110695.750 ms
    [Device] InfiniCore: 108326.593 ms
    Device Speedup (PyTorch/InfiniCore): 1.022x
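
The `getattr` + `callable` probe described in the second commit message can be sketched as follows. This is a minimal illustration under assumptions, not the PR's code: `make_cache_clear_hook`, `FullDriver`, and `MinimalDriver` are hypothetical names, and the two stand-in classes only mimic a driver that does or does not expose `get_empty_cache_for_benchmark` and `clear_cache`.

```python
def make_cache_clear_hook(driver):
    """Return a zero-argument cache-clear hook, or None if the driver lacks support."""
    get_cache = getattr(driver, "get_empty_cache_for_benchmark", None)
    clear_cache = getattr(driver, "clear_cache", None)
    # Install the hook only when both helpers exist and are callable.
    if not (callable(get_cache) and callable(clear_cache)):
        return None
    cache = get_cache()

    def hook():
        clear_cache(cache)

    return hook


class FullDriver:
    """Stand-in for a backend that implements both helpers."""

    def __init__(self):
        self.cleared = 0

    def get_empty_cache_for_benchmark(self):
        return bytearray(16)

    def clear_cache(self, cache):
        cache[:] = bytes(len(cache))  # overwrite the buffer with zeros
        self.cleared += 1


class MinimalDriver:
    """Stand-in for a backend that implements neither helper."""


if __name__ == "__main__":
    print(make_cache_clear_hook(FullDriver()) is not None)
    print(make_cache_clear_hook(MinimalDriver()) is None)
```

Probing first, instead of calling the helpers eagerly, is what lets the benchmark continue on backends that do not implement them.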
mygitljf requested a review from a team on May 19, 2026 18:04