[2026 Spring][T1-1-1] mygitljf by mygitljf · Pull Request #1166 · InfiniTensor/InfiniCore

mygitljf · 2026-05-19T18:04:53Z

Platform Compatibility

NVIDIA A100, MetaX C500, Iluvatar MR-V100:
On Iluvatar (CoreX), the bitcast store lowering path for fp16 / bf16 is not available, so on that platform copysign and nextafter fall back to torch.<op> for fp16 / bf16 inputs, gated by ntops.torch.utils._is_corex_compat_device. NVIDIA and MetaX still take the original ntops kernel path.
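
The fallback rule above can be sketched as a small selector. This is only an illustration under stated assumptions, not the ntops implementation: `select_impl`, `kernel_path`, `fallback_path`, the `is_corex` flag, and the string dtype names are hypothetical stand-ins, and the signature of ntops.torch.utils._is_corex_compat_device is not shown in this PR description.

```python
from typing import Callable

# Dtypes for which the description says the CoreX bitcast store path is missing.
HALF_DTYPES = frozenset({"float16", "bfloat16"})
# Ops the description says fall back to torch.<op> on that platform.
FALLBACK_OPS = frozenset({"copysign", "nextafter"})


def select_impl(op: str, dtype: str, is_corex: bool,
                kernel: Callable, torch_fallback: Callable) -> Callable:
    """Pick the fallback only for CoreX + fp16/bf16 + copysign/nextafter."""
    if is_corex and dtype in HALF_DTYPES and op in FALLBACK_OPS:
        return torch_fallback
    return kernel


def kernel_path(*args):
    # Hypothetical stand-in for the ntops kernel path.
    return "ntops"


def fallback_path(*args):
    # Hypothetical stand-in for torch.<op>.
    return "torch"
```

With this shape, `select_impl("copysign", "float16", True, kernel_path, fallback_path)` returns the fallback, while float32 inputs or non-CoreX devices keep the kernel path.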

Test Commands

ntops correctness

python -m pytest \
  tests/test_rad2deg.py tests/test_copysign.py tests/test_lcm.py \
  tests/test_nextafter.py tests/test_lgamma.py -q

ntops performance

NTOPS_RUN_PERF=1 python -m pytest tests/test_<op>_perf.py -s

Performance tests run only when NTOPS_RUN_PERF=1 is set. Timing settings: warmup=50, rep=200, median of 3.
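
One plausible reading of these settings, sketched with the standard library only: skip unless NTOPS_RUN_PERF=1, make `warmup` untimed calls, time `rep` calls, repeat the measurement 3 times, and keep the median. The helper names (`perf_enabled`, `bench_once`, `median_of`) are hypothetical, and wall-clock timing with `time.perf_counter` is an assumption; the actual tests may time on the device instead.

```python
import os
import statistics
import time


def perf_enabled(env=None) -> bool:
    """True only when NTOPS_RUN_PERF is set to "1"."""
    env = os.environ if env is None else env
    return env.get("NTOPS_RUN_PERF") == "1"


def bench_once(fn, warmup=50, rep=200) -> float:
    """Mean time per call in microseconds, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(rep):
        fn()
    return (time.perf_counter() - start) / rep * 1e6


def median_of(fn, trials=3, **kwargs) -> float:
    """Median of `trials` independent bench_once measurements."""
    return statistics.median(bench_once(fn, **kwargs) for _ in range(trials))
```

Taking the median of a few repeated measurements reduces the effect of a single slow outlier run.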

InfiniCore correctness + performance (switch the device flag per platform)

# NVIDIA
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --nvidia
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --nvidia \
  --bench device --num_prerun 100 --num_iterations 1000
# MetaX C500
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --metax
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --metax \
  --bench device --num_prerun 50 --num_iterations 1000
# Iluvatar MR-V100
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --iluvatar
python test/infinicore/run.py --ops rad2deg copysign lcm nextafter lgamma --iluvatar \
  --bench device --num_prerun 20 --num_iterations 100

Test Results (100% pass on all three platforms)

Correctness

| Platform | Device flag | ntops pytest | InfiniCore run.py |
|---|---|---|---|
| NVIDIA A100 80GB PCIe | --nvidia | 44 passed, 4 skipped | 172 / 172 (100.0%) |
| MetaX C500 | --metax | 44 passed, 4 skipped | 172 / 172 (100.0%) |
| Iluvatar MR-V100 | --iluvatar | 44 passed, 4 skipped | 172 / 172 (100.0%) |

The 4 skipped cases are lcm on bool dtype (torch.lcm itself does not support bool).

Performance

ntops

| Operator | Cases | ntops sum (µs) | torch sum (µs) | Speedup |
|---|---|---|---|---|
| rad2deg | 18 | 190.10 | 190.59 | 1.003x |
| copysign | 18 | 248.22 | 235.79 | 0.950x |
| lcm | 24 | 1018.18 | 1177.45 | 1.156x |
| nextafter | 18 | 254.00 | 255.25 | 1.005x |
| lgamma | 18 | 291.97 | 289.02 | 0.990x |
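
The Speedup column is consistent with torch sum divided by ntops sum, so values above 1 mean the ntops path was faster. The snippet below recomputes it from the rows above; `ROWS` and `speedup` are names chosen for this illustration.

```python
# (operator, ntops sum in µs, torch sum in µs), copied from the table above.
ROWS = [
    ("rad2deg", 190.10, 190.59),
    ("copysign", 248.22, 235.79),
    ("lcm", 1018.18, 1177.45),
    ("nextafter", 254.00, 255.25),
    ("lgamma", 291.97, 289.02),
]


def speedup(ntops_us: float, torch_us: float) -> float:
    """Ratio of torch time to ntops time; > 1 means ntops is faster."""
    return torch_us / ntops_us


for name, ntops_us, torch_us in ROWS:
    print(f"{name:10s} {speedup(ntops_us, torch_us):.3f}x")
```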

InfiniCore

NVIDIA A100:

MetaX C500:

Total tests run: 5, Passed: 5, Success rate: 100.0%
[Device] PyTorch:    9370.536 ms
[Device] InfiniCore: 10860.860 ms
Device Speedup (PyTorch/InfiniCore): 0.863x

Iluvatar MR-V100:

Total tests run: 5, Passed: 5, Success rate: 100.0%
[Device] PyTorch:      1164.031 ms
[Device] InfiniCore:   2439.888 ms
Device Speedup (PyTorch/InfiniCore): 0.477x
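
The device speedup in these summaries is PyTorch time divided by InfiniCore time. A minimal sketch for recomputing it from pasted output follows; the regular expression assumes the line format shown above, and `device_speedup` is a hypothetical helper, not part of run.py.

```python
import re

# Matches lines such as "[Device] PyTorch:    1164.031 ms".
DEVICE_LINE = re.compile(r"\[Device\]\s+(PyTorch|InfiniCore):\s+([0-9.]+)\s*ms")


def device_speedup(log: str) -> float:
    """Return PyTorch time / InfiniCore time parsed from a bench summary."""
    times = {name: float(ms) for name, ms in DEVICE_LINE.findall(log)}
    return times["PyTorch"] / times["InfiniCore"]


iluvatar_log = """\
[Device] PyTorch:      1164.031 ms
[Device] InfiniCore:   2439.888 ms
"""
print(f"{device_speedup(iluvatar_log):.3f}x")
```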

…n dispatch and tests

Wires the five T1-1-1 operators through infinicore.ntops.torch on CUDA:

- python/infinicore/ops/{rad2deg,copysign,lcm,nextafter,lgamma}.py: thin dispatchers calling infinicore.ntops.torch.<op>.
- python/infinicore/__init__.py: re-export the five ops.
- test/infinicore/ops/{rad2deg,copysign,lcm,nextafter,lgamma}.py: framework tests covering OUT_OF_PLACE and INPLACE(out=c) on float16/bfloat16/float32 (lcm: int8/int16/int32/int64).

nextafter, copysign, lcm, lgamma run bit-exact against torch.

Verified on NVIDIA A100 80GB PCIe with --nvidia (172/172 passed).
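
"Bit-exact" is stricter than numeric equality, which matters for copysign because 0.0 and -0.0 compare equal while differing in the sign bit. The sketch below illustrates the idea on Python's 64-bit floats using only the standard library; the helper names are hypothetical and this is not the test framework's actual comparison, which runs on fp16 / bf16 / fp32 tensors.

```python
import math
import struct


def bits(x: float) -> int:
    """Raw IEEE 754 binary64 bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bit_exact(a: float, b: float) -> bool:
    """True only when both values have identical bit patterns."""
    return bits(a) == bits(b)


neg_zero = math.copysign(0.0, -1.0)
print(neg_zero == 0.0)           # numeric equality ignores the sign of zero
print(bit_exact(neg_zero, 0.0))  # the bit patterns differ in the sign bit
```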

…ache_for_benchmark

Some Triton driver backends (e.g. MetaX MACA's MacaDriver) do not implement Triton benchmark's `get_empty_cache_for_benchmark` / `clear_cache` helpers. Calling them eagerly aborts the run before any op is ever dispatched.

Probe the driver with `getattr` + `callable` and only install the cache-clear hook when both helpers exist. Backends that expose them (e.g. NVIDIA's CudaDriver) keep the original behavior; backends that do not simply skip cache clearing - correctness and device-event timing are unaffected.

Verification (MetaX C500, --metax): InfiniCore run.py --bench device --num_prerun 50 --num_iterations 1000 for rad2deg copysign lcm nextafter lgamma:

Total tests run: 5, Passed: 5
[Device] PyTorch: 110695.750 ms
[Device] InfiniCore: 108326.593 ms
Device Speedup (PyTorch/InfiniCore): 1.022x
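
The `getattr` + `callable` probe described in this commit message might look roughly like the sketch below. It is an illustration, not the code in this PR: `make_cache_clear_hook`, `FullDriver`, and `BareDriver` are hypothetical, and the call shape (the first helper returns a buffer that is then passed to `clear_cache`) is an assumption about how the two helpers are used together.

```python
def make_cache_clear_hook(driver):
    """Return a zero-argument cache-clear hook, or None if a helper is missing."""
    get_cache = getattr(driver, "get_empty_cache_for_benchmark", None)
    clear = getattr(driver, "clear_cache", None)
    if not (callable(get_cache) and callable(clear)):
        return None  # backend lacks the helpers: skip cache clearing
    cache = get_cache()
    return lambda: clear(cache)


class FullDriver:
    """Hypothetical driver that exposes both helpers."""

    def __init__(self):
        self.cleared = 0

    def get_empty_cache_for_benchmark(self):
        return bytearray(16)

    def clear_cache(self, cache):
        cache[:] = bytes(len(cache))  # zero the buffer in place
        self.cleared += 1


class BareDriver:
    """Hypothetical driver that exposes neither helper."""
```

Probing once up front keeps the benchmark loop free of per-iteration attribute checks, and a `None` hook makes the "skip cache clearing" case explicit to the caller.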

mygitljf added 2 commits May 18, 2026 12:46

mygitljf requested a review from a team May 19, 2026 18:04

mygitljf wants to merge 2 commits into InfiniTensor:main from mygitljf:2026-spring-mygitljf-T1-1-1
