[2026 Spring][T1-1-1] HyosungSink #79

Closed
HyosungSink wants to merge 11 commits into InfiniTensor:master from HyosungSink:2026-spring-HyosungSink-T1-1-1

HyosungSink commented on May 18, 2026

Summary

  • Implements the T1-1-1 ntops operators: rad2deg, copysign, lcm, lgamma, and nextafter.
  • Adds NineToothed kernels, ntops.torch wrappers, registration, operator-scoped correctness tests, and operator-scoped performance tests.
  • Covers 30 correctness cases and 20 performance cases per operator, for 150 correctness cases and 100 performance cases in total.
  • Includes NVIDIA, Iluvatar, and MetaX adaptations behind vendor/device gates, keeping platform-specific fast paths isolated from the NVIDIA path.
  • Connects the five operators to the InfiniCore use_ntops path while using PyTorch only as the test reference, not as the runtime implementation.
  • Reports per-case performance for all 100 cases below; the minimum torch_ms / ntops_ms ratio across all cases is 0.913x on NVIDIA, 0.932x on Iluvatar, and 0.901x on MetaX.
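For readers unfamiliar with the five operators, their scalar semantics can be sketched with the Python standard library alone. This is only an illustration of what each operator computes per element; it is not the NineToothed kernel code from this PR, and the function names below simply mirror the operator names. `math.lcm` and `math.nextafter` require Python 3.9 or newer.

```python
import math

# Scalar sketches of the five T1-1-1 operators, using only the standard library.
# These illustrate per-element semantics; they are not the PR's kernels.

def rad2deg(x: float) -> float:
    """Convert an angle from radians to degrees."""
    return math.degrees(x)

def copysign(magnitude: float, sign: float) -> float:
    """Return `magnitude` with the sign bit of `sign` (signed zero included)."""
    return math.copysign(magnitude, sign)

def lcm(a: int, b: int) -> int:
    """Least common multiple of two integers (non-negative result)."""
    return math.lcm(a, b)

def lgamma(x: float) -> float:
    """Natural log of the absolute value of the gamma function at x."""
    return math.lgamma(x)

def nextafter(x: float, toward: float) -> float:
    """Next representable float64 after x in the direction of `toward`."""
    return math.nextafter(x, toward)

if __name__ == "__main__":
    print(copysign(3.0, -0.0))   # -3.0
    print(lcm(4, 6))             # 12
    print(nextafter(1.0, 2.0) > 1.0)  # True
```

The tensor versions apply the same rule element by element, with the additional dtype, broadcasting, and layout handling that the test cases below exercise.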

Validation

  • Correctness passes on all reported platforms: NVIDIA L4/sm_89, Iluvatar MR-V100, and MetaX C500/MACA each pass the full 150-case T1-1-1 suite.
  • Performance passes on all reported platforms: NVIDIA, Iluvatar, and MetaX each pass all 100 performance cases, with every case meeting torch_ms / ntops_ms >= 0.9.
  • InfiniCore integration passes for all five operators through the use_ntops path on NVIDIA, Iluvatar, and MetaX.
  • Platform isolation was reviewed: Iluvatar and MetaX specializations are guarded by vendor/device checks, and the NVIDIA path does not enter those architecture-specific branches.
  • Submission hygiene was reviewed: no PyTorch runtime wrapping, no skipped or xfailed validation cases, and no lowered performance thresholds are used.
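The performance criterion above (`torch_ms / ntops_ms >= 0.9`) can be sketched as a small check. The helper names are hypothetical and are not taken from the PR's test suite; the sample timings are the NVIDIA `lcm` "u8 mid 1d" row from the tables in this description.

```python
THRESHOLD = 0.9  # pass criterion stated in the PR: torch_ms / ntops_ms >= 0.9

def speed_ratio(torch_ms: float, ntops_ms: float) -> float:
    """Ratio of reference time to ntops time; values above 1.0 mean ntops is faster."""
    if ntops_ms <= 0:
        raise ValueError("ntops_ms must be positive")
    return torch_ms / ntops_ms

def passes(torch_ms: float, ntops_ms: float, threshold: float = THRESHOLD) -> bool:
    """True when the case meets the performance threshold."""
    return speed_ratio(torch_ms, ntops_ms) >= threshold

if __name__ == "__main__":
    # NVIDIA lcm "u8 mid 1d": ntops 0.4331 ms, torch 0.3955 ms.
    ratio = speed_ratio(torch_ms=0.3955, ntops_ms=0.4331)
    print(f"{ratio:.3f}x", passes(0.3955, 0.4331))  # 0.913x True
```

A ratio slightly below 1.0 therefore still passes, which is why rows such as 0.913x count as meeting the requirement.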

Performance Summary

| Operator | Cases | NVIDIA min | Iluvatar min | MetaX min |
| --- | --- | --- | --- | --- |
| rad2deg | 20 | 0.976x | 0.996x | 0.926x |
| copysign | 20 | 0.966x | 0.975x | 0.917x |
| lcm | 20 | 0.913x | 1.242x | 0.901x |
| lgamma | 20 | 0.953x | 0.932x | 0.969x |
| nextafter | 20 | 0.961x | 0.946x | 0.946x |
| Total | 100 | 0.913x | 0.932x | 0.901x |
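The Total row is the per-platform minimum over the five operator rows. A minimal sketch of that aggregation, using the values from the summary table (the dictionary layout and function names are illustrative assumptions, not part of the PR):

```python
# Per-operator minimum ratios copied from the Performance Summary table.
MIN_RATIOS = {
    "rad2deg":   {"NVIDIA": 0.976, "Iluvatar": 0.996, "MetaX": 0.926},
    "copysign":  {"NVIDIA": 0.966, "Iluvatar": 0.975, "MetaX": 0.917},
    "lcm":       {"NVIDIA": 0.913, "Iluvatar": 1.242, "MetaX": 0.901},
    "lgamma":    {"NVIDIA": 0.953, "Iluvatar": 0.932, "MetaX": 0.969},
    "nextafter": {"NVIDIA": 0.961, "Iluvatar": 0.946, "MetaX": 0.946},
}

def total_minimum(min_ratios: dict) -> dict:
    """Minimum ratio per platform across all operators (the Total row)."""
    platforms = next(iter(min_ratios.values())).keys()
    return {p: min(row[p] for row in min_ratios.values()) for p in platforms}

def limiting_operator(min_ratios: dict, platform: str) -> str:
    """Operator whose worst case sets the platform's total minimum."""
    return min(min_ratios, key=lambda op: min_ratios[op][platform])

if __name__ == "__main__":
    print(total_minimum(MIN_RATIOS))
    for platform in ("NVIDIA", "Iluvatar", "MetaX"):
        print(platform, limiting_operator(MIN_RATIOS, platform))
```

Running this shows that `lcm` sets the minimum on NVIDIA and MetaX, while `lgamma` sets it on Iluvatar.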

rad2deg Performance Cases

| Case | NVIDIA ntops (ms) | NVIDIA torch (ms) | NVIDIA ratio | Iluvatar ntops (ms) | Iluvatar torch (ms) | Iluvatar ratio | MetaX ntops (ms) | MetaX torch (ms) | MetaX ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f16 large 1d | 0.2693 | 0.2692 | 1.000x | 0.1186 | 0.1257 | 1.060x | 0.0525 | 0.0489 | 0.933x |
| f32 large 1d | 0.5848 | 0.5710 | 0.976x | 0.2335 | 0.2327 | 0.996x | 0.0959 | 0.0941 | 0.981x |
| f64 large 1d | 1.1509 | 1.1407 | 0.991x | 0.1998 | 0.2032 | 1.017x | 0.1887 | 0.1843 | 0.977x |
| f32 large 2d | 0.5830 | 0.5769 | 0.989x | 0.2333 | 0.2329 | 0.998x | 0.0964 | 0.0942 | 0.978x |
| f16 large 3d | 0.2691 | 0.2698 | 1.003x | 0.1184 | 0.1254 | 1.058x | 0.0521 | 0.0487 | 0.935x |
| f32 large 3d | 0.5827 | 0.5738 | 0.985x | 0.2333 | 0.2330 | 0.999x | 0.0961 | 0.0944 | 0.983x |
| f64 large 3d | 1.1546 | 1.1424 | 0.990x | 0.2002 | 0.2037 | 1.017x | 0.1857 | 0.1843 | 0.992x |
| f32 large out 1d | 0.5828 | 0.5814 | 0.998x | 0.2327 | 0.2335 | 1.003x | 0.0956 | 0.0940 | 0.983x |
| f64 large out 2d | 1.1540 | 1.1435 | 0.991x | 0.2004 | 0.2029 | 1.013x | 0.1855 | 0.1839 | 0.992x |
| f16 large out 3d | 0.2661 | 0.2680 | 1.007x | 0.1185 | 0.1238 | 1.045x | 0.0514 | 0.0484 | 0.941x |
| f32 mid 1d | 0.5809 | 0.5726 | 0.986x | 0.2331 | 0.2327 | 0.998x | 0.0957 | 0.0943 | 0.985x |
| f16 mid 1d | 0.2668 | 0.2709 | 1.015x | 0.1187 | 0.1252 | 1.055x | 0.0513 | 0.0485 | 0.944x |
| f64 mid 1d | 1.1502 | 1.1415 | 0.992x | 0.2002 | 0.2035 | 1.016x | 0.1894 | 0.1844 | 0.974x |
| f32 small 1d | 0.5836 | 0.5780 | 0.990x | 0.2330 | 0.2330 | 1.000x | 0.0958 | 0.0943 | 0.985x |
| f16 noncontig 4096 | 0.2730 | 0.2688 | 0.985x | 0.1183 | 0.1253 | 1.059x | 0.0525 | 0.0486 | 0.926x |
| f32 noncontig 4096 | 0.5850 | 0.5714 | 0.977x | 0.2332 | 0.2331 | 0.999x | 0.0970 | 0.0946 | 0.975x |
| f64 noncontig 2048 | 0.2683 | 0.2663 | 0.993x | 0.0523 | 0.0535 | 1.024x | 0.0522 | 0.0504 | 0.966x |
| f32 noncontig out 4096 | 0.5930 | 0.5803 | 0.979x | 0.2327 | 0.2333 | 1.003x | 0.0969 | 0.0941 | 0.971x |
| f32 permute3d 256x256x128 | 0.2682 | 0.2657 | 0.991x | 0.1185 | 0.1183 | 0.998x | 0.0521 | 0.0504 | 0.967x |
| f32 permute3d out 256x256x128 | 0.2650 | 0.2659 | 1.003x | 0.1181 | 0.1180 | 0.998x | 0.0519 | 0.0495 | 0.952x |

copysign Performance Cases

| Case | NVIDIA ntops (ms) | NVIDIA torch (ms) | NVIDIA ratio | Iluvatar ntops (ms) | Iluvatar torch (ms) | Iluvatar ratio | MetaX ntops (ms) | MetaX torch (ms) | MetaX ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f16 large 1d | 0.4356 | 0.4256 | 0.977x | 0.1718 | 0.1860 | 1.082x | 0.0769 | 0.0705 | 0.917x |
| f32 large 1d | 0.8649 | 0.8607 | 0.995x | 0.3376 | 0.3462 | 1.025x | 0.1440 | 0.1390 | 0.966x |
| f64 large 1d | 1.7334 | 1.7010 | 0.981x | 0.2062 | 0.2029 | 0.984x | 0.2733 | 0.2736 | 1.001x |
| f32 large 2d | 0.8719 | 0.8672 | 0.995x | 0.3407 | 0.3456 | 1.014x | 0.1423 | 0.1390 | 0.977x |
| f16 large 3d | 0.4436 | 0.4322 | 0.974x | 0.1714 | 0.1852 | 1.080x | 0.0751 | 0.0704 | 0.938x |
| f32 large 3d | 0.8687 | 0.8583 | 0.988x | 0.3380 | 0.3457 | 1.023x | 0.1424 | 0.1390 | 0.976x |
| f64 large 3d | 1.7360 | 1.6895 | 0.973x | 0.2057 | 0.2027 | 0.985x | 0.2727 | 0.2737 | 1.003x |
| f32 large out 1d | 0.8670 | 0.8608 | 0.993x | 0.3383 | 0.3456 | 1.021x | 0.1405 | 0.1387 | 0.987x |
| f64 large out 2d | 1.7392 | 1.7328 | 0.996x | 0.2062 | 0.2032 | 0.986x | 0.2724 | 0.2733 | 1.003x |
| f16 large out 3d | 0.4426 | 0.4345 | 0.982x | 0.1717 | 0.1855 | 1.081x | 0.0751 | 0.0703 | 0.936x |
| f32 mid 1d | 0.8669 | 0.8644 | 0.997x | 0.3378 | 0.3455 | 1.023x | 0.1405 | 0.1391 | 0.990x |
| f16 mid 1d | 0.4343 | 0.4310 | 0.992x | 0.1716 | 0.1857 | 1.082x | 0.0739 | 0.0705 | 0.953x |
| f64 mid 1d | 1.7180 | 1.6913 | 0.984x | 0.2059 | 0.2028 | 0.985x | 0.2716 | 0.2739 | 1.008x |
| f32 small 1d | 0.8674 | 0.8591 | 0.990x | 0.3380 | 0.3466 | 1.025x | 0.1408 | 0.1389 | 0.987x |
| f32 broadcast rect 2048x8192 | 0.2726 | 0.2637 | 0.967x | 0.0982 | 0.1912 | 1.948x | 0.0701 | 0.2643 | 3.773x |
| f32 broadcast 4096 | 0.2724 | 0.2630 | 0.966x | 0.0963 | 0.1911 | 1.983x | 0.0889 | 0.2219 | 2.496x |
| f16 noncontig 4096 | 0.4444 | 0.4337 | 0.976x | 0.1716 | 0.1850 | 1.078x | 0.0748 | 0.0720 | 0.962x |
| f32 noncontig 4096 | 0.8715 | 0.8548 | 0.981x | 0.3381 | 0.3460 | 1.023x | 0.1429 | 0.1389 | 0.972x |
| f64 noncontig 2048 | 0.4356 | 0.4243 | 0.974x | 0.0546 | 0.0532 | 0.975x | 0.0752 | 0.0732 | 0.973x |
| f32 permute3d out 256x256x128 | 0.4407 | 0.4288 | 0.973x | 0.1721 | 0.1761 | 1.023x | 0.0755 | 0.0713 | 0.944x |

lcm Performance Cases

| Case | NVIDIA ntops (ms) | NVIDIA torch (ms) | NVIDIA ratio | Iluvatar ntops (ms) | Iluvatar torch (ms) | Iluvatar ratio | MetaX ntops (ms) | MetaX torch (ms) | MetaX ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i32 large 1d | 0.8299 | 0.8783 | 1.058x | 0.7592 | 1.1980 | 1.578x | 0.5388 | 0.5212 | 0.967x |
| i32 large positive 1d | 0.8266 | 0.8775 | 1.062x | 0.5901 | 0.9836 | 1.667x | 0.4818 | 0.4343 | 0.901x |
| i32 large 2d | 0.8289 | 0.8758 | 1.057x | 0.7593 | 1.1989 | 1.579x | 0.5382 | 0.5213 | 0.968x |
| i32 large positive 2d | 0.8276 | 0.8743 | 1.057x | 0.5927 | 0.9571 | 1.615x | 0.4821 | 0.4342 | 0.901x |
| i32 large 3d | 0.8328 | 0.8746 | 1.050x | 0.7441 | 1.1598 | 1.559x | 0.5385 | 0.5215 | 0.968x |
| i32 large positive 3d | 0.8289 | 0.8756 | 1.056x | 0.5828 | 0.9516 | 1.633x | 0.4817 | 0.4343 | 0.902x |
| i32 large out 1d | 0.8311 | 0.8608 | 1.036x | 0.7433 | 1.1598 | 1.560x | 0.5374 | 0.5212 | 0.970x |
| i32 large out 2d | 0.8317 | 0.8632 | 1.038x | 0.7210 | 1.1240 | 1.559x | 0.5384 | 0.5212 | 0.968x |
| i32 broadcast 8192 | 2.2100 | 2.5649 | 1.161x | 1.6411 | 4.9779 | 3.033x | 1.1024 | 2.2396 | 2.032x |
| i32 large low 1d | 0.8306 | 0.8747 | 1.053x | 0.4153 | 0.7198 | 1.733x | 0.3039 | 0.3471 | 1.142x |
| i16 mid 1d | 0.4892 | 0.4483 | 0.916x | 0.5307 | 0.7479 | 1.409x | 0.5121 | 0.5542 | 1.082x |
| i16 large 1d | 0.5039 | 0.4690 | 0.931x | 0.5313 | 0.7483 | 1.408x | 0.5153 | 0.5543 | 1.076x |
| i64 mid 1d | 1.6634 | 1.7177 | 1.033x | 1.2172 | 4.3829 | 3.601x | 0.5994 | 0.9209 | 1.536x |
| i64 large 1d | 1.6428 | 1.7353 | 1.056x | 1.2224 | 4.3806 | 3.584x | 0.5990 | 0.9211 | 1.538x |
| u8 mid 1d | 0.4331 | 0.3955 | 0.913x | 0.4417 | 0.5512 | 1.248x | 0.4292 | 0.3880 | 0.904x |
| i8 mid 1d | 0.4013 | 0.3805 | 0.948x | 0.3832 | 0.7197 | 1.878x | 0.4183 | 0.5263 | 1.258x |
| i32 noncontig 4096 | 0.8354 | 0.8741 | 1.046x | 0.7216 | 1.1239 | 1.557x | 0.5434 | 0.5218 | 0.960x |
| i32 noncontig out 4096 | 0.8381 | 0.8628 | 1.030x | 0.7197 | 1.1234 | 1.561x | 0.5417 | 0.5212 | 0.962x |
| i16 noncontig 6144 | 1.2626 | 1.2803 | 1.014x | 1.3421 | 1.6669 | 1.242x | 1.1479 | 1.2346 | 1.076x |
| i32 permute3d out 256x256x128 | 0.4263 | 0.4397 | 1.031x | 0.3654 | 0.5720 | 1.566x | 0.2778 | 0.2685 | 0.967x |

lgamma Performance Cases

| Case | NVIDIA ntops (ms) | NVIDIA torch (ms) | NVIDIA ratio | Iluvatar ntops (ms) | Iluvatar torch (ms) | Iluvatar ratio | MetaX ntops (ms) | MetaX torch (ms) | MetaX ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f16 large 1d | 0.2756 | 0.2679 | 0.972x | 0.3863 | 0.3691 | 0.955x | 0.2804 | 0.2962 | 1.056x |
| f32 large 1d | 0.5803 | 0.5752 | 0.991x | 0.3870 | 0.3609 | 0.932x | 0.2774 | 0.2968 | 1.070x |
| f64 large 1d | 11.4071 | 11.2792 | 0.989x | 4.4191 | 10.0906 | 2.283x | 3.7320 | 3.6298 | 0.973x |
| f32 large 2d | 0.5847 | 0.5757 | 0.985x | 0.3632 | 0.3389 | 0.933x | 0.2775 | 0.2965 | 1.068x |
| f16 large 3d | 0.2783 | 0.2676 | 0.962x | 0.3623 | 0.3460 | 0.955x | 0.2807 | 0.2964 | 1.056x |
| f32 large 3d | 0.5839 | 0.5756 | 0.986x | 0.3629 | 0.3386 | 0.933x | 0.2776 | 0.2967 | 1.069x |
| f64 large 3d | 11.3901 | 11.3198 | 0.994x | 4.2130 | 9.8332 | 2.334x | 3.7390 | 3.6237 | 0.969x |
| f32 large out 1d | 0.5823 | 0.5831 | 1.001x | 0.3632 | 0.3388 | 0.933x | 0.2770 | 0.2965 | 1.071x |
| f64 large out 2d | 11.3916 | 11.3031 | 0.992x | 4.2129 | 9.7558 | 2.316x | 3.7283 | 3.6147 | 0.970x |
| f16 large out 3d | 0.2772 | 0.2645 | 0.954x | 0.3621 | 0.3465 | 0.957x | 0.2803 | 0.2965 | 1.058x |
| f32 mid 1d | 0.5850 | 0.5757 | 0.984x | 0.3630 | 0.3386 | 0.933x | 0.2774 | 0.2964 | 1.069x |
| f16 mid 1d | 0.2751 | 0.2684 | 0.975x | 0.3628 | 0.3461 | 0.954x | 0.2805 | 0.2963 | 1.056x |
| f64 mid 1d | 11.3780 | 11.3018 | 0.993x | 4.2691 | 9.9091 | 2.321x | 3.7281 | 3.6198 | 0.971x |
| f32 small 1d | 0.5815 | 0.5760 | 0.991x | 0.3631 | 0.3385 | 0.932x | 0.2773 | 0.2966 | 1.069x |
| f16 noncontig 4096 | 0.2813 | 0.2679 | 0.953x | 0.3623 | 0.3462 | 0.956x | 0.2809 | 0.2965 | 1.056x |
| f32 noncontig 4096 | 0.5854 | 0.5754 | 0.983x | 0.3631 | 0.3386 | 0.933x | 0.2781 | 0.2967 | 1.067x |
| f64 noncontig 2048 | 2.9838 | 2.8712 | 0.962x | 1.0582 | 2.2172 | 2.095x | 0.9696 | 0.9401 | 0.970x |
| f32 noncontig out 4096 | 0.5884 | 0.5841 | 0.993x | 0.3631 | 0.3387 | 0.933x | 0.2776 | 0.2964 | 1.068x |
| f32 permute3d 256x256x128 | 0.2775 | 0.2680 | 0.966x | 0.1849 | 0.1747 | 0.945x | 0.1442 | 0.1548 | 1.074x |
| f32 permute3d out 256x256x128 | 0.2761 | 0.2685 | 0.972x | 0.1844 | 0.1746 | 0.947x | 0.1437 | 0.1547 | 1.076x |

nextafter Performance Cases

| Case | NVIDIA ntops (ms) | NVIDIA torch (ms) | NVIDIA ratio | Iluvatar ntops (ms) | Iluvatar torch (ms) | Iluvatar ratio | MetaX ntops (ms) | MetaX torch (ms) | MetaX ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f16 large 1d | 0.4218 | 0.4293 | 1.018x | 0.1737 | 0.1883 | 1.084x | 0.0791 | 0.0762 | 0.964x |
| f32 large 1d | 0.8754 | 0.8567 | 0.979x | 0.3334 | 0.3449 | 1.034x | 0.1443 | 0.1404 | 0.973x |
| f64 large 1d | 1.7087 | 1.6975 | 0.993x | 0.7302 | 0.9705 | 1.329x | 0.2760 | 0.2753 | 0.997x |
| f32 large 2d | 0.8341 | 0.8579 | 1.029x | 0.3647 | 0.3450 | 0.946x | 0.1427 | 0.1404 | 0.983x |
| f16 large 3d | 0.4265 | 0.4297 | 1.007x | 0.1737 | 0.1875 | 1.079x | 0.0804 | 0.0761 | 0.946x |
| f32 large 3d | 0.8429 | 0.8556 | 1.015x | 0.3333 | 0.3445 | 1.034x | 0.1427 | 0.1405 | 0.984x |
| f64 large 3d | 1.7130 | 1.7088 | 0.998x | 0.7304 | 0.9526 | 1.304x | 0.2765 | 0.2755 | 0.996x |
| f32 large out 1d | 0.8335 | 0.8580 | 1.029x | 0.3338 | 0.3448 | 1.033x | 0.1419 | 0.1402 | 0.988x |
| f64 large out 2d | 1.7086 | 1.7063 | 0.999x | 0.7296 | 0.9514 | 1.304x | 0.2756 | 0.2751 | 0.998x |
| f16 large out 3d | 0.4351 | 0.4314 | 0.991x | 0.1734 | 0.1872 | 1.080x | 0.0803 | 0.0761 | 0.948x |
| f32 mid 1d | 0.8506 | 0.8571 | 1.008x | 0.3335 | 0.3446 | 1.033x | 0.1420 | 0.1404 | 0.988x |
| f16 mid 1d | 0.4249 | 0.4276 | 1.006x | 0.1735 | 0.1873 | 1.079x | 0.0789 | 0.0764 | 0.968x |
| f64 mid 1d | 1.7145 | 1.7048 | 0.994x | 0.7293 | 0.9517 | 1.305x | 0.2758 | 0.2753 | 0.998x |
| f32 small 1d | 0.8367 | 0.8528 | 1.019x | 0.3332 | 0.3446 | 1.034x | 0.1421 | 0.1404 | 0.988x |
| f32 broadcast rect 2048x8192 | 0.2683 | 0.2633 | 0.981x | 0.0990 | 0.2595 | 2.622x | 0.0698 | 0.2658 | 3.808x |
| f32 broadcast 4096 | 0.2675 | 0.2628 | 0.982x | 0.0964 | 0.2597 | 2.693x | 0.0926 | 0.2163 | 2.336x |
| f16 noncontig 4096 | 0.4291 | 0.4282 | 0.998x | 0.1738 | 0.1874 | 1.078x | 0.0814 | 0.0792 | 0.973x |
| f32 noncontig 4096 | 0.8486 | 0.8563 | 1.009x | 0.3331 | 0.3446 | 1.035x | 0.1436 | 0.1405 | 0.979x |
| f64 noncontig 2048 | 0.4375 | 0.4204 | 0.961x | 0.1866 | 0.2450 | 1.313x | 0.0763 | 0.0776 | 1.017x |
| f32 permute3d out 256x256x128 | 0.4387 | 0.4284 | 0.976x | 0.1701 | 0.1757 | 1.033x | 0.0767 | 0.0747 | 0.974x |

Notes

  • PyTorch is used only as the test reference, not as the runtime implementation.
  • The latest Iluvatar nextafter float16 fix is gated on both the half dtype and the Iluvatar vendor check; NVIDIA paths remain on the existing ntops/NineToothed kernels.
  • MetaX C500/MACA performance was measured on the current main PR branch with the submodule hashes listed above; for nextafter, the MetaX 1024/8 configuration is checked before the Iluvatar 256/4 branch.
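The gating order described in these notes can be sketched as follows. Everything here is hypothetical: the function names, the vendor strings, and the dtype string are illustrative and are not taken from the ntops source, and the PR does not say what the two numbers in each configuration pair mean, so they are carried as an opaque tuple.

```python
from typing import Optional, Tuple

def select_nextafter_config(vendor: str) -> Optional[Tuple[int, int]]:
    """Hypothetical selector: the MetaX check runs before the Iluvatar branch.

    Returns None for every other vendor (including NVIDIA), meaning the
    existing kernel defaults are left untouched.
    """
    vendor = vendor.lower()
    if vendor == "metax":
        return (1024, 8)   # "MetaX 1024/8 configuration" from the notes
    if vendor == "iluvatar":
        return (256, 4)    # "Iluvatar 256/4 branch" from the notes
    return None

def use_iluvatar_half_fix(vendor: str, dtype: str) -> bool:
    """Hypothetical gate: the float16 fix applies only to half on Iluvatar."""
    return vendor.lower() == "iluvatar" and dtype == "float16"

if __name__ == "__main__":
    for vendor in ("NVIDIA", "Iluvatar", "MetaX"):
        print(vendor, select_nextafter_config(vendor),
              use_iluvatar_half_fix(vendor, "float16"))
```

Keeping each vendor behind its own early return is one way to make sure the NVIDIA path never enters an architecture-specific branch, which is the isolation property the Validation section describes.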

@HyosungSink HyosungSink force-pushed the 2026-spring-HyosungSink-T1-1-1 branch from e7ccc3b to 1053c1c Compare May 18, 2026 12:58
@HyosungSink HyosungSink force-pushed the 2026-spring-HyosungSink-T1-1-1 branch from 1053c1c to 2824162 Compare May 18, 2026 13:08
@HyosungSink HyosungSink force-pushed the 2026-spring-HyosungSink-T1-1-1 branch from ec69e6f to 0ccbed2 Compare May 19, 2026 16:44
@HyosungSink HyosungSink force-pushed the 2026-spring-HyosungSink-T1-1-1 branch from 73667a0 to e68650b Compare May 20, 2026 16:45