-
Notifications
You must be signed in to change notification settings - Fork 388
contrib/sort: enable f16/f64 sorting on no native FP #2737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
67ec002 to
b9334bf
Compare
…ive FP) Add an integer-domain total-order comparator and use it for float16_t/double on targets lacking native fp operations. Preserves IEEE intent (e.g., -0.0 < +0.0); Wires into sort/select traits, removes FP-only guards, and enables basic tests. This is part of enabling Highway vqsort for NumPy on x86.
b9334bf to
b97c259
Compare
jan-wassenberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, it's great to make this unconditionally available.
Generally LGTM but the -0 != 0 part is surprising, do we really need that?
hwy/contrib/sort/sort_unit_test.cc
Outdated
| } | ||
|
|
||
| template <class TF, class V, class D = DFromV<V>, HWY_IF_UNSIGNED_D(D)> | ||
| HWY_API Mask<D> IsInfBin(V v) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we improve the name somehow? Maybe IsInfWrapper or IsInfEmulated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done IsInfWrapper/IsNaNWrapper
hwy/contrib/sort/sort_unit_test.cc
Outdated
| auto in = hwy::AllocateAligned<T>(num); | ||
| HWY_ASSERT(in); | ||
| Fill(d, GetLane(Inf(d)), num, in.get()); | ||
| Fill(d, BitCastScalar<T>(ExponentMask<TF>()), num, in.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also make a wrapper function for Inf() in the same way as IsInf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just discovered NegativeInfOrLowestValue/PositiveInfOrHighestValue which can be used instead
IMHO, treating For reference, here’s a benchmark: NumPy’s upcoming patch (abandoning x86-simd-sort), tested on an AMD Ryzen 7 7700X (no native FP16 support), Clang 19.1.7 |
|
FWIW -0 vs 0 was not specified because we just went with what IEEE754 says. |
|
As discussed, I understand that the ordering of equivalent keys can anyway change in quicksort, so it's fine to keep the current -0 != 0. |
- Replace std::iota with IotaWrapper for fp16 (ARM/ppc) - Use PositiveInfOrHighestValue in emulated traits - Rename IsNaN/IsInf Bin → Wrapper - Update tests and call sites (traits, order-emulate, vqsort)
45d5873 to
9282c16
Compare
jan-wassenberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice simplifications, I especially like IsEmulatedMinMax, thanks!
Wrapper LGTM.
One remaining concern about inheritance of Descending from Ascending, apart from that good to go.
hwy/contrib/sort/order-emulate-inl.h
Outdated
| using TF = typename Base::KeyType; | ||
|
|
||
| HWY_INLINE bool Compare1(const T* a, const T* b) const { | ||
| return reinterpret_cast<_Asc>(this)->Compare1(b, a); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reinterpret_cast and inheritance seems questionable here.
We could instead use a typedef like you have (but no _ prefix because that is reserved by C++), and instead write return AscBase().Compare(d, b, a) - WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, done
Not, yet. CI trigger errors due to missing /home/runner/work/highway/highway/hwy/contrib/sort/sort_unit_test.cc:84:73: error: invalid operands to binary expression ('hwy::float16_t' and 'hwy::float16_t')
const Vec<D> epsp1 = Set(d, BitCastScalar<T>(ConvertScalarTo<TF>(1) + hwy::Epsilon<TF>()));
~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~
/home/runner/work/highway/highway/hwy/tests/test_util-inl.h:265:7: note: in instantiation of function template specialization 'hwy::N_SSSE3::(anonymous namespace)::TestFloatLargerSmaller::operator()<hwy::float16_t, hwy::N_SSSE3::Simd<hwy::float16_t, 8, 0>>' requested here
Test()(T(), d);
^
/home/runner/work/highway/highway/hwy/tests/test_util-inl.h:411:61: note: in instantiation of member function 'hwy::N_SSSE3::detail::ForeachCappedR<hwy::float16_t, 8, 1, hwy::N_SSSE3::(anonymous namespace)::TestFloatLargerSmaller, 0>::Do' requested hereseems the following branch need to refine: Lines 1228 to 1250 in 0bb1960
|
|
I'm going to avoid using operator + in here instead for now to bypass ci errors: highway/hwy/contrib/sort/sort_unit_test.cc Lines 80 to 85 in 746c513
|
Agreed, what we usually do for bf16 and fp16 is to |
|
To clarify:
|
Regarding Now, all expected tests passed except for a umpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
102 | _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0);
| ^
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmxWe pass |
jan-wassenberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
Internal tests/builds are unfortunately failing due to a build timeout in optimized asan builds for the f16 file. What do you think of splitting it up into separate files for Sort vs Select vs PartialSort? Disabling f16 for asan is probably not desirable. |
That could be a straightforward fix. The current CPU dispatch approach can exhaust available physical memory, forcing the system to fall back on swap, which is likely why may hitting the limit. Have you tried reducing the build parallelism e.g.
Using |
|
Ah, possible that swapping might slow things down, yes. Unfortunately we don't have control over the build machines. Good point about PCH. We can least clang header modules in the BUILD file. I'll try that. |
Add an integer-domain total-order comparator and use it for float16_t/double on targets lacking native fp operations. Preserves IEEE intent (e.g., -0.0 < +0.0); Wires into sort/select traits, removes FP-only guards, and enables basic tests.
This is part of enabling Highway vqsort for NumPy on x86, as intel-x86-sort provides emulation for half-precision