contrib/sort: enable f16/f64 sorting on no native FP #2737

seiko2plus · 2025-09-28T11:04:00Z

Add an integer-domain total-order comparator and use it for float16_t/double on targets lacking native FP operations. Preserves IEEE intent (e.g., -0.0 < +0.0). Wires into sort/select traits, removes FP-only guards, and enables basic tests.

This is part of enabling Highway vqsort for NumPy on x86, as intel-x86-sort provides emulation for half-precision.
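
For readers unfamiliar with the technique, the following is a minimal scalar sketch of an integer-domain total-order comparison for double. It is not the code from this PR: `TotalOrderKey` and `LessTotalOrder` are hypothetical names, and the actual implementation operates on Highway vectors. The idea is to flip bits conditionally on the sign so that plain unsigned comparison yields the desired order, including -0.0 < +0.0.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not Highway API): maps the bits of a double to an
// unsigned key whose integer order matches a total order on the doubles.
inline uint64_t TotalOrderKey(double value) {
  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));  // well-defined bit copy
  const uint64_t sign = bits >> 63;
  // Negative inputs: flip all bits so larger magnitudes sort first.
  // Non-negative inputs: set the sign bit so they sort above all negatives.
  const uint64_t flip = sign ? ~uint64_t{0} : (uint64_t{1} << 63);
  return bits ^ flip;
}

// Ascending comparison performed entirely in the integer domain.
inline bool LessTotalOrder(double a, double b) {
  return TotalOrderKey(a) < TotalOrderKey(b);
}

// Small usage check: the two zeros are distinct and ordered.
inline void CheckZeroOrdering() {
  assert(LessTotalOrder(-0.0, 0.0));
  assert(!LessTotalOrder(0.0, -0.0));
}
```

Because no floating-point instruction is involved, the same approach can be applied to 16-bit lanes holding binary16 bit patterns.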

hwy/contrib/sort/order-emulate-inl.h


jan-wassenberg

Nice, it's great to make this unconditionally available.
Generally LGTM but the -0 != 0 part is surprising, do we really need that?

hwy/contrib/sort/order-emulate-inl.h

jan-wassenberg · 2025-09-29T10:55:46Z

hwy/contrib/sort/sort_unit_test.cc

+} 
+
+template <class TF, class V, class D = DFromV<V>, HWY_IF_UNSIGNED_D(D)>
+HWY_API Mask<D> IsInfBin(V v) {


Can we improve the name somehow? Maybe IsInfWrapper or IsInfEmulated?

Done: renamed to IsInfWrapper/IsNaNWrapper.
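
As background on what such wrappers have to compute when only the unsigned bit pattern is available, here is a scalar sketch for binary16. The names `IsInfBits` and `IsNaNBits` are hypothetical and this is not the PR's vector code; it only illustrates the bit tests, assuming the standard binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits).

```cpp
#include <cassert>
#include <cstdint>

// binary16 exponent field: all five exponent bits set.
constexpr uint16_t kF16ExponentMask = 0x7C00;
// Everything except the sign bit.
constexpr uint16_t kF16AbsMask = 0x7FFF;

// Infinity: exponent all ones and mantissa zero, either sign.
inline bool IsInfBits(uint16_t bits) {
  return (bits & kF16AbsMask) == kF16ExponentMask;
}

// NaN: exponent all ones and a nonzero mantissa, either sign.
inline bool IsNaNBits(uint16_t bits) {
  return (bits & kF16AbsMask) > kF16ExponentMask;
}

// Small usage check with well-known bit patterns.
inline void CheckSpecialValues() {
  assert(IsInfBits(0x7C00));   // +inf
  assert(IsNaNBits(0x7E00));   // a quiet NaN
  assert(!IsInfBits(0x3C00));  // 1.0
}
```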

jan-wassenberg · 2025-09-29T10:56:36Z

hwy/contrib/sort/sort_unit_test.cc

    auto in = hwy::AllocateAligned<T>(num);
    HWY_ASSERT(in);
-    Fill(d, GetLane(Inf(d)), num, in.get());
+    Fill(d, BitCastScalar<T>(ExponentMask<TF>()), num, in.get());


Should we also make a wrapper function for Inf() in the same way as IsInf?

Just discovered NegativeInfOrLowestValue/PositiveInfOrHighestValue, which can be used instead.

seiko2plus · 2025-09-29T11:59:02Z

but the -0 != 0 part is surprising, do we really need that?

IMHO, treating -0.0 < 0.0, and -0.0 != 0.0 as true is numerically reasonable for sorting. Enforcing (-0 == 0) == true and (-0 < 0) == false would require extra mask/blend steps to emulate floating-point behavior, which seems unnecessary here and adds overhead. If I’m not mistaken, the ordering of ±0 is currently unspecified, so allowing -0.0 < 0.0 is acceptable and improves performance.

For reference, here’s a benchmark: hwy::vqsort vs Intel’s sort using the NumPy benchmark suite.

NumPy’s upcoming patch (abandoning x86-simd-sort), tested on an AMD Ryzen 7 7700X (no native FP16 support), Clang 19.1.7


| Change   | Before [4486271c] <hwy-x86-vqsort~4>   | After [3310265e] <hwy-x86-vqsort>   |   Ratio | Benchmark (Parameter)                                                                    |
|----------|----------------------------------------|-------------------------------------|---------|------------------------------------------------------------------------------------------|
| +        | 19.4±0.08ms                            | 23.5±0.3ms                          |    1.21 | bench_function_base.Sort.time_sort('heap', 'uint32', ('reversed',))                      |
| +        | 19.4±0.01ms                            | 23.5±0.3ms                          |    1.21 | bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',))                      |
| +        | 19.4±0.01ms                            | 23.4±0.7ms                          |    1.2  | bench_function_base.Sort.time_sort('heap', 'int32', ('reversed',))                       |
| +        | 19.5±0.01ms                            | 23.4±0.6ms                          |    1.2  | bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))                      |
| +        | 19.4±0.01ms                            | 23.3±0.4ms                          |    1.2  | bench_function_base.Sort.time_sort('quick', 'uint32', ('reversed',))                     |
| +        | 19.5±0.01ms                            | 23.1±0.2ms                          |    1.19 | bench_function_base.Sort.time_sort('heap', 'int32', ('ordered',))                        |
| +        | 19.5±0.01ms                            | 23.1±0.3ms                          |    1.19 | bench_function_base.Sort.time_sort('heap', 'uint32', ('ordered',))                       |
| +        | 19.5±0.01ms                            | 23.1±0.3ms                          |    1.19 | bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))                       |
| +        | 598±20μs                               | 701±40μs                            |    1.17 | bench_function_base.Sort.time_argsort('merge', 'float32', ('ordered',))                  |
| +        | 19.9±0.1ms                             | 23.4±0.2ms                          |    1.17 | bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))                      |
| +        | 19.9±0.07ms                            | 23.3±0.3ms                          |    1.17 | bench_function_base.Sort.time_sort('heap', 'float32', ('reversed',))                     |
| +        | 19.9±0.04ms                            | 23.3±0.5ms                          |    1.17 | bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',))                    |
| +        | 19.9±0.02ms                            | 23.1±0.6ms                          |    1.16 | bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',))                     |
| +        | 205±0.6μs                              | 229±0.6μs                           |    1.11 | bench_function_base.Partition.time_argpartition('float16', ('sorted_block', 1000), 10)   |
| +        | 206±0.5μs                              | 229±0.3μs                           |    1.11 | bench_function_base.Partition.time_argpartition('float16', ('sorted_block', 1000), 100)  |
| +        | 207±0.08μs                             | 231±0.2μs                           |    1.11 | bench_function_base.Partition.time_argpartition('float16', ('sorted_block', 1000), 1000) |
| +        | 20.3±0.03ms                            | 22.5±0.05ms                         |    1.11 | bench_function_base.Sort.time_sort('heap', 'int32', ('random',))                         |
| +        | 20.3±0.08ms                            | 22.5±0.03ms                         |    1.11 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',))                        |
| +        | 20.3±0.01ms                            | 22.5±0.05ms                         |    1.11 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))                       |
| +        | 20.4±0.1ms                             | 22.5±0.03ms                         |    1.1  | bench_function_base.Sort.time_sort('heap', 'uint32', ('random',))                        |
| +        | 20.9±0.1ms                             | 22.8±0.02ms                         |    1.09 | bench_function_base.Sort.time_sort('heap', 'float32', ('random',))                       |
| +        | 21.2±0.01ms                            | 23.1±0.05ms                         |    1.09 | bench_function_base.Sort.time_sort('heap', 'int32', ('sorted_block', 10))                |
| +        | 21.2±0.02ms                            | 23.2±0.03ms                         |    1.09 | bench_function_base.Sort.time_sort('heap', 'uint32', ('sorted_block', 10))               |
| +        | 20.9±0.08ms                            | 22.8±0.2ms                          |    1.09 | bench_function_base.Sort.time_sort('quick', 'float32', ('random',))                      |
| +        | 21.3±0.03ms                            | 23.3±0.04ms                         |    1.09 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))               |
| +        | 21.3±0.06ms                            | 23.1±0.2ms                          |    1.09 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))              |
| +        | 21.3±0.04ms                            | 23.1±0.06ms                         |    1.09 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))              |
| +        | 21.8±0.03ms                            | 23.5±0.05ms                         |    1.08 | bench_function_base.Sort.time_sort('heap', 'float32', ('sorted_block', 10))              |
| +        | 21.3±0.04ms                            | 22.9±0.07ms                         |    1.08 | bench_function_base.Sort.time_sort('heap', 'int32', ('sorted_block', 100))               |
| +        | 21.8±0.03ms                            | 23.6±0.07ms                         |    1.08 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10))             |
| +        | 21.3±0.03ms                            | 22.9±0.3ms                          |    1.08 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))             |
| +        | 21.3±0.04ms                            | 22.8±0.1ms                          |    1.07 | bench_function_base.Sort.time_sort('heap', 'uint32', ('sorted_block', 100))              |
| +        | 608±20μs                               | 644±20μs                            |    1.06 | bench_function_base.Sort.time_argsort('merge', 'float32', ('uniform',))                  |
| +        | 21.9±0.05ms                            | 23.2±0.1ms                          |    1.06 | bench_function_base.Sort.time_sort('heap', 'float32', ('sorted_block', 100))             |
| +        | 21.8±0.05ms                            | 23.2±0.07ms                         |    1.06 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))            |
| -        | 79.9±0.1μs                             | 73.5±0.2μs                          |    0.92 | bench_function_base.Partition.time_argpartition('int16', ('sorted_block', 10), 10)       |
| -        | 79.9±0.1μs                             | 73.3±0.1μs                          |    0.92 | bench_function_base.Partition.time_argpartition('int16', ('sorted_block', 10), 100)      |
| -        | 80.9±0.2μs                             | 74.2±0.2μs                          |    0.92 | bench_function_base.Partition.time_argpartition('int16', ('sorted_block', 10), 1000)     |
| -        | 357±0.03μs                             | 321±3μs                             |    0.9  | bench_function_base.Partition.time_partition('float32', ('sorted_block', 100), 100)      |
| -        | 360±0.09μs                             | 316±3μs                             |    0.88 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 100), 1000)     |
| -        | 734±0.6μs                              | 645±3μs                             |    0.88 | bench_function_base.Sort.time_argsort('merge', 'float64', ('uniform',))                  |
| -        | 357±0.07μs                             | 310±5μs                             |    0.87 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 100), 10)       |
| -        | 311±0.2μs                              | 270±4μs                             |    0.87 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 1000), 10)        |
| -        | 734±0.8μs                              | 638±20μs                            |    0.87 | bench_function_base.Sort.time_argsort('merge', 'float64', ('ordered',))                  |
| -        | 81.0±0.3μs                             | 70.2±0.7μs                          |    0.87 | bench_function_base.Sort.time_sort('quick', 'float16', ('uniform',))                     |
| -        | 81.0±0.2μs                             | 70.0±0.6μs                          |    0.86 | bench_function_base.Sort.time_sort('heap', 'float16', ('uniform',))                      |
| -        | 312±0.2μs                              | 265±3μs                             |    0.85 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 1000), 100)       |
| -        | 310±0.2μs                              | 261±8μs                             |    0.84 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 1000), 1000)      |
| -        | 344±0.08μs                             | 270±3μs                             |    0.79 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 1000), 10)      |
| -        | 343±0.07μs                             | 270±5μs                             |    0.79 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 1000), 1000)    |
| -        | 439±0.09μs                             | 342±2μs                             |    0.78 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 10), 10)          |
| -        | 438±0.2μs                              | 343±3μs                             |    0.78 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 10), 1000)        |
| -        | 344±0.1μs                              | 265±7μs                             |    0.77 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 1000), 100)     |
| -        | 439±0.2μs                              | 338±3μs                             |    0.77 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 10), 100)         |
| -        | 474±0.07μs                             | 344±0.9μs                           |    0.73 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 10), 10)        |
| -        | 474±0.09μs                             | 348±2μs                             |    0.73 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 10), 100)       |
| -        | 473±0.08μs                             | 342±2μs                             |    0.72 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 10), 1000)      |
| -        | 14.6±0.01ms                            | 5.04±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('heap', 'int16', ('random',))                         |
| -        | 14.8±0.02ms                            | 5.17±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 100))               |
| -        | 14.7±0.03ms                            | 5.13±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 1000))              |
| -        | 14.6±0.01ms                            | 5.04±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('quick', 'int16', ('random',))                        |
| -        | 14.8±0.02ms                            | 5.16±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100))              |
| -        | 14.8±0.02ms                            | 5.15±0.04ms                         |    0.35 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 1000))             |
| -        | 15.0±0.01ms                            | 5.18±0.01ms                         |    0.34 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 10))                |
| -        | 15.0±0.01ms                            | 5.17±0.01ms                         |    0.34 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 10))               |
| -        | 225±0.07μs                             | 72.2±0.1μs                          |    0.32 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 10), 10)          |
| -        | 225±0.04μs                             | 72.2±0.2μs                          |    0.32 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 10), 100)         |
| -        | 225±0.08μs                             | 72.6±0.2μs                          |    0.32 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 10), 1000)        |
| -        | 15.0±0ms                               | 4.72±0.04ms                         |    0.32 | bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',))                       |
| -        | 15.0±0ms                               | 4.71±0.07ms                         |    0.31 | bench_function_base.Sort.time_sort('heap', 'int16', ('ordered',))                        |
| -        | 246±0.1μs                              | 69.9±0.6μs                          |    0.28 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 100), 10)         |
| -        | 246±0.1μs                              | 69.2±0.5μs                          |    0.28 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 100), 100)        |
| -        | 247±0.05μs                             | 69.7±0.4μs                          |    0.28 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 100), 1000)       |
| -        | 209±0.1μs                              | 55.6±0.9μs                          |    0.27 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 1000), 10)        |
| -        | 209±0.2μs                              | 55.8±0.6μs                          |    0.27 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 1000), 1000)      |
| -        | 209±0.1μs                              | 55.0±0.2μs                          |    0.26 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 1000), 100)       |
| -        | 2.33±0ms                               | 514±10μs                            |    0.22 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 10))              |
| -        | 2.33±0ms                               | 523±4μs                             |    0.22 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 10))             |
| -        | 2.15±0.01ms                            | 440±3μs                             |    0.21 | bench_function_base.Sort.time_sort('heap', 'float16', ('ordered',))                      |
| -        | 2.36±0.01ms                            | 495±6μs                             |    0.21 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',))                       |
| -        | 2.15±0ms                               | 444±2μs                             |    0.21 | bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',))                     |
| -        | 2.37±0ms                               | 499±10μs                            |    0.21 | bench_function_base.Sort.time_sort('quick', 'float16', ('random',))                      |
| -        | 2.31±0.01ms                            | 460±4μs                             |    0.2  | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 100))            |
| -        | 2.33±0.01ms                            | 454±4μs                             |    0.19 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100))             |
| -        | 36.5±0.05ms                            | 5.74±0.06ms                         |    0.16 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 10))              |
| -        | 37.0±0.08ms                            | 5.78±0.05ms                         |    0.16 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 100))             |
| -        | 35.1±0.08ms                            | 5.56±0.1ms                          |    0.16 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 10))                |
| -        | 36.5±0.06ms                            | 5.68±0.09ms                         |    0.16 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))             |
| -        | 37.1±0.1ms                             | 5.85±0.2ms                          |    0.16 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))            |
| -        | 35.1±0.08ms                            | 5.56±0.07ms                         |    0.16 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))               |
| -        | 33.8±0.3ms                             | 5.05±0.06ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('ordered',))                      |
| -        | 34.9±0.1ms                             | 5.12±0.02ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('random',))                       |
| -        | 34.0±0.4ms                             | 5.14±0.05ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('reversed',))                     |
| -        | 39.2±0.08ms                            | 5.97±0.08ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 1000))            |
| -        | 32.5±0.3ms                             | 4.94±0.03ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('ordered',))                        |
| -        | 33.7±0.2ms                             | 4.98±0.01ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('random',))                         |
| -        | 32.8±0.3ms                             | 4.99±0.03ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('reversed',))                       |
| -        | 35.6±0.07ms                            | 5.34±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 100))               |
| -        | 37.7±0.1ms                             | 5.85±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 1000))              |
| -        | 33.8±0.2ms                             | 5.12±0.04ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))                     |
| -        | 35.0±0.2ms                             | 5.15±0.08ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',))                      |
| -        | 33.9±0.2ms                             | 5.17±0.04ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))                    |
| -        | 39.3±0.1ms                             | 5.98±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))           |
| -        | 32.4±0.1ms                             | 4.94±0.05ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))                       |
| -        | 33.6±0.1ms                             | 5.19±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('random',))                        |
| -        | 32.6±0.2ms                             | 5.03±0.09ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))                      |
| -        | 35.6±0.09ms                            | 5.50±0.06ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))              |
| -        | 37.8±0.08ms                            | 5.79±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))             |
| -        | 35.1±0.2ms                             | 5.31±0.03ms                         |    0.15 | bench_function_base.Sort.time_sort_worst                                                 |
| -        | 3.37±0.01ms                            | 440±4μs                             |    0.13 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000))            |
| -        | 3.37±0.02ms                            | 437±6μs                             |    0.13 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000))           |
| -        | 331±0.03μs                             | 38.6±0.2μs                          |    0.12 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 10), 10)        |
| -        | 331±0.09μs                             | 38.6±0.06μs                         |    0.12 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 10), 100)       |
| -        | 331±0.08μs                             | 38.7±0.2μs                          |    0.12 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 10), 1000)      |
| -        | 323±0.1μs                              | 34.0±0.2μs                          |    0.11 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 100), 10)       |
| -        | 323±0.08μs                             | 34.2±0.1μs                          |    0.11 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 100), 100)      |
| -        | 323±0.05μs                             | 33.8±0.07μs                         |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 100), 1000)     |
| -        | 288±0.2μs                              | 29.9±0.2μs                          |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 1000), 10)      |
| -        | 288±0.07μs                             | 29.9±0.2μs                          |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 1000), 100)     |
| -        | 287±0.05μs                             | 29.9±0.03μs                         |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 1000), 1000)    |
| -        | 537±1μs                                | 50.1±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 10), 10)        |
| -        | 536±0.2μs                              | 50.1±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 10), 100)       |
| -        | 536±0.1μs                              | 50.3±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 10), 1000)      |
| -        | 484±0.7μs                              | 44.6±0.07μs                         |    0.09 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 10), 10)          |
| -        | 484±0.2μs                              | 44.6±0.2μs                          |    0.09 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 10), 100)         |
| -        | 485±0.2μs                              | 44.6±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 10), 1000)        |
| -        | 639±0.7μs                              | 48.8±0.3μs                          |    0.08 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 1000), 10)      |
| -        | 639±0.8μs                              | 48.5±0.5μs                          |    0.08 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 1000), 100)     |
| -        | 635±0.08μs                             | 48.9±0.2μs                          |    0.08 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 1000), 1000)    |
| -        | 582±0.5μs                              | 43.7±0.3μs                          |    0.08 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 1000), 100)       |
| -        | 578±0.9μs                              | 43.8±0.2μs                          |    0.08 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 1000), 1000)      |
| -        | 581±0.5μs                              | 43.4±0.1μs                          |    0.07 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 1000), 10)        |
| -        | 807±0.1μs                              | 51.6±0.4μs                          |    0.06 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 100), 10)       |
| -        | 807±0.3μs                              | 51.2±0.3μs                          |    0.06 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 100), 100)      |
| -        | 808±0.4μs                              | 52.0±0.4μs                          |    0.06 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 100), 1000)     |
| -        | 743±0.7μs                              | 46.6±0.2μs                          |    0.06 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 100), 10)         |
| -        | 743±0.5μs                              | 46.6±0.3μs                          |    0.06 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 100), 100)        |
| -        | 745±1μs                                | 46.1±0.3μs                          |    0.06 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 100), 1000)       |

jan-wassenberg · 2025-09-29T12:48:42Z

FWIW -0 vs 0 was not specified because we just went with what IEEE754 says.
So the main driver seems to be that we just want to do the integer comparison after conditionally flipping bits. But the extra check might not add much to the cost of the bit flipping, it's one extra comparison which can run in parallel, plus a masked XOR (somehow we only have MaskedOr, not MaskedXor, but that can be added).
What do you think, would it be worth a try to see how (in)expensive that is?
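
To make the proposed extra step concrete, here is a scalar sketch of a key that treats -0.0 and +0.0 as equal: one additional comparison against the -0.0 bit pattern, followed by an XOR applied only where that comparison matched. `KeyWithEqualZeros` is a hypothetical name; the vector version discussed above would need a masked XOR, which the comment notes is not currently available.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: integer key under which -0.0 and +0.0 compare equal.
inline uint64_t KeyWithEqualZeros(double value) {
  constexpr uint64_t kSignBit = uint64_t{1} << 63;
  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  // Extra comparison: all-ones mask only when the input is exactly -0.0.
  const uint64_t is_neg_zero = (bits == kSignBit) ? ~uint64_t{0} : 0;
  // "Masked XOR": clears the sign bit of -0.0, leaves other inputs untouched.
  bits ^= (kSignBit & is_neg_zero);
  // Same conditional flip as a plain total-order key.
  const uint64_t flip = (bits >> 63) ? ~uint64_t{0} : kSignBit;
  return bits ^ flip;
}

// Small usage check: both zeros map to the same key.
inline void CheckZerosCollapse() {
  assert(KeyWithEqualZeros(-0.0) == KeyWithEqualZeros(0.0));
}
```

Note that this sketch only changes how keys compare; if keys were transformed in place, the sign of -0.0 would be lost in the output unless it is restored separately.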

jan-wassenberg · 2025-09-29T17:18:58Z

As discussed, I understand that the ordering of equivalent keys can anyway change in quicksort, so it's fine to keep the current -0 != 0.

- Replace std::iota with IotaWrapper for fp16 (ARM/ppc)
- Use PositiveInfOrHighestValue in emulated traits
- Rename IsNaN/IsInf Bin → Wrapper
- Update tests and call sites (traits, order-emulate, vqsort)

jan-wassenberg

Nice simplifications, I especially like IsEmulatedMinMax, thanks!
Wrapper LGTM.
One remaining concern about inheritance of Descending from Ascending, apart from that good to go.

jan-wassenberg · 2025-09-30T10:37:48Z

hwy/contrib/sort/order-emulate-inl.h

+  using TF = typename Base::KeyType;
+
+  HWY_INLINE bool Compare1(const T* a, const T* b) const {
+    return reinterpret_cast<_Asc>(this)->Compare1(b, a);


reinterpret_cast and inheritance seem questionable here.
We could instead use a typedef like you have (but without the _ prefix, because that is reserved by C++) and write return AscBase().Compare(d, b, a) - WDYT?

Makes sense, done.
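
A minimal scalar sketch of the suggested pattern follows: the descending order names the ascending type through an alias and delegates to a temporary instance with swapped arguments, so no cast or inheritance is needed. The struct names are hypothetical stand-ins, not the traits classes from the PR, and the comparison is shown on raw uint16_t keys for simplicity.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an ascending order trait.
struct OrderAscendingSketch {
  bool Compare1(const uint16_t* a, const uint16_t* b) const { return *a < *b; }
};

// Descending order implemented by composition rather than reinterpret_cast.
struct OrderDescendingSketch {
  using AscBase = OrderAscendingSketch;  // no leading underscore

  bool Compare1(const uint16_t* a, const uint16_t* b) const {
    // Swapping the arguments reverses the order.
    return AscBase().Compare1(b, a);
  }
};

// Small usage check: under descending order, the larger key comes first.
inline void CheckDescending() {
  const uint16_t small_key = 1, large_key = 2;
  assert(OrderDescendingSketch().Compare1(&large_key, &small_key));
}
```

Constructing a stateless `AscBase()` temporary is expected to be free after inlining, which is why the alias-plus-delegation form costs nothing compared with the cast.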

seiko2plus · 2025-09-30T11:28:26Z

apart from that good to go.

Not yet. CI fails on both Clang and GCC because float16_t has no operator+:
https://github.com/google/highway/actions/runs/18126110238/job/51581687220?pr=2737

/home/runner/work/highway/highway/hwy/contrib/sort/sort_unit_test.cc:84:73: error: invalid operands to binary expression ('hwy::float16_t' and 'hwy::float16_t')
    const Vec<D> epsp1 = Set(d, BitCastScalar<T>(ConvertScalarTo<TF>(1) + hwy::Epsilon<TF>()));
                                                 ~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~
/home/runner/work/highway/highway/hwy/tests/test_util-inl.h:265:7: note: in instantiation of function template specialization 'hwy::N_SSSE3::(anonymous namespace)::TestFloatLargerSmaller::operator()<hwy::float16_t, hwy::N_SSSE3::Simd<hwy::float16_t, 8, 0>>' requested here
      Test()(T(), d);
      ^
/home/runner/work/highway/highway/hwy/tests/test_util-inl.h:411:61: note: in instantiation of member function 'hwy::N_SSSE3::detail::ForeachCappedR<hwy::float16_t, 8, 1, hwy::N_SSSE3::(anonymous namespace)::TestFloatLargerSmaller, 0>::Do' requested here

It seems the following branch needs to be refined:

highway/hwy/base.h

Lines 1228 to 1250 in 0bb1960

    
    #ifndef HWY_HAVE_SCALAR_F16_TYPE  // allow override
    // Compiler supports _Float16, not necessarily with operators.
    #if HWY_NEON_HAVE_F16C || HWY_RVV_HAVE_F16_VEC || HWY_SSE2_HAVE_F16_TYPE || \
        __SPIRV_DEVICE__
    #define HWY_HAVE_SCALAR_F16_TYPE 1
    #else
    #define HWY_HAVE_SCALAR_F16_TYPE 0
    #endif
    #endif  // HWY_HAVE_SCALAR_F16_TYPE
    #ifndef HWY_HAVE_SCALAR_F16_OPERATORS
    // Recent enough compiler also has operators.
    #if HWY_HAVE_SCALAR_F16_TYPE &&                                       \
        (HWY_COMPILER_CLANG >= 1800 || HWY_COMPILER_GCC_ACTUAL >= 1200 || \
         (HWY_COMPILER_CLANG >= 1500 && !HWY_COMPILER_CLANGCL &&          \
          !defined(_WIN32)) ||                                            \
         (HWY_ARCH_ARM &&                                                 \
          (HWY_COMPILER_CLANG >= 900 || HWY_COMPILER_GCC_ACTUAL >= 800)))
    #define HWY_HAVE_SCALAR_F16_OPERATORS 1
    #else
    #define HWY_HAVE_SCALAR_F16_OPERATORS 0
    #endif
    #endif  // HWY_HAVE_SCALAR_F16_OPERATORS

seiko2plus · 2025-09-30T11:38:44Z

For now, I'm going to avoid using operator+ here to get past the CI errors:

highway/hwy/contrib/sort/sort_unit_test.cc

Lines 80 to 85 in 746c513

    
    const Vec<D> p1 = Set(d, BitCastScalar<T>(ConvertScalarTo<TF>(1)));
    const Vec<D> pinf = BitCast(d, Set(du, ExponentMask<TF>()));
    const Vec<D> peps = Set(d, BitCastScalar<T>(hwy::Epsilon<TF>()));
    const Vec<D> pmax = Set(d, BitCastScalar<T>(hwy::HighestValue<TF>()));
    const Vec<D> epsp1 = Set(d, BitCastScalar<T>(ConvertScalarTo<TF>(1) + hwy::Epsilon<TF>()));
    const Vec<D> epsn1 = Or(epsp1, n0);

jan-wassenberg · 2025-09-30T11:57:02Z

I'm going to avoid using operator + in here instead for now to bypass ci errors.

Agreed, what we usually do for bf16 and fp16 is to ConvertScalarTo<float>, add, then convert back to the actual type.
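
The widen, add, narrow pattern described above can be sketched without Highway. This is only an illustration under stated assumptions: binary16 values are held as raw uint16_t bits, and F16BitsToFloatSketch, FloatToF16BitsSketch and AddF16BitsSketch are hypothetical stdlib-only helpers standing in for the ConvertScalarTo conversions mentioned in the comment. They are not the AddWrapper from this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: decodes IEEE binary16 bits into a float (exact).
inline float F16BitsToFloatSketch(std::uint16_t bits) {
  const bool negative = (bits >> 15) != 0;
  const int exp = (bits >> 10) & 0x1F;
  const int mant = bits & 0x3FF;
  float mag;
  if (exp == 31) {
    mag = mant ? std::numeric_limits<float>::quiet_NaN()
               : std::numeric_limits<float>::infinity();
  } else {
    // Normal: (1024 + mant) * 2^(exp - 25). Subnormal: mant * 2^-24.
    mag = std::ldexp(static_cast<float>(mant + (exp ? 1024 : 0)),
                     (exp ? exp : 1) - 25);
  }
  return negative ? -mag : mag;
}

// Hypothetical helper: encodes a float as binary16 bits, rounding with the
// current rounding mode (round-to-nearest-even unless it was changed).
inline std::uint16_t FloatToF16BitsSketch(float v) {
  const unsigned sign = std::signbit(v) ? 0x8000u : 0u;
  if (std::isnan(v)) return static_cast<std::uint16_t>(sign | 0x7E00u);
  const float mag = std::fabs(v);
  if (mag == 0.0f) return static_cast<std::uint16_t>(sign);
  if (std::isinf(mag)) return static_cast<std::uint16_t>(sign | 0x7C00u);
  int e = 0;
  const float m = std::frexp(mag, &e);  // mag = m * 2^e with m in [0.5, 1)
  int biased = e + 14;                  // candidate binary16 exponent field
  if (biased <= 0) {
    // Subnormal range: the result is a multiple of 2^-24. Rounding up to
    // 1024 naturally produces the smallest normal encoding.
    const unsigned sub =
        static_cast<unsigned>(std::nearbyint(std::ldexp(mag, 24)));
    return static_cast<std::uint16_t>(sign | sub);
  }
  unsigned mant = static_cast<unsigned>(std::nearbyint(std::ldexp(m, 11)));
  if (mant == 2048u) {  // rounding carried into the next binade
    mant = 1024u;
    ++biased;
  }
  if (biased >= 31) return static_cast<std::uint16_t>(sign | 0x7C00u);
  return static_cast<std::uint16_t>(
      sign | (static_cast<unsigned>(biased) << 10) | (mant - 1024u));
}

// Widen both operands to float, add there, then narrow the sum back.
inline std::uint16_t AddF16BitsSketch(std::uint16_t a, std::uint16_t b) {
  return FloatToF16BitsSketch(F16BitsToFloatSketch(a) +
                              F16BitsToFloatSketch(b));
}

// Usage: 1.0 (0x3C00) plus the binary16 epsilon 2^-10 (0x1400) is the next
// representable value after 1.0, i.e. 0x3C01.
inline void DemoAddF16BitsSketch() {
  assert(AddF16BitsSketch(0x3C00, 0x1400) == 0x3C01);
}
```

Because the addition happens in float, the code never needs an operator+ on the 16-bit type, which is the portability point raised in the thread.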

seiko2plus · 2025-09-30T12:44:51Z

To clarify:

With flush-to-zero enabled, this emulation still handles subnormals, so it’s fine for the following branches to treat the keys as integers:

highway/hwy/contrib/sort/vqsort-inl.h

Lines 1283 to 1290 in 0bb1960

    
    if (!hwy::IsFloat<TFromD<D>>()) {
      // OR of XOR-difference may be faster than comparison.
      V diff = Zero(d);
      for (size_t i = 0; i < kSampleLanes; i += N) {
        const V v = Load(d, samples + i);
        diff = OrXor(diff, first, v);
      }
      return st.NoKeyDifference(d, diff);

highway/hwy/contrib/sort/vqsort-inl.h

Lines 1434 to 1458 in 0bb1960

    
    // Disable the OrXor optimization for floats because OrXor will not treat
    // subnormals the same as actual comparisons, leading to logic errors for
    // 2-value cases.
    if (!hwy::IsFloat<T>()) {
      // Sticky bits registering any difference between `keys` and the first key.
      // We use vector XOR because it may be cheaper than comparisons, especially
      // for 128-bit. 2x unrolled for more ILP.
      Vec<D> diff0 = zero;
      Vec<D> diff1 = zero;
      // We want to stop once a difference has been found, but without slowing
      // down the loop by comparing during each iteration. The compromise is to
      // compare after a 'group', which consists of kLoops times two vectors.
      constexpr size_t kLoops = 8;
      const size_t lanes_per_group = kLoops * 2 * N;
      if (num >= lanes_per_group) {
        for (; i <= num - lanes_per_group; i += lanes_per_group) {
          HWY_DEFAULT_UNROLL
          for (size_t loop = 0; loop < kLoops; ++loop) {
            const Vec<D> v0 = Load(d, keys + i + loop * 2 * N);
            const Vec<D> v1 = Load(d, keys + i + loop * 2 * N + N);
            diff0 = OrXor(diff0, v0, pivot);
            diff1 = OrXor(diff1, v1, pivot);
          }
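
A scalar sketch of why the XOR-difference shortcut agrees with an integer-domain comparator. It assumes a comparator that maps each bit pattern to a distinct unsigned key, so two keys are equal exactly when their bits are identical. TotalOrderKeySketch and AnyKeyDiffersSketch are hypothetical names, and the key transform is a common technique that may differ from what the PR implements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical integer-domain key for binary16 bits: unsigned comparison of
// the returned keys orders negatives before positives and -0.0 before +0.0.
inline std::uint16_t TotalOrderKeySketch(std::uint16_t bits) {
  return (bits & 0x8000u) ? static_cast<std::uint16_t>(~bits)
                          : static_cast<std::uint16_t>(bits | 0x8000u);
}

// Scalar analogue of the OrXor loops: accumulates a sticky OR of the XOR
// difference against the first key and reports whether any key differs.
inline bool AnyKeyDiffersSketch(const std::uint16_t* keys, std::size_t num) {
  std::uint16_t diff = 0;
  for (std::size_t i = 1; i < num; ++i) {
    diff = static_cast<std::uint16_t>(diff | (keys[i] ^ keys[0]));
  }
  return diff != 0;
}

// Usage: a subnormal (0x0001) and +0.0 (0x0000) differ in bits, and the
// integer-domain keys also order them differently, so both views agree.
inline void DemoKeyDifferenceSketch() {
  const std::uint16_t two_values[2] = {0x0000, 0x0001};
  assert(AnyKeyDiffersSketch(two_values, 2));
  assert(TotalOrderKeySketch(0x0000) < TotalOrderKeySketch(0x0001));
}
```

With native floating-point comparisons and flush-to-zero, those two inputs could compare equal while their bits differ, which is the mismatch the quoted comment in vqsort-inl.h warns about.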

NumPy CI (ENH: Migrate x86 quick/select sorting to Highway (part 1/2) numpy/numpy#29829) triggered errors, mostly related to select and MMX prefetch. I’ll follow up with fixes after addressing these issues.

seiko2plus · 2025-09-30T17:04:29Z

NumPy CI (numpy/numpy#29829) triggered errors, mostly related to select and MMX prefetch. I’ll follow up with fixes after addressing these issues.

Regarding VQSelect, the runtime failure was caused by NumPy’s test itself, since the order within partitions is undefined.
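
To illustrate the point about partition order, here is a sketch of a check that accepts any valid selection result. IsValidSelectSketch is a hypothetical helper, and std::nth_element stands in for the selection routine; this is not NumPy's actual test.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if v is a valid selection result for index k: nothing before
// k is greater than v[k] and nothing after k is smaller. The order inside
// each partition is deliberately not inspected.
template <typename T>
bool IsValidSelectSketch(const std::vector<T>& v, std::size_t k) {
  if (k >= v.size()) return false;
  const auto kth = v.begin() + static_cast<std::ptrdiff_t>(k);
  const bool left_ok =
      std::all_of(v.begin(), kth, [&](const T& x) { return !(*kth < x); });
  const bool right_ok =
      std::all_of(kth + 1, v.end(), [&](const T& x) { return !(x < *kth); });
  return left_ok && right_ok;
}

// Usage: two different arrangements are both acceptable for k = 2.
inline void DemoSelectSketch() {
  std::vector<int> v = {5, 1, 4, 2, 3};
  std::nth_element(v.begin(), v.begin() + 2, v.end());
  assert(IsValidSelectSketch(v, 2));
  assert(IsValidSelectSketch(std::vector<int>{2, 1, 3, 5, 4}, 2));
}
```

A test that compares the output element by element against one reference arrangement would reject valid results; the sketch only checks the partition property and does not verify that the output is a permutation of the input.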

Now, all expected tests passed except for a _mm_prefetch build error on clang-cl:

umpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
  102 |   _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0);
      |   ^
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx

We pass -mno-mmx when targeting AVX512* due to a previously discovered bug that corrupts the x87 FP stack; I haven’t revisited this since. I don’t know why clang-cl requires MMX here; wouldn’t SSE be enough? (Possibly something Windows-related.) In any case, I think this PR is ready to go. I’ll follow up with another PR that fixes this clang-cl build error by falling back to __builtin_prefetch when building with clang-cl and -mno-mmx.
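
A rough sketch of the fallback described above. The actual follow-up PR may be structured differently: SKETCH_PREFETCH and PrefetchSketch are hypothetical names, and detecting the -mno-mmx case through the absence of the __MMX__ predefined macro is an assumption.

```cpp
#include <cassert>

#if defined(__clang__) && defined(_MSC_VER) && !defined(__MMX__)
// clang-cl without MMX: _mm_prefetch is rejected, so use the builtin.
#define SKETCH_PREFETCH(p) __builtin_prefetch((p), /*rw=*/0, /*locality=*/3)
#elif defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define SKETCH_PREFETCH(p) \
  _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0)
#elif defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREFETCH(p) __builtin_prefetch((p), 0, 3)
#else
#define SKETCH_PREFETCH(p) static_cast<void>(p)  // no prefetch available
#endif

// Hints that the cache line containing *p will be read soon. Prefetching
// never changes program state, so callers may ignore it entirely.
template <typename T>
inline void PrefetchSketch(const T* p) {
  SKETCH_PREFETCH(p);
}

// Usage: the prefetched value is untouched.
inline void DemoPrefetchSketch() {
  const int value = 42;
  PrefetchSketch(&value);
  assert(value == 42);
}
```

The first branch is checked before the SSE branch so that clang-cl with -mno-mmx never reaches _mm_prefetch, even though SSE is still enabled.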

jan-wassenberg

👍

jan-wassenberg · 2025-10-01T15:48:30Z

Internal tests/builds are unfortunately failing due to a build timeout in optimized asan builds for the f16 file. What do you think of splitting it up into separate files for Sort vs Select vs PartialSort? Disabling f16 for asan is probably not desirable.
Any other way we can reduce compile time?

seiko2plus · 2025-10-01T17:41:50Z

What do you think of splitting it up into separate files for Sort vs Select vs PartialSort?

That could be a straightforward fix. The current CPU dispatch approach can exhaust available physical memory, forcing the system to fall back on swap, which is likely why the build is hitting the limit.

Have you tried reducing the build parallelism e.g. -j4? That could be a quick fix.

Any other way we can reduce compile time?

Using PCH could help. Let me give it a try.

jan-wassenberg · 2025-10-02T14:02:15Z

Ah, possible that swapping might slow things down, yes. Unfortunately we don't have control over the build machines.

Good point about PCH. We can at least use clang header modules in the BUILD file. I'll try that.

johnplatts reviewed Sep 28, 2025

seiko2plus force-pushed the vsort-emu-fp16 branch 2 times, most recently from 67ec002 to b9334bf Compare September 28, 2025 22:04

seiko2plus force-pushed the vsort-emu-fp16 branch from b9334bf to b97c259 Compare September 28, 2025 22:26

jan-wassenberg requested changes Sep 29, 2025

contrib/sort: fp16 iota fix and emulated helpers cleanup

9282c16

seiko2plus force-pushed the vsort-emu-fp16 branch from 45d5873 to 9282c16 Compare September 30, 2025 10:00

seiko2plus added a commit to seiko2plus/numpy that referenced this pull request Sep 30, 2025

Push Highway to google/highway#2737 needs for emulate f16/f64 sorting

3467d5f

jan-wassenberg requested changes Sep 30, 2025

contrib/sort: Simplify SortDescending in emulated order

746c513

seiko2plus marked this pull request as ready for review September 30, 2025 11:23

contrib/sort: use AddWrapper for hwy::float16_t; avoid operator+ (portability)

0f85212

seiko2plus added a commit to seiko2plus/numpy that referenced this pull request Sep 30, 2025

Push Highway to google/highway#2737 needs for emulate f16/f64 sorting

cafd214

seiko2plus mentioned this pull request Oct 1, 2025

BLD: fallback to __builtin_prefetch on clang-cl with -mno-mmx #2742

Merged

jan-wassenberg approved these changes Oct 1, 2025

jan-wassenberg added the ready to pull label Oct 1, 2025
