Skip to content

Conversation

@seiko2plus
Copy link
Contributor

Add an integer-domain total-order comparator and use it for float16_t/double on targets lacking native fp operations. Preserves IEEE intent (e.g., -0.0 < +0.0); Wires into sort/select traits, removes FP-only guards, and enables basic tests.

This is part of enabling Highway vqsort for NumPy on x86, as intel-x86-sort provides emulation for half-precision

@seiko2plus seiko2plus force-pushed the vsort-emu-fp16 branch 2 times, most recently from 67ec002 to b9334bf Compare September 28, 2025 22:04
…ive FP)

Add an integer-domain total-order comparator and use it for float16_t/double
on targets lacking native fp operations. Preserves IEEE intent (e.g., -0.0 < +0.0);
Wires into sort/select traits, removes FP-only guards, and enables basic tests.

This is part of enabling Highway vqsort for NumPy on x86.
Copy link
Member

@jan-wassenberg jan-wassenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, it's great to make this unconditionally available.
Generally LGTM but the -0 != 0 part is surprising, do we really need that?

}

template <class TF, class V, class D = DFromV<V>, HWY_IF_UNSIGNED_D(D)>
HWY_API Mask<D> IsInfBin(V v) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we improve the name somehow? Maybe IsInfWrapper or IsInfEmulated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done IsInfWrapper/IsNaNWrapper

auto in = hwy::AllocateAligned<T>(num);
HWY_ASSERT(in);
Fill(d, GetLane(Inf(d)), num, in.get());
Fill(d, BitCastScalar<T>(ExponentMask<TF>()), num, in.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also make a wrapper function for Inf() in the same way as IsInf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just discovered NegativeInfOrLowestValue/PositiveInfOrHighestValue which can be used instead

@seiko2plus
Copy link
Contributor Author

but the -0 != 0 part is surprising, do we really need that?

IMHO, treating -0.0 < 0.0, and -0.0 != 0.0 as true is numerically reasonable for sorting. Enforcing (-0 == 0) == true and (-0 < 0) == false would require extra mask/blend steps to emulate floating-point behavior, which seems unnecessary here and adds overhead. If I’m not mistaken, the ordering of ±0 is currently unspecified, so allowing -0.0 < 0.0 is acceptable and improves performance.

For reference, here’s a benchmark: hwy::vqsort vs Intel’s sort using the NumPy benchmark suite.

NumPy’s upcoming patch (abandoning x86-simd-sort), tested on an AMD Ryzen 7 7700X (no native FP16 support), Clang 19.1.7

| Change   | Before [4486271c] <hwy-x86-vqsort~4>   | After [3310265e] <hwy-x86-vqsort>   |   Ratio | Benchmark (Parameter)                                                                    |
|----------|----------------------------------------|-------------------------------------|---------|------------------------------------------------------------------------------------------|
| +        | 19.4±0.08ms                            | 23.5±0.3ms                          |    1.21 | bench_function_base.Sort.time_sort('heap', 'uint32', ('reversed',))                      |
| +        | 19.4±0.01ms                            | 23.5±0.3ms                          |    1.21 | bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',))                      |
| +        | 19.4±0.01ms                            | 23.4±0.7ms                          |    1.2  | bench_function_base.Sort.time_sort('heap', 'int32', ('reversed',))                       |
| +        | 19.5±0.01ms                            | 23.4±0.6ms                          |    1.2  | bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))                      |
| +        | 19.4±0.01ms                            | 23.3±0.4ms                          |    1.2  | bench_function_base.Sort.time_sort('quick', 'uint32', ('reversed',))                     |
| +        | 19.5±0.01ms                            | 23.1±0.2ms                          |    1.19 | bench_function_base.Sort.time_sort('heap', 'int32', ('ordered',))                        |
| +        | 19.5±0.01ms                            | 23.1±0.3ms                          |    1.19 | bench_function_base.Sort.time_sort('heap', 'uint32', ('ordered',))                       |
| +        | 19.5±0.01ms                            | 23.1±0.3ms                          |    1.19 | bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))                       |
| +        | 598±20μs                               | 701±40μs                            |    1.17 | bench_function_base.Sort.time_argsort('merge', 'float32', ('ordered',))                  |
| +        | 19.9±0.1ms                             | 23.4±0.2ms                          |    1.17 | bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))                      |
| +        | 19.9±0.07ms                            | 23.3±0.3ms                          |    1.17 | bench_function_base.Sort.time_sort('heap', 'float32', ('reversed',))                     |
| +        | 19.9±0.04ms                            | 23.3±0.5ms                          |    1.17 | bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',))                    |
| +        | 19.9±0.02ms                            | 23.1±0.6ms                          |    1.16 | bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',))                     |
| +        | 205±0.6μs                              | 229±0.6μs                           |    1.11 | bench_function_base.Partition.time_argpartition('float16', ('sorted_block', 1000), 10)   |
| +        | 206±0.5μs                              | 229±0.3μs                           |    1.11 | bench_function_base.Partition.time_argpartition('float16', ('sorted_block', 1000), 100)  |
| +        | 207±0.08μs                             | 231±0.2μs                           |    1.11 | bench_function_base.Partition.time_argpartition('float16', ('sorted_block', 1000), 1000) |
| +        | 20.3±0.03ms                            | 22.5±0.05ms                         |    1.11 | bench_function_base.Sort.time_sort('heap', 'int32', ('random',))                         |
| +        | 20.3±0.08ms                            | 22.5±0.03ms                         |    1.11 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',))                        |
| +        | 20.3±0.01ms                            | 22.5±0.05ms                         |    1.11 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))                       |
| +        | 20.4±0.1ms                             | 22.5±0.03ms                         |    1.1  | bench_function_base.Sort.time_sort('heap', 'uint32', ('random',))                        |
| +        | 20.9±0.1ms                             | 22.8±0.02ms                         |    1.09 | bench_function_base.Sort.time_sort('heap', 'float32', ('random',))                       |
| +        | 21.2±0.01ms                            | 23.1±0.05ms                         |    1.09 | bench_function_base.Sort.time_sort('heap', 'int32', ('sorted_block', 10))                |
| +        | 21.2±0.02ms                            | 23.2±0.03ms                         |    1.09 | bench_function_base.Sort.time_sort('heap', 'uint32', ('sorted_block', 10))               |
| +        | 20.9±0.08ms                            | 22.8±0.2ms                          |    1.09 | bench_function_base.Sort.time_sort('quick', 'float32', ('random',))                      |
| +        | 21.3±0.03ms                            | 23.3±0.04ms                         |    1.09 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))               |
| +        | 21.3±0.06ms                            | 23.1±0.2ms                          |    1.09 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))              |
| +        | 21.3±0.04ms                            | 23.1±0.06ms                         |    1.09 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))              |
| +        | 21.8±0.03ms                            | 23.5±0.05ms                         |    1.08 | bench_function_base.Sort.time_sort('heap', 'float32', ('sorted_block', 10))              |
| +        | 21.3±0.04ms                            | 22.9±0.07ms                         |    1.08 | bench_function_base.Sort.time_sort('heap', 'int32', ('sorted_block', 100))               |
| +        | 21.8±0.03ms                            | 23.6±0.07ms                         |    1.08 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10))             |
| +        | 21.3±0.03ms                            | 22.9±0.3ms                          |    1.08 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))             |
| +        | 21.3±0.04ms                            | 22.8±0.1ms                          |    1.07 | bench_function_base.Sort.time_sort('heap', 'uint32', ('sorted_block', 100))              |
| +        | 608±20μs                               | 644±20μs                            |    1.06 | bench_function_base.Sort.time_argsort('merge', 'float32', ('uniform',))                  |
| +        | 21.9±0.05ms                            | 23.2±0.1ms                          |    1.06 | bench_function_base.Sort.time_sort('heap', 'float32', ('sorted_block', 100))             |
| +        | 21.8±0.05ms                            | 23.2±0.07ms                         |    1.06 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))            |
| -        | 79.9±0.1μs                             | 73.5±0.2μs                          |    0.92 | bench_function_base.Partition.time_argpartition('int16', ('sorted_block', 10), 10)       |
| -        | 79.9±0.1μs                             | 73.3±0.1μs                          |    0.92 | bench_function_base.Partition.time_argpartition('int16', ('sorted_block', 10), 100)      |
| -        | 80.9±0.2μs                             | 74.2±0.2μs                          |    0.92 | bench_function_base.Partition.time_argpartition('int16', ('sorted_block', 10), 1000)     |
| -        | 357±0.03μs                             | 321±3μs                             |    0.9  | bench_function_base.Partition.time_partition('float32', ('sorted_block', 100), 100)      |
| -        | 360±0.09μs                             | 316±3μs                             |    0.88 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 100), 1000)     |
| -        | 734±0.6μs                              | 645±3μs                             |    0.88 | bench_function_base.Sort.time_argsort('merge', 'float64', ('uniform',))                  |
| -        | 357±0.07μs                             | 310±5μs                             |    0.87 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 100), 10)       |
| -        | 311±0.2μs                              | 270±4μs                             |    0.87 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 1000), 10)        |
| -        | 734±0.8μs                              | 638±20μs                            |    0.87 | bench_function_base.Sort.time_argsort('merge', 'float64', ('ordered',))                  |
| -        | 81.0±0.3μs                             | 70.2±0.7μs                          |    0.87 | bench_function_base.Sort.time_sort('quick', 'float16', ('uniform',))                     |
| -        | 81.0±0.2μs                             | 70.0±0.6μs                          |    0.86 | bench_function_base.Sort.time_sort('heap', 'float16', ('uniform',))                      |
| -        | 312±0.2μs                              | 265±3μs                             |    0.85 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 1000), 100)       |
| -        | 310±0.2μs                              | 261±8μs                             |    0.84 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 1000), 1000)      |
| -        | 344±0.08μs                             | 270±3μs                             |    0.79 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 1000), 10)      |
| -        | 343±0.07μs                             | 270±5μs                             |    0.79 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 1000), 1000)    |
| -        | 439±0.09μs                             | 342±2μs                             |    0.78 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 10), 10)          |
| -        | 438±0.2μs                              | 343±3μs                             |    0.78 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 10), 1000)        |
| -        | 344±0.1μs                              | 265±7μs                             |    0.77 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 1000), 100)     |
| -        | 439±0.2μs                              | 338±3μs                             |    0.77 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 10), 100)         |
| -        | 474±0.07μs                             | 344±0.9μs                           |    0.73 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 10), 10)        |
| -        | 474±0.09μs                             | 348±2μs                             |    0.73 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 10), 100)       |
| -        | 473±0.08μs                             | 342±2μs                             |    0.72 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 10), 1000)      |
| -        | 14.6±0.01ms                            | 5.04±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('heap', 'int16', ('random',))                         |
| -        | 14.8±0.02ms                            | 5.17±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 100))               |
| -        | 14.7±0.03ms                            | 5.13±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 1000))              |
| -        | 14.6±0.01ms                            | 5.04±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('quick', 'int16', ('random',))                        |
| -        | 14.8±0.02ms                            | 5.16±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100))              |
| -        | 14.8±0.02ms                            | 5.15±0.04ms                         |    0.35 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 1000))             |
| -        | 15.0±0.01ms                            | 5.18±0.01ms                         |    0.34 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 10))                |
| -        | 15.0±0.01ms                            | 5.17±0.01ms                         |    0.34 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 10))               |
| -        | 225±0.07μs                             | 72.2±0.1μs                          |    0.32 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 10), 10)          |
| -        | 225±0.04μs                             | 72.2±0.2μs                          |    0.32 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 10), 100)         |
| -        | 225±0.08μs                             | 72.6±0.2μs                          |    0.32 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 10), 1000)        |
| -        | 15.0±0ms                               | 4.72±0.04ms                         |    0.32 | bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',))                       |
| -        | 15.0±0ms                               | 4.71±0.07ms                         |    0.31 | bench_function_base.Sort.time_sort('heap', 'int16', ('ordered',))                        |
| -        | 246±0.1μs                              | 69.9±0.6μs                          |    0.28 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 100), 10)         |
| -        | 246±0.1μs                              | 69.2±0.5μs                          |    0.28 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 100), 100)        |
| -        | 247±0.05μs                             | 69.7±0.4μs                          |    0.28 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 100), 1000)       |
| -        | 209±0.1μs                              | 55.6±0.9μs                          |    0.27 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 1000), 10)        |
| -        | 209±0.2μs                              | 55.8±0.6μs                          |    0.27 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 1000), 1000)      |
| -        | 209±0.1μs                              | 55.0±0.2μs                          |    0.26 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 1000), 100)       |
| -        | 2.33±0ms                               | 514±10μs                            |    0.22 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 10))              |
| -        | 2.33±0ms                               | 523±4μs                             |    0.22 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 10))             |
| -        | 2.15±0.01ms                            | 440±3μs                             |    0.21 | bench_function_base.Sort.time_sort('heap', 'float16', ('ordered',))                      |
| -        | 2.36±0.01ms                            | 495±6μs                             |    0.21 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',))                       |
| -        | 2.15±0ms                               | 444±2μs                             |    0.21 | bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',))                     |
| -        | 2.37±0ms                               | 499±10μs                            |    0.21 | bench_function_base.Sort.time_sort('quick', 'float16', ('random',))                      |
| -        | 2.31±0.01ms                            | 460±4μs                             |    0.2  | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 100))            |
| -        | 2.33±0.01ms                            | 454±4μs                             |    0.19 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100))             |
| -        | 36.5±0.05ms                            | 5.74±0.06ms                         |    0.16 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 10))              |
| -        | 37.0±0.08ms                            | 5.78±0.05ms                         |    0.16 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 100))             |
| -        | 35.1±0.08ms                            | 5.56±0.1ms                          |    0.16 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 10))                |
| -        | 36.5±0.06ms                            | 5.68±0.09ms                         |    0.16 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))             |
| -        | 37.1±0.1ms                             | 5.85±0.2ms                          |    0.16 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))            |
| -        | 35.1±0.08ms                            | 5.56±0.07ms                         |    0.16 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))               |
| -        | 33.8±0.3ms                             | 5.05±0.06ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('ordered',))                      |
| -        | 34.9±0.1ms                             | 5.12±0.02ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('random',))                       |
| -        | 34.0±0.4ms                             | 5.14±0.05ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('reversed',))                     |
| -        | 39.2±0.08ms                            | 5.97±0.08ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 1000))            |
| -        | 32.5±0.3ms                             | 4.94±0.03ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('ordered',))                        |
| -        | 33.7±0.2ms                             | 4.98±0.01ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('random',))                         |
| -        | 32.8±0.3ms                             | 4.99±0.03ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('reversed',))                       |
| -        | 35.6±0.07ms                            | 5.34±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 100))               |
| -        | 37.7±0.1ms                             | 5.85±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 1000))              |
| -        | 33.8±0.2ms                             | 5.12±0.04ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))                     |
| -        | 35.0±0.2ms                             | 5.15±0.08ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',))                      |
| -        | 33.9±0.2ms                             | 5.17±0.04ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))                    |
| -        | 39.3±0.1ms                             | 5.98±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))           |
| -        | 32.4±0.1ms                             | 4.94±0.05ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))                       |
| -        | 33.6±0.1ms                             | 5.19±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('random',))                        |
| -        | 32.6±0.2ms                             | 5.03±0.09ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))                      |
| -        | 35.6±0.09ms                            | 5.50±0.06ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))              |
| -        | 37.8±0.08ms                            | 5.79±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))             |
| -        | 35.1±0.2ms                             | 5.31±0.03ms                         |    0.15 | bench_function_base.Sort.time_sort_worst                                                 |
| -        | 3.37±0.01ms                            | 440±4μs                             |    0.13 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000))            |
| -        | 3.37±0.02ms                            | 437±6μs                             |    0.13 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000))           |
| -        | 331±0.03μs                             | 38.6±0.2μs                          |    0.12 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 10), 10)        |
| -        | 331±0.09μs                             | 38.6±0.06μs                         |    0.12 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 10), 100)       |
| -        | 331±0.08μs                             | 38.7±0.2μs                          |    0.12 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 10), 1000)      |
| -        | 323±0.1μs                              | 34.0±0.2μs                          |    0.11 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 100), 10)       |
| -        | 323±0.08μs                             | 34.2±0.1μs                          |    0.11 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 100), 100)      |
| -        | 323±0.05μs                             | 33.8±0.07μs                         |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 100), 1000)     |
| -        | 288±0.2μs                              | 29.9±0.2μs                          |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 1000), 10)      |
| -        | 288±0.07μs                             | 29.9±0.2μs                          |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 1000), 100)     |
| -        | 287±0.05μs                             | 29.9±0.03μs                         |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 1000), 1000)    |
| -        | 537±1μs                                | 50.1±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 10), 10)        |
| -        | 536±0.2μs                              | 50.1±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 10), 100)       |
| -        | 536±0.1μs                              | 50.3±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 10), 1000)      |
| -        | 484±0.7μs                              | 44.6±0.07μs                         |    0.09 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 10), 10)          |
| -        | 484±0.2μs                              | 44.6±0.2μs                          |    0.09 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 10), 100)         |
| -        | 485±0.2μs                              | 44.6±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 10), 1000)        |
| -        | 639±0.7μs                              | 48.8±0.3μs                          |    0.08 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 1000), 10)      |
| -        | 639±0.8μs                              | 48.5±0.5μs                          |    0.08 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 1000), 100)     |
| -        | 635±0.08μs                             | 48.9±0.2μs                          |    0.08 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 1000), 1000)    |
| -        | 582±0.5μs                              | 43.7±0.3μs                          |    0.08 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 1000), 100)       |
| -        | 578±0.9μs                              | 43.8±0.2μs                          |    0.08 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 1000), 1000)      |
| -        | 581±0.5μs                              | 43.4±0.1μs                          |    0.07 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 1000), 10)        |
| -        | 807±0.1μs                              | 51.6±0.4μs                          |    0.06 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 100), 10)       |
| -        | 807±0.3μs                              | 51.2±0.3μs                          |    0.06 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 100), 100)      |
| -        | 808±0.4μs                              | 52.0±0.4μs                          |    0.06 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 100), 1000)     |
| -        | 743±0.7μs                              | 46.6±0.2μs                          |    0.06 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 100), 10)         |
| -        | 743±0.5μs                              | 46.6±0.3μs                          |    0.06 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 100), 100)        |
| -        | 745±1μs                                | 46.1±0.3μs                          |    0.06 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 100), 1000)       |

@jan-wassenberg
Copy link
Member

FWIW -0 vs 0 was not specified because we just went with what IEEE754 says.
So the main driver seems to be that we just want to do the integer comparison after conditionally flipping bits. But the extra check might not add much to the cost of the bit flipping, it's one extra comparison which can run in parallel, plus a masked XOR (somehow we only have MaskedOr, not MaskedXor, but that can be added).
What do you think, would it be worth a try to see how (in)expensive that is?

@jan-wassenberg
Copy link
Member

As discussed, I understand that the ordering of equivalent keys can anyway change in quicksort, so it's fine to keep the current -0 != 0.

- Replace std::iota with IotaWrapper for fp16 (ARM/ppc)
- Use PositiveInfOrHighestValue in emulated traits
- Rename IsNaN/IsInf Bin → Wrapper
- Update tests and call sites (traits, order-emulate, vqsort)
Copy link
Member

@jan-wassenberg jan-wassenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice simplifications, I especially like IsEmulatedMinMax, thanks!
Wrapper LGTM.
One remaining concern about inheritance of Descending from Ascending, apart from that good to go.

using TF = typename Base::KeyType;

HWY_INLINE bool Compare1(const T* a, const T* b) const {
return reinterpret_cast<_Asc>(this)->Compare1(b, a);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reinterpret_cast and inheritance seems questionable here.
We could instead use a typedef like you have (but no _ prefix because that is reserved by C++), and instead write return AscBase().Compare(d, b, a) - WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, done

@seiko2plus seiko2plus marked this pull request as ready for review September 30, 2025 11:23
@seiko2plus
Copy link
Contributor Author

seiko2plus commented Sep 30, 2025

apart from that good to go.

Not, yet. CI trigger errors due to missing float16_t's operator+ by clang & gcc compiler:
https://github.com/google/highway/actions/runs/18126110238/job/51581687220?pr=2737

/home/runner/work/highway/highway/hwy/contrib/sort/sort_unit_test.cc:84:73: error: invalid operands to binary expression ('hwy::float16_t' and 'hwy::float16_t')
    const Vec<D> epsp1 = Set(d, BitCastScalar<T>(ConvertScalarTo<TF>(1) + hwy::Epsilon<TF>()));
                                                 ~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~
/home/runner/work/highway/highway/hwy/tests/test_util-inl.h:265:7: note: in instantiation of function template specialization 'hwy::N_SSSE3::(anonymous namespace)::TestFloatLargerSmaller::operator()<hwy::float16_t, hwy::N_SSSE3::Simd<hwy::float16_t, 8, 0>>' requested here
      Test()(T(), d);
      ^
/home/runner/work/highway/highway/hwy/tests/test_util-inl.h:411:61: note: in instantiation of member function 'hwy::N_SSSE3::detail::ForeachCappedR<hwy::float16_t, 8, 1, hwy::N_SSSE3::(anonymous namespace)::TestFloatLargerSmaller, 0>::Do' requested here

seems the following branch need to refine:

highway/hwy/base.h

Lines 1228 to 1250 in 0bb1960

#ifndef HWY_HAVE_SCALAR_F16_TYPE // allow override
// Compiler supports _Float16, not necessarily with operators.
#if HWY_NEON_HAVE_F16C || HWY_RVV_HAVE_F16_VEC || HWY_SSE2_HAVE_F16_TYPE || \
__SPIRV_DEVICE__
#define HWY_HAVE_SCALAR_F16_TYPE 1
#else
#define HWY_HAVE_SCALAR_F16_TYPE 0
#endif
#endif // HWY_HAVE_SCALAR_F16_TYPE
#ifndef HWY_HAVE_SCALAR_F16_OPERATORS
// Recent enough compiler also has operators.
#if HWY_HAVE_SCALAR_F16_TYPE && \
(HWY_COMPILER_CLANG >= 1800 || HWY_COMPILER_GCC_ACTUAL >= 1200 || \
(HWY_COMPILER_CLANG >= 1500 && !HWY_COMPILER_CLANGCL && \
!defined(_WIN32)) || \
(HWY_ARCH_ARM && \
(HWY_COMPILER_CLANG >= 900 || HWY_COMPILER_GCC_ACTUAL >= 800)))
#define HWY_HAVE_SCALAR_F16_OPERATORS 1
#else
#define HWY_HAVE_SCALAR_F16_OPERATORS 0
#endif
#endif // HWY_HAVE_SCALAR_F16_OPERATORS

@seiko2plus
Copy link
Contributor Author

I'm going to avoid using operator + in here instead for now to bypass ci errors:

const Vec<D> p1 = Set(d, BitCastScalar<T>(ConvertScalarTo<TF>(1)));
const Vec<D> pinf = BitCast(d, Set(du, ExponentMask<TF>()));
const Vec<D> peps = Set(d, BitCastScalar<T>(hwy::Epsilon<TF>()));
const Vec<D> pmax = Set(d, BitCastScalar<T>(hwy::HighestValue<TF>()));
const Vec<D> epsp1 = Set(d, BitCastScalar<T>(ConvertScalarTo<TF>(1) + hwy::Epsilon<TF>()));
const Vec<D> epsn1 = Or(epsp1, n0);

@jan-wassenberg
Copy link
Member

I'm going to avoid using operator + in here instead for now to bypass ci errors:

const Vec<D> p1 = Set(d, BitCastScalar<T>(ConvertScalarTo<TF>(1)));
const Vec<D> pinf = BitCast(d, Set(du, ExponentMask<TF>()));
const Vec<D> peps = Set(d, BitCastScalar<T>(hwy::Epsilon<TF>()));
const Vec<D> pmax = Set(d, BitCastScalar<T>(hwy::HighestValue<TF>()));
const Vec<D> epsp1 = Set(d, BitCastScalar<T>(ConvertScalarTo<TF>(1) + hwy::Epsilon<TF>()));
const Vec<D> epsn1 = Or(epsp1, n0);

Agreed, what we usually do for bf16 and fp16 is to ConvertScalarTo<float>, add, then convert back to the actual type.

@seiko2plus
Copy link
Contributor Author

To clarify:

  • With flush-to-zero enabled, this emulation still handles subnormals, so it’s fine to treat them as integers by the following branches:

    if (!hwy::IsFloat<TFromD<D>>()) {
    // OR of XOR-difference may be faster than comparison.
    V diff = Zero(d);
    for (size_t i = 0; i < kSampleLanes; i += N) {
    const V v = Load(d, samples + i);
    diff = OrXor(diff, first, v);
    }
    return st.NoKeyDifference(d, diff);

  • // Disable the OrXor optimization for floats because OrXor will not treat
    // subnormals the same as actual comparisons, leading to logic errors for
    // 2-value cases.
    if (!hwy::IsFloat<T>()) {
    // Sticky bits registering any difference between `keys` and the first key.
    // We use vector XOR because it may be cheaper than comparisons, especially
    // for 128-bit. 2x unrolled for more ILP.
    Vec<D> diff0 = zero;
    Vec<D> diff1 = zero;
    // We want to stop once a difference has been found, but without slowing
    // down the loop by comparing during each iteration. The compromise is to
    // compare after a 'group', which consists of kLoops times two vectors.
    constexpr size_t kLoops = 8;
    const size_t lanes_per_group = kLoops * 2 * N;
    if (num >= lanes_per_group) {
    for (; i <= num - lanes_per_group; i += lanes_per_group) {
    HWY_DEFAULT_UNROLL
    for (size_t loop = 0; loop < kLoops; ++loop) {
    const Vec<D> v0 = Load(d, keys + i + loop * 2 * N);
    const Vec<D> v1 = Load(d, keys + i + loop * 2 * N + N);
    diff0 = OrXor(diff0, v0, pivot);
    diff1 = OrXor(diff1, v1, pivot);
    }

  • NumPy CI (ENH: Migrate x86 quick/select sorting to Highway (part 1/2) numpy/numpy#29829) triggered errors, mostly related to select and MMX prefetch. I’ll follow up with fixes after addressing these issues.

seiko2plus added a commit to seiko2plus/numpy that referenced this pull request Sep 30, 2025
@seiko2plus
Copy link
Contributor Author

seiko2plus commented Sep 30, 2025

NumPy CI (numpy/numpy#29829) triggered errors, mostly related to select and MMX prefetch. I’ll follow up with fixes after addressing these issues.

Regarding VQselect, the runtime test failure was in NumPy’s test itself—since the order within partitions is undefined.

Now, all expected tests passed except for a _mm_prefetch build error on clang-cl:

umpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
  102 |   _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0);
      |   ^
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx

We pass -mno-mmx when targeting AVX512* due to a previously discovered bug that corrupts the x87 FP stack; I haven’t revisited this since. I don’t know why clang-cl requires MMX—wouldn’t SSE be enough? (Possibly something related to Windows) In any case, I think this PR is ready to go. I’ll follow up with another PR to fix this clang-cl build error by falling back to __builtin_prefetch when clang-cl is built with -mno-mmx.

Copy link
Member

@jan-wassenberg jan-wassenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jan-wassenberg
Copy link
Member

Internal tests/builds are unfortunately failing due to a build timeout in optimized asan builds for the f16 file. What do you think of splitting it up into separate files for Sort vs Select vs PartialSort? Disabling f16 for asan is probably not desirable.
Any other way we can reduce compile time?

@seiko2plus
Copy link
Contributor Author

seiko2plus commented Oct 1, 2025

What do you think of splitting it up into separate files for Sort vs Select vs PartialSort?

That could be a straightforward fix. The current CPU dispatch approach can exhaust available physical memory, forcing the system to fall back on swap, which is likely why may hitting the limit.

Have you tried reducing the build parallelism e.g. -j4? That could be a quick fix.

Any other way we can reduce compile time?

Using PCH could help. Let me give it a try.

@jan-wassenberg
Copy link
Member

Ah, possible that swapping might slow things down, yes. Unfortunately we don't have control over the build machines.

Good point about PCH. We can least clang header modules in the BUILD file. I'll try that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants