Fast methods I came up with for performing bit manipulation on the "diagonals" of a 8x8 bit table/bitboard, when represented using a 64-bit integer. The following operations are included:
- A "diagonal shift", where each byte is bit shifted left/right one more/less than the previous.
- Extracting the "diagonal bits" so they are placed horizontally next to each other.
- Depositing the bits from a u8 into the bytes from the bitboard (equivalent to undoing a diagonal or vertical extract)
Could be useful for chess programming when using a classic bitboard. E.g. Quickly calculating non-blocked moves for a bishop when used in conjunction with count leading/trailing zeros.
Functions with the '_SSE' suffix
Work because each row is 8-bits wide so they can be operated on individually using special SIMD instructions. Only uses SSE2 instructions so it can work on any x86 64-bit CPU.
Unfortunately, there is no 8-bit shift or multiply instruction, nor is there a "shift by value in corresponding vector" until AVX2, so we must do some packing/unpacking
(_mm_sll_epi16
shifts every element in the first vector by the value in the lower 64-bits of the second).
- Load into 128-bit vector and interleave with zero. Effectively zero extends 8-bit to 16-bit.
- Multiply the 16-bit element with powers of two, which acts like a shift.
- Un-interleave by zeroing out the upper 8-bits and convert its 16-bit elements to 8-bit using
_mm_packus_epi16
saturation conversion. - Return the lower 64-bits
If it needs to shift right, it instead interleaves the lower 8-bits with zero and takes the high result of the multiplication (the upper 16-bits of the intermediate 32-bit result).
When multiplying by
Functions with the '_bin' suffix
Works similarly to the "binary search" based method when calculating the number of trailing zeros.
- Bytes that are unaffected are saved using a mask
- For shifted bytes, only the bits came from that byte are saved
- The two parts are combined using a bitwise OR
Start << 2 << 1
ABCD ABCD ABCD
ABCD ABCD BCD- |
ABCD CD-- | CD00
ABCD CD-- | D00- |
Measured as time taken to calculate 1 billion results. Input was from an array of random valued 64-bit ints (n=20k).
The 'Perf' column is in seconds. On my Ryzen 5 5500
using GCC at -O3
, -march=native
or the default.
Visualization
For example, on a 3x3 table, a bottom-to-left diagonal shift would look like:
ABC'ABC'ABC => ABC'BC_'C__
ABC ABC
ABC => BC_
ABC C__
Method | Perf (m=n) | Perf (m=x86-64) |
---|---|---|
Linear | 3.46 | 3.38 |
"Binary" | 1.46 | 1.47 |
SSE left | 0.62 | 0.62 |
SSE right | 0.73 | 0.62 |
The SSE2 based methods calculates a diagonal shift-to-the-left, and then extract the most significant bits of each 8-bit element using _mm_movemask_epi8
.
The "SAD" method masks out the diagonal bits and use _mm_sad_epu8
with zero to effectively bitwise OR between all elements.
Note: The forward method outputs the result in reverse order.
Visualization
ABC
EFG => AFJ
HIJ
The pext
method uses a single BMI2 intrinsic. Not available on older CPUs.
Method | Perf (m=n) | Perf (m=x86-64) |
---|---|---|
Pseudo rot45 | 1.20 | 1.22 |
Psudo rot45 B | 1.10 | 1.12 |
SSE | 0.64 | 0.73 |
pext | 0.77 | N/A |
SAD | 0.55 | 0.55 |
Effectively just a broadcast to every byte and a mask.
Visualization
A00
ABC => 0B0
00C
Method | Perf (m=n) | Perf (m=x86-64) |
---|---|---|
Mul. | 0.75 | 0.75 |
SSE | 0.56 | 0.60 |
Visualization
00A
ABC => 0B0
C00
The pdep
method uses a single BMI2 intrinsic. Not available on older CPUs.
Method | Perf (m=n) | Perf (m=x86-64) |
---|---|---|
Mul. | 1.23 | 1.22 |
SSE | 0.73 | 0.81 |
pdep | 0.80 | N/A |
Input bits to a LSB in each byte. Equivalent to 'row to column' or '90deg rotation'.
Visualization
A00
ABC => B00
C00
The pdep
method uses a single BMI2 intrinsic. Not available on older CPUs.
Method | Perf (m=n) | Perf (m=x86-64) |
---|---|---|
"Binary" | 1.41 | 1.42 |
Mul. | 0.89 | 0.89 |
clMul. | 0.95 | N/A |
pdep | 0.81 | N/A |
SSE | 0.73 | 0.75 |