thrift: varint: BMI2 (pdep) based varint encoding: branchless 2-5x faster than loop unrolled

Summary:
BMI2 (`pdep`) varint encoding that's mostly branchless. It's 2-5x faster than the current loop-unrolled version.

Because it is mostly branchless, its micro-benchmark runtime varies less than that of the loop-unrolled version:
- the loop-unrolled versions are slowest when encoding random numbers across the entire 64-bit range (some of them likely large), which is where branch prediction fails most often.
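
For contrast, here is a minimal byte-at-a-time encoder. It is only a sketch of the general looping technique, not the loop-unrolled code that this commit replaces, and the name `encodeVarintLoop` is hypothetical. The loop condition depends on the magnitude of the value, so a stream of random 64-bit inputs makes its branch hard to predict.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical baseline, not the thrift implementation.
// out must have room for 10 bytes; returns the number of bytes written.
inline size_t encodeVarintLoop(uint64_t value, uint8_t* out) {
  size_t n = 0;
  while (value >= 0x80) { // data-dependent branch, once per output byte
    out[n++] = static_cast<uint8_t>((value & 0x7f) | 0x80);
    value >>= 7;
  }
  out[n++] = static_cast<uint8_t>(value); // last byte has no continuation bit
  return n;
}
```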

Kept the fast path for values up to 127 (encoded in 1 byte), which are likely to be frequent. I couldn't find a fully branchless version that performed better anyway.

TLDR:
- `u8`: unroll the two possible cases (1B and 2B encodings). This was faster in micro-benchmarks than the branchless versions I tried, which needed more instructions to produce the same value without branches.
- `u16` & `u32`:
-- u16 encodes in up to 3B, u32 in up to 5B.
-- Use `pdep` to encode into a u64 (8 bytes). Write all 8 bytes to `QueueAppender`, but advance its size only by the bytes that had to be written. This is faster than appending a buffer of bytes using &u64 and size.
-- u16 could be encoded with `_pdep_u32` (3 bytes max fit in a u32) and smaller 16B lookup tables. In micro-benchmarks that's not faster than reusing the u32 code, which uses `_pdep_u64`. In production the shared code should perform better because it uses the same lookup tables as the u32 and u64 versions (less d-cache pressure).
- `u64`: needs up to 10B. Use `pdep` to encode the first 8B and unconditionally write the last 2B too (but keep track of the `QueueAppender` size properly).
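
As a rough illustration of the `pdep` approach summarized above, the sketch below spreads 7-bit groups into bytes with one `pdep`, ORs in the continuation bits, stores a fixed-width word, and returns how many bytes count. It is not the code from this commit. The names `pdep64`, `varintSize`, `encodeVarint32Sketch` and `encodeVarint64Sketch` are hypothetical. The byte count is computed with `__builtin_clzll` (GCC/Clang) rather than the lookup tables the summary mentions. A little-endian host is assumed, and a software loop stands in for `_pdep_u64` when BMI2 is not enabled at compile time.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

// Deposit the low bits of src into the set-bit positions of mask.
inline uint64_t pdep64(uint64_t src, uint64_t mask) {
#if defined(__BMI2__) && defined(__x86_64__)
  return _pdep_u64(src, mask);
#else
  // Portable fallback (assumption: only here so the sketch runs anywhere).
  uint64_t result = 0;
  for (uint64_t bit = 1; mask != 0; bit <<= 1) {
    if (src & bit) {
      result |= mask & (~mask + 1); // lowest set bit of mask
    }
    mask &= mask - 1; // clear that bit
  }
  return result;
#endif
}

constexpr uint64_t kPayloadMask = 0x7f7f7f7f7f7f7f7fULL;      // low 7 bits of each byte
constexpr uint64_t kContinuationBits = 0x8080808080808080ULL; // top bit of each byte

// Number of varint bytes for value >= 0x80 (result is 2..10).
inline size_t varintSize(uint64_t value) {
  size_t bits = 64 - static_cast<size_t>(__builtin_clzll(value));
  return (bits + 6) / 7;
}

// 32-bit sketch: out must have room for 8 bytes; returns the 1..5 that count.
inline size_t encodeVarint32Sketch(uint32_t value, uint8_t* out) {
  if (value < 0x80) { // fast path: single byte
    out[0] = static_cast<uint8_t>(value);
    return 1;
  }
  size_t n = varintSize(value);
  uint64_t word = pdep64(value, kPayloadMask);
  word |= kContinuationBits >> (64 - 8 * (n - 1)); // set on bytes 0..n-2
  std::memcpy(out, &word, sizeof(word));           // little-endian host assumed
  return n;
}

// 64-bit sketch: out must have room for 10 bytes; returns the 1..10 that count.
inline size_t encodeVarint64Sketch(uint64_t value, uint8_t* out) {
  if (value < 0x80) {
    out[0] = static_cast<uint8_t>(value);
    return 1;
  }
  size_t n = varintSize(value);
  size_t lowCont = (n - 1 < 8) ? n - 1 : 8;    // continuation bytes inside the word
  uint64_t word = pdep64(value, kPayloadMask); // consumes the low 56 bits
  word |= kContinuationBits >> (64 - 8 * lowCont);
  std::memcpy(out, &word, sizeof(word));
  uint64_t high = value >> 56; // the remaining 8 bits
  out[8] = static_cast<uint8_t>((high & 0x7f) | (n == 10 ? 0x80 : 0));
  out[9] = static_cast<uint8_t>(high >> 7); // written unconditionally
  return n;
}
```

In both functions the caller would advance its output position by the returned count only, so the surplus bytes are overwritten by the next write.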

Reviewed By: vitaut

Differential Revision: D29250074

fbshipit-source-id: 1f6a266f45248fcbea30a62ed347564589cb3348
luciang authored and facebook-github-bot committed Jul 22, 2021
1 parent dd7d175 commit 4baba28
Showing 1 changed file with 12 additions and 7 deletions.
19 changes: 12 additions & 7 deletions folly/io/Cursor.h
@@ -811,10 +811,12 @@ template <class Derived>
 class Writable {
  public:
   template <class T>
-  typename std::enable_if<std::is_arithmetic<T>::value>::type write(T value) {
+  typename std::enable_if<std::is_arithmetic<T>::value>::type write(
+      T value, size_t n = sizeof(T)) {
+    assert(n <= sizeof(T));
     const uint8_t* u8 = reinterpret_cast<const uint8_t*>(&value);
     Derived* d = static_cast<Derived*>(this);
-    d->push(u8, sizeof(T));
+    d->push(u8, n);
   }

   template <class T>
@@ -1201,13 +1203,15 @@ class QueueAppender : public detail::Writable<QueueAppender> {
   }

   template <class T>
-  typename std::enable_if<std::is_arithmetic<T>::value>::type write(T value) {
+  typename std::enable_if<std::is_arithmetic<T>::value>::type write(
+      T value, size_t n = sizeof(T)) {
     // We can't fail.
+    assert(n <= sizeof(T));
     if (length() >= sizeof(T)) {
       storeUnaligned(queueCache_.writableData(), value);
-      queueCache_.appendUnsafe(sizeof(T));
+      queueCache_.appendUnsafe(n);
     } else {
-      writeSlow<T>(value);
+      writeSlow<T>(value, n);
     }
   }

@@ -1259,12 +1263,13 @@ class QueueAppender : public detail::Writable<QueueAppender> {

   template <class T>
   typename std::enable_if<std::is_arithmetic<T>::value>::type FOLLY_NOINLINE
-  writeSlow(T value) {
+  writeSlow(T value, size_t n = sizeof(T)) {
+    assert(n <= sizeof(T));
     queueCache_.queue()->preallocate(sizeof(T), growth_);
     queueCache_.fillCache();

     storeUnaligned(queueCache_.writableData(), value);
-    queueCache_.appendUnsafe(sizeof(T));
+    queueCache_.appendUnsafe(n);
   }
 };
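
The effect of the new `n` parameter can be illustrated with a small stand-in class. `ToyAppender` below is hypothetical and only mimics the shape of `write(T value, size_t n)` from the diff: it stores all `sizeof(T)` bytes but advances by `n`, so the next write overwrites the unused tail. It uses a fixed buffer and `memcpy` in place of folly's `storeUnaligned` and queue cache, and it assumes a little-endian host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for an appender; not folly's QueueAppender.
class ToyAppender {
 public:
  explicit ToyAppender(size_t capacity) : buf_(capacity), size_(0) {}

  template <class T>
  typename std::enable_if<std::is_arithmetic<T>::value>::type write(
      T value, size_t n = sizeof(T)) {
    assert(n <= sizeof(T));
    // The full-width store must fit even when only n bytes are kept.
    assert(buf_.size() - size_ >= sizeof(T));
    std::memcpy(buf_.data() + size_, &value, sizeof(T));
    size_ += n; // advance by the meaningful bytes only
  }

  size_t size() const { return size_; }
  const uint8_t* data() const { return buf_.data(); }

 private:
  std::vector<uint8_t> buf_;
  size_t size_;
};
```

For example, `write<uint64_t>(0x0180, 2)` stores eight bytes but keeps only `0x80 0x01`, and a following `write<uint8_t>(0x05)` lands at offset 2.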
