Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More optimizations #101

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 22 additions & 9 deletions bddisasm/bdx86_decoder.c
Original file line number Diff line number Diff line change
Expand Up @@ -3813,22 +3813,27 @@ NdGetEffectiveAddrAndOpMode(
static const ND_UINT8 szLut[3] = { ND_SIZE_16BIT, ND_SIZE_32BIT, ND_SIZE_64BIT };
ND_BOOL w64, f64, d64, has66;

if ((ND_CODE_64 != Instrux->DefCode) && !!(Instrux->Attributes & ND_FLAG_IWO64))
// Branchless form of (ND_CODE_64 != Instrux->DefCode) && !!(Instrux->Attributes & ND_FLAG_IWO64)
if (((ND_CODE_64 ^ Instrux->DefCode) * (Instrux->Attributes & ND_FLAG_IWO64)) != 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the branch predictors in modern CPUs to be quite competent at predicting the branches here, since they operate on INSTRUX state that will very often be the same, allowing for a very high prediction rate.
I understand that these changes may remove some conditional moves or branches, but is it really worth it? At some point, one has to decide whether the (small) increase in performance is worth the less readable code. I don't think this is the case here. In addition, depending on the architecture, the multiplication may be a new source of overhead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiplication pipelines easier than branches. I wasn't entirely happy with this commit either but it did make it faster.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler may even ditch the multiplication altogether, and rely on conditional instructions (such as CMOVcc or SETcc), but this is not the point. The point is that it shouldn't really matter, since modern branch prediction should easily handle these branches with minimal overhead. These statements are not a performance bottleneck.

{
// Some instructions ignore VEX/EVEX.W field outside 64 bit mode, and treat it as 0.
Instrux->Exs.w = 0;
}

// Extract the flags.
w64 = (0 != Instrux->Exs.w) && !(Instrux->Attributes & ND_FLAG_WIG);
// Branchless form of (0 != Instrux->Exs.w) && !(Instrux->Attributes & ND_FLAG_WIG)
w64 = (0 != Instrux->Exs.w) * !(Instrux->Attributes & ND_FLAG_WIG);

// In 64 bit mode, the operand is forced to 64 bit. Size-changing prefixes are ignored.
f64 = 0 != (Instrux->Attributes & ND_FLAG_F64) && (ND_VEND_AMD != Instrux->VendMode);
// Branchless form of 0 != (Instrux->Attributes & ND_FLAG_F64) && (ND_VEND_AMD != Instrux->VendMode)
f64 = 0 != ((Instrux->Attributes & ND_FLAG_F64) * (ND_VEND_AMD ^ Instrux->VendMode));

// In 64 bit mode, the operand defaults to 64 bit. No 32 bit form of the instruction exists. Note that on AMD,
// only default 64 bit operands exist, even for branches - no operand is forced to 64 bit.
d64 = (0 != (Instrux->Attributes & ND_FLAG_D64)) ||
(0 != (Instrux->Attributes & ND_FLAG_F64) && (ND_VEND_AMD == Instrux->VendMode));
// Branchless form of (0 != (Instrux->Attributes & ND_FLAG_D64)) ||
// (0 != (Instrux->Attributes & ND_FLAG_F64) && (ND_VEND_AMD == Instrux->VendMode));
d64 = (Instrux->Attributes & ND_FLAG_D64) +
((Instrux->Attributes & ND_FLAG_F64) * (ND_VEND_AMD == Instrux->VendMode));

// Check if 0x66 is indeed interpreted as a size changing prefix. Note that if 0x66 is a mandatory prefix,
// then it won't be interpreted as a size changing prefix. However, there is an exception: MOVBE and CRC32
Expand Down Expand Up @@ -4427,11 +4432,19 @@ NdDecodeWithContext(
Instrux->EncMode = ND_ENCM_LEGACY; // Assume legacy encoding by default.

// Fetch the instruction bytes.
for (opIndex = 0;
opIndex < ((Size < ND_MAX_INSTRUCTION_LENGTH) ? Size : ND_MAX_INSTRUCTION_LENGTH);
opIndex++)
if (Size < ND_MAX_INSTRUCTION_LENGTH)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just moves the condition from the for outside. Modern compilers should be able to deal with it as it is (for example, MSVC unrolls the entire loop, so introducing this if actually makes things worse).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very likely GCC doing something stupid. I'll test clang too and see what happens.

{
Instrux->InstructionBytes[opIndex] = Code[opIndex];
for (opIndex = 0; opIndex < Size; opIndex++)
{
Instrux->InstructionBytes[opIndex] = Code[opIndex];
}
}
else
{
for (opIndex = 0; opIndex < ND_MAX_INSTRUCTION_LENGTH; opIndex++)
{
Instrux->InstructionBytes[opIndex] = Code[opIndex];
}
}

if (gPrefixesMap[Instrux->InstructionBytes[0]] != ND_PREF_CODE_NONE)
Expand Down