Reduce overhead from computing modular reduction parameters #4588

Merged: randombit merged 4 commits into master from jack/faster-redc-setup on Jan 28, 2025

Conversation

randombit (Owner) commented:

This was inspired by an OpenSSL bug: apparently there is (or was) something wrong with their PKCS8 decoders in OpenSSL 3, and parsing private keys took a very long time. I wrote a test to check how Botan behaved; happily the PKCS8 decoding itself seems to be fine, but setting up the Montgomery arithmetic took a very long time. Specifically, computing the Montgomery parameters requires setting up a Barrett reduction, which in turn required a constant time division. That constant time division consumed well over 90% of the total runtime when parsing a 4096 bit RSA key in a loop.
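
As context for why this setup is expensive, here is a minimal sketch of the Barrett setup division using Botan's public BigInt API (the helper name and the hardcoded word size are illustrative, not the library's internal code): the precomputed constant is mu = floor(2^(2*k*w) / m), i.e. one large division per modulus, and for a secret modulus that division has to run in constant time.

#include <botan/bigint.h>

// Sketch only: compute a Barrett "mu" constant for modulus m.
Botan::BigInt barrett_mu_sketch(const Botan::BigInt& m) {
   const size_t mod_words = m.sig_words();
   const size_t word_bits = 64;  // stands in for BOTAN_MP_WORD_BITS in this sketch
   // One big division per modulus; doing this in constant time is the
   // step that dominated the RSA key parsing profile described above.
   return Botan::BigInt::power_of_2(2 * mod_words * word_bits) / m;
}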

First optimize the initial Barrett setup by adding a specialized implementation for 2^k / m. I had hoped to find a specific algorithm - seems like something should be possible here - but Google is useless. But it's anyway a bit faster because we know only 1 bit is set and we can assume k is public.
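
To make the idea concrete, here is a hedged sketch of the kind of specialization meant, as plain C++ on a toy word-sized example (not the Botan implementation): schoolbook restoring division where the dividend 2^k contributes a single 1 bit followed by k zero bits, and k is treated as public. It assumes the quotient fits in 64 bits and y < 2^63.

#include <cstddef>
#include <cstdint>

// Toy sketch: floor(2^k / y) for y > 0. Only the top bit of the dividend is
// set, so every later step just shifts a zero into the remainder.
uint64_t div_pow2k_sketch(size_t k, uint64_t y) {
   uint64_t q = 0;
   uint64_t r = 1;  // the single set bit of 2^k enters the remainder first
   if(r >= y) {     // only possible when y == 1
      r -= y;
      q = 1;
   }
   for(size_t i = 0; i != k; ++i) {  // loop bound depends only on k, which is public
      r <<= 1;
      q <<= 1;
      if(r >= y) {
         r -= y;
         q |= 1;
      }
   }
   return q;
}

In the real multi-word code the same loop would run over BigInt words, and a constant time variant would replace the r >= y branch with a masked, branch-free conditional subtraction.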

The second commit follows up by distinguishing between public and secret moduli for Barrett setup, much as #4569 did for modular inverses. In the public modulus case we can use the much faster variable time division. This commit also puts some effort into sharing Barrett constants once they have been computed, e.g. by making them available from DL_Group and caching them in the RSA key data.
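
As a hedged sketch of how a caller-facing split like this can look (the factory names and header path here are assumptions for illustration and may not match the final Botan API): the caller states whether the modulus is public, and the setup division is chosen accordingly.

#include <botan/bigint.h>
#include <botan/reducer.h>

// Illustrative only: the for_public_modulus / for_secret_modulus names are
// assumed for this sketch, not confirmed API.
void setup_reducers_sketch(const Botan::BigInt& group_p, const Botan::BigInt& rsa_p) {
   // Public modulus (e.g. a DL group prime): variable time setup is fine.
   auto mod_group_p = Botan::Modular_Reducer::for_public_modulus(group_p);

   // Secret modulus (e.g. an RSA prime): setup must stay constant time.
   auto mod_rsa_p = Botan::Modular_Reducer::for_secret_modulus(rsa_p);

   // Either way, the resulting object can be cached (e.g. in DL_Group or the
   // RSA key data) and reused rather than recomputed per operation.
   (void)mod_group_p;
   (void)mod_rsa_p;
}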

Overall this reduces the cost of repeatedly parsing an RSA 4096 bit private key to about 1/4 of the current cost on master.
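
For reference, a sketch of the kind of measurement loop described, using Botan's public PKCS8 API for unencrypted keys (the key file name and iteration count are made up):

#include <botan/data_src.h>
#include <botan/pkcs8.h>

#include <chrono>
#include <cstddef>
#include <iostream>

int main() {
   const size_t iterations = 1000;
   const auto start = std::chrono::steady_clock::now();

   for(size_t i = 0; i != iterations; ++i) {
      // Re-parse the same 4096-bit RSA key each iteration; with the old code
      // the constant time division in Barrett setup dominates this loop.
      Botan::DataSource_Stream in("rsa4096_private.pem");  // hypothetical key file
      auto key = Botan::PKCS8::load_key(in);
   }

   const auto end = std::chrono::steady_clock::now();
   const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
   std::cout << iterations << " parses took " << ms << " ms\n";
   return 0;
}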

@randombit randombit requested a review from reneme on January 23, 2025 23:02
@randombit randombit added this to the Botan 3.7.0 milestone on Jan 23, 2025
@randombit randombit force-pushed the jack/faster-redc-setup branch from aa41ebf to d6c0271 on January 23, 2025 23:04

@coveralls coveralls commented Jan 23, 2025

Coverage Status: 91.261% (+0.01%) from 91.248%, when pulling e67f697 on jack/faster-redc-setup into 19ed81b on master.

@randombit randombit force-pushed the jack/faster-redc-setup branch 2 times, most recently from 0a195c8 to 3c17c73 on January 26, 2025 16:23
@reneme reneme (Collaborator) left a comment:

Logic looks good to me. While we're on the topic of reducing overhead, I suggested plenty of places that could benefit from pass-by-value and std::move(). Perhaps this could be pushed into a follow-up PR, though. Your call... :)

src/lib/pubkey/blinding/blinding.cpp

@@ -10,10 +10,11 @@
namespace Botan {

Blinder::Blinder(const BigInt& modulus,
                 const Modular_Reducer& reducer,
reneme (Collaborator): If possible, take it by value and move it into m_reducer.

randombit (Owner, author): Here we already have a (somewhat undocumented) lifetime restriction on the Blinder: it captures a reference to the RandomNumberGenerator& and expects it to live as long as the Blinder itself does. We can do the same for the Modular_Reducer&.
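
To illustrate the contract being described, a standalone sketch with stand-in types (not Botan's actual class layout): both objects are held by reference, so the caller must keep them alive for as long as the Blinder exists.

// Stand-ins for Botan's Modular_Reducer and RandomNumberGenerator.
struct Reducer {};
struct RNG {};

class Blinder_Sketch {
   public:
      // Takes references only; no copies are made and no ownership is taken.
      Blinder_Sketch(const Reducer& reducer, RNG& rng) : m_reducer(reducer), m_rng(rng) {}

   private:
      const Reducer& m_reducer;  // caller must keep the reducer alive
      RNG& m_rng;                // same pre-existing contract for the RNG
};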

BigInt t = BigInt::with_capacity(y_words + 1); // a temporary

r.set_bit(y_bits - 1);
for(size_t i = y_bits; i != x_bits; ++i) {
reneme (Collaborator): Perhaps it makes sense to explicitly BOTAN_ASSERT that y_bits <= x_bits, to ensure that this doesn't run into an endless loop.

randombit (Owner, author): Yes, it does.

randombit (Owner, author): Actually, we are allowed to (indeed must) leak the size of the values to side channels, so we can just check the bit lengths and correctly return zero if 2^k < y.
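
A hedged sketch of that early exit (the function name and structure are illustrative, not the actual Botan code; the final division is a placeholder for the specialized routine discussed above): operand sizes are treated as public, so the bit-length comparison can be done in variable time.

#include <botan/bigint.h>

// Returns floor(2^k / y), short-circuiting to zero when 2^k < y.
Botan::BigInt divide_pow2k_or_zero_sketch(size_t k, const Botan::BigInt& y) {
   const size_t x_bits = k + 1;        // 2^k has exactly k + 1 bits
   if(x_bits < y.bits()) {
      return Botan::BigInt::zero();    // quotient is zero; no loop needed
   }
   // Otherwise run the bit-at-a-time division over the remaining bits.
   return Botan::BigInt::power_of_2(k) / y;
}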

@@ -94,7 +127,7 @@ void ct_divide_word(const BigInt& x, word y, BigInt& q_out, word& r_out) {

   const auto r_carry = CT::Mask<word>::expand_top_bit(r);

-   r *= 2;
+   r <<= 1;
reneme (Collaborator): Does it make sense to do this explicitly? I would have expected the compiler to do that anyway.

randombit (Owner, author): I was just following the same pattern as elsewhere, for consistency.

Also, in this context, since we are accumulating into r bit-by-bit as the loop progresses, a shift is more logical than a multiplication.

src/lib/math/numbertheory/reducer.cpp (outdated comment, resolved)

private:
Modular_Reducer(const BigInt& m, const BigInt& mu, size_t mw) : m_modulus(m), m_mu(mu), m_mod_words(mw) {}
reneme (Collaborator): Suggested change:

-   Modular_Reducer(const BigInt& m, const BigInt& mu, size_t mw) : m_modulus(m), m_mu(mu), m_mod_words(mw) {}
+   Modular_Reducer(BigInt m, BigInt mu, size_t mw) : m_modulus(std::move(m)), m_mu(std::move(mu)), m_mod_words(mw) {}

src/lib/pubkey/blinding/blinding.cpp (outdated comment, resolved)
Comment on lines +42 to +44
DL_Group_Data(const BigInt& p, const BigInt& g, DL_Group_Source source) :
m_p(p),
m_g(g),
reneme (Collaborator): Suggested change:

-   DL_Group_Data(const BigInt& p, const BigInt& g, DL_Group_Source source) :
-         m_p(p),
-         m_g(g),
+   DL_Group_Data(BigInt p, BigInt g, DL_Group_Source source) :
+         m_p(std::move(p)),
+         m_g(std::move(g)),

... also in the pre-existing constructor above.

Comment on lines +136 to +146
BigInt p, q, g;
ber.decode(p).decode(q).decode(g).verify_end();
return std::make_shared<DL_Group_Data>(p, q, g, source);
} else if(format == DL_Group_Format::ANSI_X9_42) {
BigInt p, g, q;
ber.decode(p).decode(g).decode(q).discard_remaining();
return std::make_shared<DL_Group_Data>(p, q, g, source);
} else if(format == DL_Group_Format::PKCS_3) {
// q is left as zero
BigInt p, g;
ber.decode(p).decode(g).discard_remaining();
return std::make_shared<DL_Group_Data>(p, g, source);
reneme (Collaborator): Ugly, but it is parsing code, after all...

if(format == DL_Group_Format::ANSI_X9_57) {
   BigInt p, q, g;
   ber.decode(p).decode(q).decode(g).verify_end();
   return std::make_shared<DL_Group_Data>(std::move(p), std::move(q), std::move(g), source);
} else if(format == DL_Group_Format::ANSI_X9_42) {
   BigInt p, g, q;
   ber.decode(p).decode(g).decode(q).discard_remaining();
   return std::make_shared<DL_Group_Data>(std::move(p), std::move(q), std::move(g), source);
} else if(format == DL_Group_Format::PKCS_3) {
   BigInt p, g;
   ber.decode(p).decode(g).discard_remaining();
   return std::make_shared<DL_Group_Data>(std::move(p), std::move(g), source);
}

randombit (Owner, author): These and the others seem accurate, but IMO they are not going to move the needle much, since the heavier setup of the Montgomery parameters etc. will dominate. Fine to do, but it's a little afield from the goal of this PR, so I'll pass for now.

return std::make_shared<DL_Group_Data>(p, g, DL_Group_Source::Builtin);
} else {
return std::make_shared<DL_Group_Data>(p, q, g, DL_Group_Source::Builtin);
}
}

//static
reneme (Collaborator): While we're here:

std::shared_ptr<DL_Group_Data> DL_Group::load_DL_group_info(const char* p_str, const char* g_str) {
   auto p = BigInt(p_str);
   auto q = (p - 1) / 2;
   auto g = BigInt(g_str);

   return std::make_shared<DL_Group_Data>(std::move(p), std::move(q), std::move(g), DL_Group_Source::Builtin);
}

Comment on lines 340 to 353
DL_Group::DL_Group(const BigInt& p, const BigInt& g) {
-   m_data = std::make_shared<DL_Group_Data>(p, BigInt::zero(), g, DL_Group_Source::ExternalSource);
+   m_data = std::make_shared<DL_Group_Data>(p, g, DL_Group_Source::ExternalSource);
}

/*
* DL_Group Constructor
*/
DL_Group::DL_Group(const BigInt& p, const BigInt& q, const BigInt& g) {
-   m_data = std::make_shared<DL_Group_Data>(p, q, g, DL_Group_Source::ExternalSource);
+   if(q.is_zero()) {
+      m_data = std::make_shared<DL_Group_Data>(p, g, DL_Group_Source::ExternalSource);
+   } else {
+      m_data = std::make_shared<DL_Group_Data>(p, q, g, DL_Group_Source::ExternalSource);
+   }
}
@reneme reneme (Collaborator), Jan 27, 2025

We can do somewhat better knowing the input is a power of 2.

We can use a much faster variable time division if the modulus is public already, which is the common case.

This also (mostly) eliminates the situation where Modular_Reducer can be uninitialized via passing zero to the constructor; this is a holdover from lacking std::optional.

Also make some changes so Barrett computations can be shared over time, for example by exposing it from DL_Group and accepting it as an argument to Blinder.
@randombit randombit force-pushed the jack/faster-redc-setup branch from 3c17c73 to e67f697 on January 28, 2025 12:00
@randombit randombit merged commit af60450 into master on Jan 28, 2025 (39 checks passed)
@randombit randombit deleted the jack/faster-redc-setup branch on January 28, 2025 13:04