Reduce overhead from computing modular reduction parameters #4588
Conversation
Force-pushed from aa41ebf to d6c0271
Force-pushed from 0a195c8 to 3c17c73
Logic looks good to me. While we're on the topic of reducing overhead, I suggested plenty of places that could benefit from pass-by-value and std::move(). Perhaps this could be pushed into a follow-up PR, though. Your call... :)
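For reference, the pass-by-value-and-move pattern being suggested looks like this in miniature. The class and member names here are illustrative only, not Botan's actual API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of the suggested pattern: take the parameter by value, then move it
// into the member. Callers passing an rvalue pay one move; callers passing an
// lvalue pay exactly one copy, made at the call site.
class Params {
   public:
      explicit Params(std::vector<uint8_t> modulus_bytes) :
            m_modulus_bytes(std::move(modulus_bytes)) {}

      size_t size() const { return m_modulus_bytes.size(); }

   private:
      std::vector<uint8_t> m_modulus_bytes;
};
```

Compared with a `const&` parameter plus a copy in the member initializer, this never costs more and saves a full copy whenever the caller can hand over a temporary.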
src/lib/pubkey/blinding.cpp
Outdated
```diff
@@ -10,10 +10,11 @@
 namespace Botan {

 Blinder::Blinder(const BigInt& modulus,
                  const Modular_Reducer& reducer,
```
If possible, take it by value and move it into m_reducer.
Here we already have a (somewhat undocumented) lifetime restriction on the Blinder - it captures a reference to the RandomNumberGenerator& and expects it to live as long as the Blinder itself does. We can do the same for the Modular_Reducer&.
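The lifetime contract in question can be shown with a toy sketch. None of these names are Botan's; this is just the store-a-reference pattern, where nothing at compile time enforces that the referenced object outlives the holder:

```cpp
#include <cassert>
#include <cstdint>

// A stand-in for the long-lived dependency (e.g. an RNG)
class Counter {
   public:
      uint32_t next() { return m_state++; }

   private:
      uint32_t m_state = 0;
};

// Stores a reference: the Counter passed in must outlive this object, just
// as Blinder already requires of its RandomNumberGenerator&. This is purely
// a documented contract; violating it leaves a dangling reference.
class UsesCounter {
   public:
      explicit UsesCounter(Counter& c) : m_counter(c) {}

      uint32_t blind() { return m_counter.next(); }

   private:
      Counter& m_counter;  // dangles if the Counter is destroyed first
};
```

The upside is zero copies and shared state; the downside is exactly this implicit lifetime rule, which is why documenting it matters.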
src/lib/math/bigint/divide.cpp
Outdated
```cpp
   BigInt t = BigInt::with_capacity(y_words + 1);  // a temporary

   r.set_bit(y_bits - 1);
   for(size_t i = y_bits; i != x_bits; ++i) {
```
Perhaps it makes sense to explicitly BOTAN_ASSERT that y_bits <= x_bits, to ensure that this isn't running into an endless loop.
Yes it does
Actually we are allowed to (indeed must) leak the size of the values to side channels, so we can just check the bit lengths and correctly return zero if 2^k < y.
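A standalone sketch of the specialized 2^k / y division under discussion, using fixed 64-bit integers rather than Botan's BigInt (illustrative only, not the PR's code). Since the dividend has exactly one set bit and k is public, plain restoring division suffices, and the loop bound also makes the "2^k < y returns zero" case fall out naturally instead of looping forever:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// floor(2^k / y) by restoring division. The dividend 2^k contributes a 1 bit
// only at position k, so each step shifts the remainder left (accumulating it
// bit-by-bit) and subtracts y whenever it fits. If k+1 < bit_length(y) then
// 2^k < y and no subtraction ever fires, so the quotient is 0.
// Sketch assumptions: y != 0, y < 2^63, and the quotient fits in 64 bits.
uint64_t div_pow2(size_t k, uint64_t y) {
   uint64_t q = 0;
   uint64_t r = 0;
   for(size_t i = k + 1; i-- > 0;) {
      r = (r << 1) | ((i == k) ? 1 : 0);  // bit i of 2^k is 1 only when i == k
      q <<= 1;
      if(r >= y) {
         r -= y;
         q |= 1;
      }
   }
   return q;  // the final r is the remainder 2^k mod y
}
```

In a BigInt version the first bit_length(y) iterations can be skipped entirely by starting from r = 2^(y_bits - 1), which is the shortcut the diff above takes with r.set_bit(y_bits - 1).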
```diff
@@ -94,7 +127,7 @@ void ct_divide_word(const BigInt& x, word y, BigInt& q_out, word& r_out) {

    const auto r_carry = CT::Mask<word>::expand_top_bit(r);

-   r *= 2;
+   r <<= 1;
```
Does it make sense to do this explicitly? I would have expected the compilers to do that anyway?
I was just following the same pattern as elsewhere for consistency. Also in this context - since we are accumulating into r bit-by-bit as the loop progresses - a shift is more logical than a multiplication.
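The shift-then-masked-subtract shape being discussed, in a standalone sketch. This mirrors the structure of ct_divide_word but is not Botan's code; in particular, for clarity the r >= y comparison here is written plainly, whereas real constant-time code would derive that mask without a branch as well:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bit-serial word division: compute floor(x/y) and x%y. The remainder r is
// accumulated bit-by-bit, so a left shift is the natural operation; r_carry
// catches the case where the shift pushes the true remainder past 2^64
// (possible only when y > 2^63). Requires y != 0.
void div_word(uint64_t x, uint64_t y, uint64_t& q_out, uint64_t& r_out) {
   uint64_t q = 0;
   uint64_t r = 0;
   for(size_t i = 64; i-- > 0;) {
      // all-ones mask if the top bit of r is about to be shifted out
      const uint64_t r_carry = uint64_t(0) - (r >> 63);
      r = (r << 1) | ((x >> i) & 1);  // shift, not multiply: building r bit-by-bit
      // subtract y if r >= y, or if the shift overflowed (true r >= 2^64 > y)
      const uint64_t sub_mask = r_carry | (uint64_t(0) - uint64_t(r >= y));
      q = (q << 1) | (sub_mask & 1);
      r -= (y & sub_mask);
   }
   q_out = q;
   r_out = r;
}
```

Writing the doubling as a shift also documents intent: the loop is sliding the next dividend bit into r, not scaling a numeric value.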
src/lib/math/numbertheory/reducer.h
Outdated
```cpp
   private:
      Modular_Reducer(const BigInt& m, const BigInt& mu, size_t mw) : m_modulus(m), m_mu(mu), m_mod_words(mw) {}
```
Suggested change:

```diff
-Modular_Reducer(const BigInt& m, const BigInt& mu, size_t mw) : m_modulus(m), m_mu(mu), m_mod_words(mw) {}
+Modular_Reducer(BigInt m, BigInt mu, size_t mw) : m_modulus(std::move(m)), m_mu(std::move(mu)), m_mod_words(mw) {}
```
```cpp
   DL_Group_Data(const BigInt& p, const BigInt& g, DL_Group_Source source) :
         m_p(p),
         m_g(g),
```
Suggested change:

```diff
-DL_Group_Data(const BigInt& p, const BigInt& g, DL_Group_Source source) :
-      m_p(p),
-      m_g(g),
+DL_Group_Data(BigInt p, BigInt g, DL_Group_Source source) :
+      m_p(std::move(p)),
+      m_g(std::move(g)),
```
... also in the pre-existing constructor above.
```cpp
      BigInt p, q, g;
      ber.decode(p).decode(q).decode(g).verify_end();
      return std::make_shared<DL_Group_Data>(p, q, g, source);
   } else if(format == DL_Group_Format::ANSI_X9_42) {
      BigInt p, g, q;
      ber.decode(p).decode(g).decode(q).discard_remaining();
      return std::make_shared<DL_Group_Data>(p, q, g, source);
   } else if(format == DL_Group_Format::PKCS_3) {
      // q is left as zero
      BigInt p, g;
      ber.decode(p).decode(g).discard_remaining();
      return std::make_shared<DL_Group_Data>(p, g, source);
```
Ugly, but it is parsing code, after all...
```cpp
if(format == DL_Group_Format::ANSI_X9_57) {
   BigInt p, q, g;
   ber.decode(p).decode(q).decode(g).verify_end();
   return std::make_shared<DL_Group_Data>(std::move(p), std::move(q), std::move(g), source);
} else if(format == DL_Group_Format::ANSI_X9_42) {
   BigInt p, g, q;
   ber.decode(p).decode(g).decode(q).discard_remaining();
   return std::make_shared<DL_Group_Data>(std::move(p), std::move(q), std::move(g), source);
} else if(format == DL_Group_Format::PKCS_3) {
   BigInt p, g;
   ber.decode(p).decode(g).discard_remaining();
   return std::make_shared<DL_Group_Data>(std::move(p), std::move(g), source);
}
```
These and others seem accurate, but IMO not going to move the needle much, since we have heavier setup of Montgomery parameters etc. which will dominate. Fine to do it, but a little afield from the goal of this PR, so I'll pass for now.
```cpp
      return std::make_shared<DL_Group_Data>(p, g, DL_Group_Source::Builtin);
   } else {
      return std::make_shared<DL_Group_Data>(p, q, g, DL_Group_Source::Builtin);
   }
}

//static
```
While we're here:
```cpp
std::shared_ptr<DL_Group_Data> DL_Group::load_DL_group_info(const char* p_str, const char* g_str) {
   auto p = BigInt(p_str);
   auto q = (p - 1) / 2;
   auto g = BigInt(g_str);
   return std::make_shared<DL_Group_Data>(std::move(p), std::move(q), std::move(g), DL_Group_Source::Builtin);
}
```
```diff
 DL_Group::DL_Group(const BigInt& p, const BigInt& g) {
-   m_data = std::make_shared<DL_Group_Data>(p, BigInt::zero(), g, DL_Group_Source::ExternalSource);
+   m_data = std::make_shared<DL_Group_Data>(p, g, DL_Group_Source::ExternalSource);
 }

 /*
 * DL_Group Constructor
 */
 DL_Group::DL_Group(const BigInt& p, const BigInt& q, const BigInt& g) {
-   m_data = std::make_shared<DL_Group_Data>(p, q, g, DL_Group_Source::ExternalSource);
+   if(q.is_zero()) {
+      m_data = std::make_shared<DL_Group_Data>(p, g, DL_Group_Source::ExternalSource);
+   } else {
+      m_data = std::make_shared<DL_Group_Data>(p, q, g, DL_Group_Source::ExternalSource);
+   }
 }
```
We can do somewhat better knowing the input is a power of 2.
We can use a much faster variable-time division if the modulus is already public, which is the common case. This also (mostly) eliminates the situation where Modular_Reducer can be uninitialized via passing zero to the constructor; that was a holdover from lacking std::optional. Also make some changes so Barrett computations can be shared over time, for example by exposing them from DL_Group and accepting them as an argument to Blinder.
Force-pushed from 3c17c73 to e67f697
This was inspired by an OpenSSL bug; apparently there is/was something wrong with their PKCS8 decoders in OpenSSL 3, and parsing private keys took a very long time. I wrote a test to check how Botan behaved and happily the PKCS8 decoding seems to be fine but it took a very long time to set up the Montgomery arithmetic. Specifically computing the Montgomery params requires setting up a Barrett reduction, which required a constant time division. The constant time division consumed well over 90% of the total runtime when parsing a 4096 bit RSA key in a loop.
First optimize the initial Barrett setup by adding a specialized implementation for 2^k / m. I had hoped to find a specific algorithm - seems like something should be possible here - but Google is useless. It's anyway a bit faster because we know only 1 bit is set and we can assume k is public.

The second commit follows up by distinguishing between public and secret moduli for Barrett, much as #4569 did for modular inverses. In the public modulus case we can use the much faster variable time division. This also puts some effort towards sharing Barrett constants once we've computed them, eg by making them available from DL_Group and caching them in the RSA key data.

Overall this reduces the cost of repeatedly parsing an RSA 4096 bit private key to about 1/4 of the current cost on master.
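For context, the setup/reduction split that makes Barrett constants worth caching looks like this in miniature. This is a sketch using GCC/Clang's unsigned __int128 with a modulus below 2^32, not Botan's Modular_Reducer:

```cpp
#include <cassert>
#include <cstdint>

using u128 = unsigned __int128;

// Miniature Barrett reduction. Setup computes mu = floor(2^(2k) / m) once,
// where k is the bit length of m - this is exactly the "2^k divided by m"
// division whose cost this PR attacks. Every later reduction of x < m^2 then
// needs only multiplies and shifts, which is why sharing the (m, mu) pair
// across operations pays off. Assumes 2 <= m < 2^32 so all intermediate
// products fit in 128 bits.
struct Barrett {
      uint64_t m;
      u128 mu;
      unsigned k;  // bit length of m

      explicit Barrett(uint64_t modulus) : m(modulus), mu(0), k(0) {
         for(uint64_t t = m; t != 0; t >>= 1) {
            ++k;
         }
         mu = (u128(1) << (2 * k)) / m;  // the one and only division
      }

      uint64_t reduce(u128 x) const {  // requires x < m^2
         const u128 q = (x * mu) >> (2 * k);  // underestimates x/m by at most 2
         u128 r = x - q * m;
         while(r >= m) {  // hence at most two corrective subtractions
            r -= m;
         }
         return uint64_t(r);
      }
   };
```

Constructing one Barrett object and reusing it across many reduce() calls is the same amortization the PR applies by exposing shared Barrett parameters from DL_Group and the RSA key data.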