Reduce overhead from computing modular reduction parameters #4588

Merged: randombit merged 4 commits into master from jack/faster-redc-setup on Jan 28, 2025

Conversation

randombit (Owner) commented:

This was inspired by an OpenSSL bug: apparently there is (or was) something wrong with their PKCS8 decoders in OpenSSL 3, and parsing private keys took a very long time. I wrote a test to check how Botan behaved; happily the PKCS8 decoding itself seems to be fine, but setting up the Montgomery arithmetic took a very long time. Specifically, computing the Montgomery parameters requires setting up a Barrett reduction, which in turn required a constant time division. That constant time division consumed well over 90% of the total runtime when parsing a 4096 bit RSA key in a loop.
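
As context for why this setup is expensive, here is a minimal sketch of the Barrett setup division using Botan's public BigInt API (the helper name and the hardcoded word size are illustrative, not the library's internal code): the precomputed constant is mu = floor(2^(2*k*w) / m), i.e. one large division per modulus, and for a secret modulus that division has to run in constant time.

#include <botan/bigint.h>

// Sketch only: compute a Barrett "mu" constant for modulus m.
Botan::BigInt barrett_mu_sketch(const Botan::BigInt& m) {
   const size_t mod_words = m.sig_words();
   const size_t word_bits = 64;  // stands in for BOTAN_MP_WORD_BITS in this sketch
   // One big division per modulus; doing this in constant time is the
   // step that dominated the RSA key parsing profile described above.
   return Botan::BigInt::power_of_2(2 * mod_words * word_bits) / m;
}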

First optimize the initial Barrett setup by adding a specialized implementation for 2^k / m. I had hoped to find a specific algorithm - seems like something should be possible here - but Google is useless. But it's anyway a bit faster because we know only 1 bit is set and we can assume k is public.
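
To make the idea concrete, here is a hedged sketch of the kind of specialization meant, as plain C++ on a toy word-sized example (not the Botan implementation): schoolbook restoring division where the dividend 2^k contributes a single 1 bit followed by k zero bits, and k is treated as public. It assumes the quotient fits in 64 bits and y < 2^63.

#include <cstddef>
#include <cstdint>

// Toy sketch: floor(2^k / y) for y > 0. Only the top bit of the dividend is
// set, so every later step just shifts a zero into the remainder.
uint64_t div_pow2k_sketch(size_t k, uint64_t y) {
   uint64_t q = 0;
   uint64_t r = 1;  // the single set bit of 2^k enters the remainder first
   if(r >= y) {     // only possible when y == 1
      r -= y;
      q = 1;
   }
   for(size_t i = 0; i != k; ++i) {  // loop bound depends only on k, which is public
      r <<= 1;
      q <<= 1;
      if(r >= y) {
         r -= y;
         q |= 1;
      }
   }
   return q;
}

In the real multi-word code the same loop would run over BigInt words, and a constant time variant would replace the r >= y branch with a masked, branch-free conditional subtraction.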

The second commit follows up by distinguishing between public and secret moduli for Barrett setup, much as #4569 did for modular inverses. In the public modulus case we can use the much faster variable time division. This commit also puts some effort into sharing Barrett constants once they have been computed, e.g. by making them available from DL_Group and caching them in the RSA key data.
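
As a hedged sketch of how a caller-facing split like this can look (the factory names and header path here are assumptions for illustration and may not match the final Botan API): the caller states whether the modulus is public, and the setup division is chosen accordingly.

#include <botan/bigint.h>
#include <botan/reducer.h>

// Illustrative only: the for_public_modulus / for_secret_modulus names are
// assumed for this sketch, not confirmed API.
void setup_reducers_sketch(const Botan::BigInt& group_p, const Botan::BigInt& rsa_p) {
   // Public modulus (e.g. a DL group prime): variable time setup is fine.
   auto mod_group_p = Botan::Modular_Reducer::for_public_modulus(group_p);

   // Secret modulus (e.g. an RSA prime): setup must stay constant time.
   auto mod_rsa_p = Botan::Modular_Reducer::for_secret_modulus(rsa_p);

   // Either way, the resulting object can be cached (e.g. in DL_Group or the
   // RSA key data) and reused rather than recomputed per operation.
   (void)mod_group_p;
   (void)mod_rsa_p;
}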

Overall this reduces the cost of repeatedly parsing an RSA 4096 bit private key to about 1/4 of the current cost on master.
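
For reference, a sketch of the kind of measurement loop described, using Botan's public PKCS8 API for unencrypted keys (the key file name and iteration count are made up):

#include <botan/data_src.h>
#include <botan/pkcs8.h>

#include <chrono>
#include <cstddef>
#include <iostream>

int main() {
   const size_t iterations = 1000;
   const auto start = std::chrono::steady_clock::now();

   for(size_t i = 0; i != iterations; ++i) {
      // Re-parse the same 4096-bit RSA key each iteration; with the old code
      // the constant time division in Barrett setup dominates this loop.
      Botan::DataSource_Stream in("rsa4096_private.pem");  // hypothetical key file
      auto key = Botan::PKCS8::load_key(in);
   }

   const auto end = std::chrono::steady_clock::now();
   const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
   std::cout << iterations << " parses took " << ms << " ms\n";
   return 0;
}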

@randombit randombit requested a review from reneme on January 23, 2025 23:02
@randombit randombit added this to the Botan 3.7.0 milestone on Jan 23, 2025
@randombit randombit force-pushed the jack/faster-redc-setup branch from aa41ebf to d6c0271 on January 23, 2025 23:04

@coveralls coveralls commented Jan 23, 2025

Coverage Status: 91.261% (+0.01%) from 91.248%, when pulling e67f697 on jack/faster-redc-setup into 19ed81b on master.

@randombit randombit force-pushed the jack/faster-redc-setup branch 2 times, most recently from 0a195c8 to 3c17c73 on January 26, 2025 16:23
@reneme reneme (Collaborator) left a comment:

Logic looks good to me. While we're on the topic of reducing overhead, I suggested plenty of places that could benefit from pass-by-value and std::move(). Perhaps this could be pushed into a follow-up PR, though. Your call... :)

src/lib/pubkey/blinding/blinding.cpp

@@ -10,10 +10,11 @@
namespace Botan {

Blinder::Blinder(const BigInt& modulus,
                 const Modular_Reducer& reducer,
reneme (Collaborator): If possible, take it by value and move it into m_reducer.

randombit (Owner, author): Here we already have a (somewhat undocumented) lifetime restriction on the Blinder: it captures a reference to the RandomNumberGenerator& and expects it to live as long as the Blinder itself does. We can do the same for the Modular_Reducer&.
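
To illustrate the contract being described, a standalone sketch with stand-in types (not Botan's actual class layout): both objects are held by reference, so the caller must keep them alive for as long as the Blinder exists.

// Stand-ins for Botan's Modular_Reducer and RandomNumberGenerator.
struct Reducer {};
struct RNG {};

class Blinder_Sketch {
   public:
      // Takes references only; no copies are made and no ownership is taken.
      Blinder_Sketch(const Reducer& reducer, RNG& rng) : m_reducer(reducer), m_rng(rng) {}

   private:
      const Reducer& m_reducer;  // caller must keep the reducer alive
      RNG& m_rng;                // same pre-existing contract for the RNG
};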

BigInt t = BigInt::with_capacity(y_words + 1); // a temporary

r.set_bit(y_bits - 1);
for(size_t i = y_bits; i != x_bits; ++i) {
reneme (Collaborator): Perhaps it makes sense to explicitly BOTAN_ASSERT that y_bits <= x_bits, to ensure that this doesn't run into an endless loop.

randombit (Owner, author): Yes, it does.

randombit (Owner, author): Actually, we are allowed to (indeed must) leak the size of the values to side channels, so we can just check the bit lengths and correctly return zero if 2^k < y.
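
A hedged sketch of that early exit (the function name and structure are illustrative, not the actual Botan code; the final division is a placeholder for the specialized routine discussed above): operand sizes are treated as public, so the bit-length comparison can be done in variable time.

#include <botan/bigint.h>

// Returns floor(2^k / y), short-circuiting to zero when 2^k < y.
Botan::BigInt divide_pow2k_or_zero_sketch(size_t k, const Botan::BigInt& y) {
   const size_t x_bits = k + 1;        // 2^k has exactly k + 1 bits
   if(x_bits < y.bits()) {
      return Botan::BigInt::zero();    // quotient is zero; no loop needed
   }
   // Otherwise run the bit-at-a-time division over the remaining bits.
   return Botan::BigInt::power_of_2(k) / y;
}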

@@ -94,7 +127,7 @@ void ct_divide_word(const BigInt& x, word y, BigInt& q_out, word& r_out) {

   const auto r_carry = CT::Mask<word>::expand_top_bit(r);

-   r *= 2;
+   r <<= 1;
reneme (Collaborator): Does it make sense to do this explicitly? I would have expected the compiler to do that anyway.

randombit (Owner, author): I was just following the same pattern as elsewhere, for consistency.

Also, in this context, since we are accumulating into r bit-by-bit as the loop progresses, a shift is more logical than a multiplication.

src/lib/math/numbertheory/reducer.cpp (outdated comment, resolved)

private:
Modular_Reducer(const BigInt& m, const BigInt& mu, size_t mw) : m_modulus(m), m_mu(mu), m_mod_words(mw) {}
reneme (Collaborator): Suggested change:

-   Modular_Reducer(const BigInt& m, const BigInt& mu, size_t mw) : m_modulus(m), m_mu(mu), m_mod_words(mw) {}
+   Modular_Reducer(BigInt m, BigInt mu, size_t mw) : m_modulus(std::move(m)), m_mu(std::move(mu)), m_mod_words(mw) {}

src/lib/pubkey/blinding/blinding.cpp (outdated comment, resolved)
Comment on lines +42 to +44
DL_Group_Data(const BigInt& p, const BigInt& g, DL_Group_Source source) :
m_p(p),
m_g(g),
reneme (Collaborator): Suggested change:

-   DL_Group_Data(const BigInt& p, const BigInt& g, DL_Group_Source source) :
-         m_p(p),
-         m_g(g),
+   DL_Group_Data(BigInt p, BigInt g, DL_Group_Source source) :
+         m_p(std::move(p)),
+         m_g(std::move(g)),

... also in the pre-existing constructor above.

Comment on lines +136 to +146
BigInt p, q, g;
ber.decode(p).decode(q).decode(g).verify_end();
return std::make_shared<DL_Group_Data>(p, q, g, source);
} else if(format == DL_Group_Format::ANSI_X9_42) {
BigInt p, g, q;
ber.decode(p).decode(g).decode(q).discard_remaining();
return std::make_shared<DL_Group_Data>(p, q, g, source);
} else if(format == DL_Group_Format::PKCS_3) {
// q is left as zero
BigInt p, g;
ber.decode(p).decode(g).discard_remaining();
return std::make_shared<DL_Group_Data>(p, g, source);
reneme (Collaborator): Ugly, but it is parsing code, after all...

if(format == DL_Group_Format::ANSI_X9_57) {
   BigInt p, q, g;
   ber.decode(p).decode(q).decode(g).verify_end();
   return std::make_shared<DL_Group_Data>(std::move(p), std::move(q), std::move(g), source);
} else if(format == DL_Group_Format::ANSI_X9_42) {
   BigInt p, g, q;
   ber.decode(p).decode(g).decode(q).discard_remaining();
   return std::make_shared<DL_Group_Data>(std::move(p), std::move(q), std::move(g), source);
} else if(format == DL_Group_Format::PKCS_3) {
   BigInt p, g;
   ber.decode(p).decode(g).discard_remaining();
   return std::make_shared<DL_Group_Data>(std::move(p), std::move(g), source);
}

randombit (Owner, author): These and the others seem accurate, but IMO they are not going to move the needle much, since the heavier setup of the Montgomery parameters etc. will dominate. Fine to do, but it's a little afield from the goal of this PR, so I'll pass for now.

return std::make_shared<DL_Group_Data>(p, g, DL_Group_Source::Builtin);
} else {
return std::make_shared<DL_Group_Data>(p, q, g, DL_Group_Source::Builtin);
}
}

//static
reneme (Collaborator): While we're here:

std::shared_ptr<DL_Group_Data> DL_Group::load_DL_group_info(const char* p_str, const char* g_str) {
   auto p = BigInt(p_str);
   auto q = (p - 1) / 2;
   auto g = BigInt(g_str);

   return std::make_shared<DL_Group_Data>(std::move(p), std::move(q), std::move(g), DL_Group_Source::Builtin);
}

Comment on lines 340 to 353
DL_Group::DL_Group(const BigInt& p, const BigInt& g) {
-   m_data = std::make_shared<DL_Group_Data>(p, BigInt::zero(), g, DL_Group_Source::ExternalSource);
+   m_data = std::make_shared<DL_Group_Data>(p, g, DL_Group_Source::ExternalSource);
}

/*
* DL_Group Constructor
*/
DL_Group::DL_Group(const BigInt& p, const BigInt& q, const BigInt& g) {
-   m_data = std::make_shared<DL_Group_Data>(p, q, g, DL_Group_Source::ExternalSource);
+   if(q.is_zero()) {
+      m_data = std::make_shared<DL_Group_Data>(p, g, DL_Group_Source::ExternalSource);
+   } else {
+      m_data = std::make_shared<DL_Group_Data>(p, q, g, DL_Group_Source::ExternalSource);
+   }
}
@reneme reneme (Collaborator), Jan 27, 2025

We can do somewhat better knowing the input is a power of 2.

We can use a much faster variable time division if the modulus is public already, which is the common case.

This also (mostly) eliminates the situation where Modular_Reducer can be uninitialized via passing zero to the constructor; this is a holdover from lacking std::optional.

Also make some changes so Barrett computations can be shared over time, for example by exposing it from DL_Group and accepting it as an argument to Blinder.
@randombit randombit force-pushed the jack/faster-redc-setup branch from 3c17c73 to e67f697 on January 28, 2025 12:00
@randombit randombit merged commit af60450 into master on Jan 28, 2025 (39 checks passed)
@randombit randombit deleted the jack/faster-redc-setup branch on January 28, 2025 13:04