Handle CPU_RNG failures in OCaml: #255

hannesm · 2025-01-08T15:44:49Z

Previously, a return value of "0" was treated specially, but rdrand/rdseed may return 0!
If there's a failure in rdrand/rdseed during cpu_rng_bootstrap, raise an exception
If there's a failure during the periodic reseeding (timer initiated feeding of all pools), ignore it (for now!?)

//cc @edwintorok addresses #251 -- though I'm not yet happy with the cpu_rng failure case // feed_pools error case. I'm uncertain what to do:

do a while true, potentially leading to a busy loop if rdrand/rdseed keeps on giving no data
raise an exception and exit - which may be inconvenient for the application developer
call rdseed/rdrand (the other instruction) as an attempt to delay the error condition

- Previously, a return value of "0" was treated specially, but rdrand/rdseed may return 0! - If there's a failure in rdrand/rdseed during cpu_rng_bootstrap, raise an exception - If there's a failure during the periodic reseeding (timer initiated feeding of all pools), ignore it (for now!?)

src/native/entropy_cpu_stubs.c

edwintorok · 2025-01-08T16:06:05Z

do a while true, potentially leading to a busy loop if rdrand/rdseed keeps on giving no data

Perhaps yield, or do an exponential backoff (requires the ability to sleep for a certain of time which eio/lwt/async does provide but with a different function each) and try again?
That gives the CPU time to accumulate more entropy and avoid failing again.

And to reduce the likelihood of failure see #253, the retry count for RDSEED can be increased, and if RDSEED fails then we could fall back to calling RDRAND a number of times if we follow #252 (rdrand has some built in fairness between cores, whereas rdseed just outright fails often).

raise an exception and exit - which may be inconvenient for the application developer

I don't think we should do that, at least not for RDSEED failures, which is documented to fail, and fails within a couple of seconds very easily. A failure here doesn't mean the CPU is broken, it just means the CPU is busy...

hannesm · 2025-01-08T16:13:01Z

Thanks for your insights. So, for cpu_rng_bootstrap, I think failing hard is fine (this is done at startup anyways). This is as well the thing that prefers to call rdseed.

Now, for periodic feeding of the rng, we use the cpu_rng task - which prefers to use rdrand! Here the failure is currently ignored (and indeed, lwt/async/eio/miou_unix call this periodically -- by default every second.

My questions are now:

if the cpu only has rdrand, should our cpu_rng_bootstrap do more work (call rdrand 1024 times and compress the random, as done in the reference you linked to)?
- related: should if rdseed fails in cpu_rng_bootstrap we go the rdrand route (with multiple calls) instead of raising an exception?
if the cpu only has rdseed, should our periodic task handle that more graceful?
should we increase RETRIES for rdseed to 100?

edwintorok · 2025-01-08T16:22:39Z

this is done at startup anyways)

Startup of a mirage unikernel perhaps, but not startup of the host. There could be any number of other VMs or processes running on the host CPU already, and I think depleting entropy with RDSEED depletes it globally for the whole CPU, not just your core, but I'll have to do some tests to confirm that (i.e. if I busy loop calling rdseed or rdrand on 31 cores, will my 32th core get a chance of a successful rdseed ever?)

and compress the random,

I think that is already taken care by Fortuna. If you give Fortuna all the rdrand values you have it'll figure out what to do with them (IIRC Fortuna doesn't require you to estimate how much entropy you've got in each sample).

f the cpu only has rdrand, should our cpu_rng_bootstrap do more work (call rdrand 1024 times

related: should if rdseed fails in cpu_rng_bootstrap we go the rdrand route (with multiple calls) instead of raising an exception?

if the cpu only has rdseed, should our periodic task handle that more graceful?

should we increase RETRIES for rdseed to 100?

I'd say probably, but I'll need to reread the manual and do some experiments (also I'll need to see whether the AMD SDM has any documentation on this that has other suggestions).

When I looked at this yesterday 100 wasn't enough for RDSEED. Reading between the lines it seems like it'd be possible for some cores to DoS the ability to run RDSEED on other cores ~forever, but I'll need to confirm whether that is true, and whether in that case falling back to RDRAND would help or not.

But regardless of what the manual says we can't trust hardware (and future generations of hardware may behave differently or have different bugs anyway), so we must have some way to handle failure.

I'd say that a failure during reseeding is not critical: if it was safe enough for the RNG to work and generate numbers before, then why wouldn't it be safe to continue as is. And if the answer is that it isn't safe to continue without reseeding then the RNG wasn't seeded well enough in the first place and shouldn't have given out any random numbers until properly seeded.

I'd suggest to use an Atomic.incr to count the number of failures, and someone interested could monitor that counter, set up some alerting from the application, or crash the application if that is what they wish, but the library should leave that choice up to the application.

hannesm · 2025-01-08T16:33:54Z

Thanks again, edwin. With 276ffe2 the rdrand and rdseed calls and failures are exposed now.

Remaining: rdrand when rdseed is unavailable during bootstrap; also resort to rdrand if rdseed is not returning good values?

And how many RETRIES for rdseed.

hannesm · 2025-01-08T16:34:16Z

Thank you very much for your time, comments and testing!

if rdseed fails, take 512 times rdrand. if rdrand fails 512 times (* 10 from the RETRIES in C), fail

hannesm · 2025-01-08T17:05:14Z

With a55afa4 if only rdrand is available or rdseed fails during bootstrap, we call it 512 times.

An error is logged on failure.

The code, as before (in bootstrap) - resorts to using whirlwind.

hannesm · 2025-01-10T14:55:15Z

I adjusted the retries -- rdseed now 100, rdrand still at 10.

From my side, this is ready for review & merge. It may be sensible to cut a release thereafter (with #254 merged as well).

Any opinion @edwintorok?

dinosaure

Seems ok, the errors on the CI seems unrelated.

edwintorok reviewed Jan 8, 2025

View reviewed changes

src/native/entropy_cpu_stubs.c Outdated Show resolved Hide resolved

use memcpy instead of manual copying

21945e6

hannesm force-pushed the cpu-rng-adjustments branch from 2f1743d to 21945e6 Compare January 8, 2025 16:06

entropy: provide rdrand and rdseed calls and failures

276ffe2

cpu_rng_bootstrap: use rdrand if rdseed fails.

a55afa4

if rdseed fails, take 512 times rdrand. if rdrand fails 512 times (* 10 from the RETRIES in C), fail

hannesm force-pushed the cpu-rng-adjustments branch from 0ded58a to a55afa4 Compare January 8, 2025 17:04

rdseed: 100 retries, rdrand: 10 retries

d74c32f

hannesm added 2 commits January 16, 2025 11:21

avoid unused parameter warnings

80a30a5

use Val_false directly

7c7db2b

dinosaure approved these changes Jan 20, 2025

View reviewed changes

reynir approved these changes Jan 21, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle CPU_RNG failures in OCaml: #255

Handle CPU_RNG failures in OCaml: #255

hannesm commented Jan 8, 2025

edwintorok commented Jan 8, 2025 •

edited

Loading

hannesm commented Jan 8, 2025

edwintorok commented Jan 8, 2025

hannesm commented Jan 8, 2025

hannesm commented Jan 8, 2025

hannesm commented Jan 8, 2025

hannesm commented Jan 10, 2025

dinosaure left a comment

Handle CPU_RNG failures in OCaml: #255

Are you sure you want to change the base?

Handle CPU_RNG failures in OCaml: #255

Conversation

hannesm commented Jan 8, 2025

edwintorok commented Jan 8, 2025 • edited Loading

hannesm commented Jan 8, 2025

edwintorok commented Jan 8, 2025

hannesm commented Jan 8, 2025

hannesm commented Jan 8, 2025

hannesm commented Jan 8, 2025

hannesm commented Jan 10, 2025

dinosaure left a comment

Choose a reason for hiding this comment

edwintorok commented Jan 8, 2025 •

edited

Loading