Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle CPU_RNG failures in OCaml: #255

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

hannesm
Copy link
Member

@hannesm hannesm commented Jan 8, 2025

  • Previously, a return value of "0" was treated specially, but rdrand/rdseed may return 0!
  • If there's a failure in rdrand/rdseed during cpu_rng_bootstrap, raise an exception
  • If there's a failure during the periodic reseeding (timer initiated feeding of all pools), ignore it (for now!?)

//cc @edwintorok addresses #251 -- though I'm not yet happy with the cpu_rng failure case // feed_pools error case. I'm uncertain what to do:

  • do a while true, potentially leading to a busy loop if rdrand/rdseed keeps on giving no data
  • raise an exception and exit - which may be inconvenient for the application developer
  • call rdseed/rdrand (the other instruction) as an attempt to delay the error condition

- Previously, a return value of "0" was treated specially, but rdrand/rdseed
  may return 0!
- If there's a failure in rdrand/rdseed during cpu_rng_bootstrap, raise an
  exception
- If there's a failure during the periodic reseeding (timer initiated feeding
  of all pools), ignore it (for now!?)
@edwintorok
Copy link
Contributor

edwintorok commented Jan 8, 2025

  • do a while true, potentially leading to a busy loop if rdrand/rdseed keeps on giving no data

Perhaps yield, or do an exponential backoff (requires the ability to sleep for a certain of time which eio/lwt/async does provide but with a different function each) and try again?
That gives the CPU time to accumulate more entropy and avoid failing again.

And to reduce the likelihood of failure see #253, the retry count for RDSEED can be increased, and if RDSEED fails then we could fall back to calling RDRAND a number of times if we follow #252 (rdrand has some built in fairness between cores, whereas rdseed just outright fails often).

  • raise an exception and exit - which may be inconvenient for the application developer

I don't think we should do that, at least not for RDSEED failures, which is documented to fail, and fails within a couple of seconds very easily. A failure here doesn't mean the CPU is broken, it just means the CPU is busy...

@hannesm hannesm force-pushed the cpu-rng-adjustments branch from 2f1743d to 21945e6 Compare January 8, 2025 16:06
@hannesm
Copy link
Member Author

hannesm commented Jan 8, 2025

Thanks for your insights. So, for cpu_rng_bootstrap, I think failing hard is fine (this is done at startup anyways). This is as well the thing that prefers to call rdseed.

Now, for periodic feeding of the rng, we use the cpu_rng task - which prefers to use rdrand! Here the failure is currently ignored (and indeed, lwt/async/eio/miou_unix call this periodically -- by default every second.

My questions are now:

  • if the cpu only has rdrand, should our cpu_rng_bootstrap do more work (call rdrand 1024 times and compress the random, as done in the reference you linked to)?
    • related: should if rdseed fails in cpu_rng_bootstrap we go the rdrand route (with multiple calls) instead of raising an exception?
  • if the cpu only has rdseed, should our periodic task handle that more graceful?
  • should we increase RETRIES for rdseed to 100?

@edwintorok
Copy link
Contributor

this is done at startup anyways)

Startup of a mirage unikernel perhaps, but not startup of the host. There could be any number of other VMs or processes running on the host CPU already, and I think depleting entropy with RDSEED depletes it globally for the whole CPU, not just your core, but I'll have to do some tests to confirm that (i.e. if I busy loop calling rdseed or rdrand on 31 cores, will my 32th core get a chance of a successful rdseed ever?)

  • and compress the random,

I think that is already taken care by Fortuna. If you give Fortuna all the rdrand values you have it'll figure out what to do with them (IIRC Fortuna doesn't require you to estimate how much entropy you've got in each sample).

  • f the cpu only has rdrand, should our cpu_rng_bootstrap do more work (call rdrand 1024 times

    • related: should if rdseed fails in cpu_rng_bootstrap we go the rdrand route (with multiple calls) instead of raising an exception?
  • if the cpu only has rdseed, should our periodic task handle that more graceful?

  • should we increase RETRIES for rdseed to 100?

I'd say probably, but I'll need to reread the manual and do some experiments (also I'll need to see whether the AMD SDM has any documentation on this that has other suggestions).

When I looked at this yesterday 100 wasn't enough for RDSEED. Reading between the lines it seems like it'd be possible for some cores to DoS the ability to run RDSEED on other cores ~forever, but I'll need to confirm whether that is true, and whether in that case falling back to RDRAND would help or not.

But regardless of what the manual says we can't trust hardware (and future generations of hardware may behave differently or have different bugs anyway), so we must have some way to handle failure.

I'd say that a failure during reseeding is not critical: if it was safe enough for the RNG to work and generate numbers before, then why wouldn't it be safe to continue as is. And if the answer is that it isn't safe to continue without reseeding then the RNG wasn't seeded well enough in the first place and shouldn't have given out any random numbers until properly seeded.

I'd suggest to use an Atomic.incr to count the number of failures, and someone interested could monitor that counter, set up some alerting from the application, or crash the application if that is what they wish, but the library should leave that choice up to the application.

@hannesm
Copy link
Member Author

hannesm commented Jan 8, 2025

Thanks again, edwin. With 276ffe2 the rdrand and rdseed calls and failures are exposed now.

Remaining: rdrand when rdseed is unavailable during bootstrap; also resort to rdrand if rdseed is not returning good values?

And how many RETRIES for rdseed.

@hannesm
Copy link
Member Author

hannesm commented Jan 8, 2025

Thank you very much for your time, comments and testing!

if rdseed fails, take 512 times rdrand.

if rdrand fails 512 times (* 10 from the RETRIES in C), fail
@hannesm hannesm force-pushed the cpu-rng-adjustments branch from 0ded58a to a55afa4 Compare January 8, 2025 17:04
@hannesm
Copy link
Member Author

hannesm commented Jan 8, 2025

With a55afa4 if only rdrand is available or rdseed fails during bootstrap, we call it 512 times.

An error is logged on failure.

The code, as before (in bootstrap) - resorts to using whirlwind.

@hannesm
Copy link
Member Author

hannesm commented Jan 10, 2025

I adjusted the retries -- rdseed now 100, rdrand still at 10.

From my side, this is ready for review & merge. It may be sensible to cut a release thereafter (with #254 merged as well).

Any opinion @edwintorok?

Copy link
Member

@dinosaure dinosaure left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems ok, the errors on the CI seems unrelated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants