Pseudorandom numbers are sometimes the same across processes

### Current behavior 😯

#### Background

Fast pseudorandom numbers are used in a few places where they probably need not be cryptographically secure, but where the generated bytes are intended to be different across any kind of concurrent tasks, including across separate processes. Existing uses in gitoxide--but not necessarily all of their callers--are as follows:

1. [`gix_utils::backoff::Quadratic::default_with_random`](https://github.com/GitoxideLabs/gitoxide/blob/ffb73b5f69dbe86ff88f1c473af65f368a6bcbe5/gix-utils/src/backoff.rs#L33-L40) supplies a series of increasing timings that are adjusted with pseudorandom [jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/), so concurrent tasks don't all retry at the same time. Its [`randomize`](https://github.com/GitoxideLabs/gitoxide/blob/ffb73b5f69dbe86ff88f1c473af65f368a6bcbe5/gix-utils/src/backoff.rs#L3-L10) helper computes:

   https://github.com/GitoxideLabs/gitoxide/blob/ffb73b5f69dbe86ff88f1c473af65f368a6bcbe5/gix-utils/src/backoff.rs#L4

   `default_with_random` is public, so there may be non-gitoxide crates that use it directly, probably in the same kind of scenarios it is used here. Here, it is used to avoid some cases of the [thundering herd](https://en.wikipedia.org/wiki/Thundering_herd_problem) problem: when accessing a network, at least [in `gix-testtools`](https://github.com/GitoxideLabs/gitoxide/blob/ffb73b5f69dbe86ff88f1c473af65f368a6bcbe5/tests/tools/src/lib.rs#L223); as well as [in `gix-lock`](https://github.com/GitoxideLabs/gitoxide/blob/ffb73b5f69dbe86ff88f1c473af65f368a6bcbe5/gix-lock/src/acquire.rs#L179), which runs in production.

   (Per [gitpython-developers/gitdb#115 (comment)](https://github.com/gitpython-developers/gitdb/pull/115#discussion_r1903215598) and [#1815](https://github.com/GitoxideLabs/gitoxide/pull/1815), this was recently renamed from `Exponential` to `Quadratic`. But that was just a name change, with no effect on this issue.)

2. [`gix_fs::capabilities::Capabilities::probe`](https://github.com/GitoxideLabs/gitoxide/blob/ffb73b5f69dbe86ff88f1c473af65f368a6bcbe5/gix-fs/src/capabilities.rs#L48-L56) includes a check for the ability to create symlinks in the `git_dir`, done by its [`probe_symlink`](https://github.com/GitoxideLabs/gitoxide/blob/ffb73b5f69dbe86ff88f1c473af65f368a6bcbe5/gix-fs/src/capabilities.rs#L105-L115) helper with a filename produced by:

   https://github.com/GitoxideLabs/gitoxide/blob/ffb73b5f69dbe86ff88f1c473af65f368a6bcbe5/gix-fs/src/capabilities.rs#L106-L107

   As discussed in [#1789](https://github.com/GitoxideLabs/gitoxide/discussions/1789), a collision in the filename, even across processes, causes the probe to wrongly report that symlinks are not supported.

3. In addition to its use of backoff with pseudorandom jitter mentioned above, [`gix_testtools::spawn_git_daemon`](https://github.com/GitoxideLabs/gitoxide/blob/ffb73b5f69dbe86ff88f1c473af65f368a6bcbe5/tests/tools/src/lib.rs#L204-L233) also shuffles the order of ports to try listening on:

   https://github.com/GitoxideLabs/gitoxide/blob/ffb73b5f69dbe86ff88f1c473af65f368a6bcbe5/tests/tools/src/lib.rs#L206-L208

   Unlike (1) and (2), I think it is not really a problem here if the same orders are sometimes used. But performance is slightly better if that is avoided.

#### The bug

As currently implemented, we occasionally generate the same sequence of pseudorandom numbers across concurrently running processes. This was observed in #1789 to affect (2) when running parallel tests, but in addition to affecting the symlink capability probe, it should be expected occasionally to affect cases (1) and (3). Case (1) is overall the most important.

See "The cause" subsection below on why this happens, and the "Steps to reproduce" section or #1789 for details of the recently observed failure due to (2).

Although I was only able to cite one occurrence of (2), I vaguely recalled having seen similar errors. Since the test case that failed only started asserting that the probe returned `true` a couple months ago, occasional failures may have been different or absent in the past. For these reasons, I estimate that (2) has occurred on CI about once a month for an extended time, but this estimate is extremely low confidence.

In contrast to (2), I would not expect our existing automated tests ever to surface (1) or (3):

- While case (1) is the most important, it is probably both less severe and far less common than the previous known thundering herd condition ([praetorian-inc/noseyparker#179](https://github.com/praetorian-inc/noseyparker/issues/179), addf446f052ff74edcdb083f2b2968b313daa940, [#1412](https://github.com/GitoxideLabs/gitoxide/discussions/1412)). That bug was not reproduced by the project's automated tests (though I believe a manual approach with `hyperfine` was able to reproduce it). I would expect this issue to be encountered less often than that.
- Case (3) only occurs in tests, though it could occur in other projects' test suites if they use `gix-testtools` (or in non-test code in the rare case that another project uses `gix-testtools` for something other than tests). But it is hard to observe because it doesn't really cause a problem: if and when the collision occurs, the effect is at most a small slowdown.
- In contrast, while (2) is not limited to tests, it probably most often occurs in them. It is also more observable when it does occur, because a single collision causes a test failure, rather than requiring multiple collisions to cause a serious problem as in (1), or never causing a serious problem as in (3).

#### The cause

This happens because we use `fastrand` with the per-thread generators it provides, [which use](https://github.com/smol-rs/fastrand/blob/v2.3.0/src/global_rng.rs) seeds produced on most platforms by hashing together monotonic clock time and thread ID, but nothing that inherently varies across concurrently running *processes*:

```rust
std::thread_local! {
    static RNG: Cell<Rng> = Cell::new(Rng(random_seed().unwrap_or(DEFAULT_RNG_SEED)));
}
```

```rust
#[cfg(not(all(
    any(target_arch = "wasm32", target_arch = "wasm64"),
    target_os = "unknown"
)))]
fn random_seed() -> Option<u64> {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    use std::thread;
    use std::time::Instant;

    let mut hasher = DefaultHasher::new();
    Instant::now().hash(&mut hasher);
    thread::current().id().hash(&mut hasher);
    Some(hasher.finish())
}
```

Thread IDs, as returned by `std::thread::current().id()`, are often small numbers like 1. They are frequently the same between threads of separate, concurrently running processes.

### Expected behavior 🤔

To avoid even rare cases of the thundering herd problem when used to generate jitter for backoff--as well as to avoid other glitches such as collisions when probing for the ability to create symlinks--even pseudorandom numbers that do not need to be cryptographically secure should be generated in a way that does not depend on state that is likely to be shared across concurrently running pseudorandom generations.

#### Possible solutions

That leaves open a wide range of possible solutions. Assuming everything that needs cryptographically secure random numbers is already implemented in a crate that uses CSPRNGs rather than `fastrand`--which seems to be the case--I think that:

- An extremely fast, cryptographically insecure *algorithm* may still be the way to go here. The `fastrand` crate could still be used, unless something else turns out to be more convenient.
- However, a higher entropy *seed* should be used.

I don't know if we could continue using the per-thread PRNGs `fastrand` arranges, though. A `fastrand::Rng` can be reseeded, including an `Rng` that `fastrand` provides for a thread. But this would have to be done on each thread whose generator is used. Furthermore, since there could be many threads and higher-entropy sources are probably always at least a little slower than the above approach, each thread's PRNG should not be seeded until it is first used.

At first glance, it feels like it would be enough to do even *slightly* better with the seed. However, I think this actually requires as unpredictable a seed--or, rather, 8 bytes of seed material per thread, to continue using shared thread-local PRNGs--as can practically be obtained without heavy delays.

#### Why not just mix in the PID?

Suppose we do something like `fastrand` does for its per-thread PRNGs, but also mix in the current process ID. That seems like the simple and obvious thing to do, given that the observed problem is that processes are not distinguished. PIDs are reused, but not concurrently.

I think this is insufficient for two reasons:

- Some targets (WASM?) may not have access to PIDs, so this is not a complete solution.
- Concurrent processes can run in different PID namespaces. If different instances are run in separate containers, even within in the same system, they could occasionally have not only the same monotonic clock times and thread IDs, as has been observed, but also the same process IDs.

The second reason is more important because it suggests that seeding with monotonic clock time, thread ID, and PID would not be desirable even if a backup strategy were used for cases where the PID is unavailable.

Another way to describe the second reason is that just mixing in the PID would turn a collision that happens between separate processes into a collision that (still) happens across separate containers. This might, in practice, fix (2), if the containers don't write to the same volume. But it would not be a complete fix for (1).

However, this would be a significant mitigation even in (1). So if all other approaches turn out to involve dependencies that need to be avoided in some cases, then a new default feature could be added for the primary solution, and mixing in the PID might then be considered as a partial fallback solution when that feature is disabled.

#### We can use `getrandom` for the seed

The [`getrandom`](https://crates.io/crates/getrandom) library obtains random data from the operating system in a way that is intended to be as robust and unpredictable as possible, typically using the best source of randomness that is available in each environment.

There are a number of tradeoffs between robustness and speed. That library navigates them. For example, on some Unix-like systems, it does a blocking read on `/dev/random` to avoid returning data that are insufficiently random because the system entropy pool is not yet adequately seeded, but then uses `/dev/urandom` to avoid further blocking that could be excessive.

For some platforms, it requires explicit build configuration to work. However, that is rare. I think this only applies to WASM targets with `unknown-unknown`. I think the only conditions under which this happens are those where `fastrand` is already using a fixed seed.

To be clear, I am not advocating generating our random numbers with `getrandom`, but only the seed. Unless a reason is found for the uses listed above to use cryptographically secure random numbers, the algorithm being seeded could continue to be a cryptographically insecure but extremely fast algorithm. `fastrand` could continue to be used. If its interface is less convenient when the seeds are computed separately, then some other libraries could be considered (and there are some other Rust libraries that use Wyrand, the algorithm `fastrand` uses).

#### Can we directly read `/dev/urandom` instead on Unix-like systems?

Since we don't need cryptographically secure random numbers, it seems like the `getrandom` crate dependency could be avoided on Unix-like systems, by reading seed bytes directly from `/dev/urandom` instead, and that this would on rare occasion even be slightly faster. It seems like that would be okay even if the system entropy pool is not, and was never, adequately seeded.

My guess is that this would be okay, but I am not sure. Besides (or sometimes together with) containers, one way someone might run the same or related programs many times concurrently is when deploying a task to many virtual machines. Normally that would be fine, but it may be that it is done early on after booting the virtual machines, which might be in use only for that task. This is relevant because, [as the `getrandom` readme says](https://github.com/rust-random/getrandom?tab=readme-ov-file#early-boot):

> Sometimes, early in the boot process, the OS has not collected enough entropy to securely seed its RNG. This is especially common on virtual machines, where standard "random" events are hard to come by.

So I worry that this approach would fix the problem across processes and containers but not fix it across virtual machines, and that this could actually in rare cases cause problems in practice. I *think* that this not a problem for our use case, because:

- Even if the OS cannot *securely seed* its pseudorandom number generator, reading still may give different data, especially if the separate reads are on the same machine, but even, I think, on different machines.
- The code in gitoxide that uses random numbers seems unlikely to be run "early in the boot process." At minimum, for a thundering herd problem involving a network resource, it would have to be after network interfaces are operational.

But I am not sure. This is assuming we have non-adversarial unpredictability even when we don't have adversarial unpredictability. That assumption may be wrong. Since implementing more strategies is more complicated than implementing fewer strategies, and it will be either hard or infeasible to test in the test suite, it should probably only be done if we are confident it will work properly. I am not confident.

#### Why we shouldn't just mix in some weird thing that happens to vary

We can't rely on things that seem to change, but that are not required to change, being different, and oddities will vary by target. Even attempting this would be complicated and probably involve more code that it would take to reimplement substantial fragments of libraries.

(Secondarily, I would worry about using anything that is not routinely exposed already and that might be in use to seed [ASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization), or to reveal its seed data. This includes the numerical addresses of objects, as the running process sees them, but it may include other information that varies across separate processes in a system-dependent way.)

#### Is it a bug in `fastrand` that its seeds clash across processes?

*Edit:* Clarified the title of this subsection.

I don't know if this would be considered a bug in how `fastrand` seeds its per-thread PRNGs by default. But my guess is that this may not be considered a bug, because while that functionality does depend on the `std` feature, it is intentionally pretty minimalistic, whereas even adding in the process IDs seems like it would be more extensive and not always available. I'm not sure, though.

### Git behavior

#### The `csprng_bytes` and `git_rand` functions

Outside of operations performed with CSPRNGs in cryptographic libraries, and outside of the test suite, Git uses pseudorandom numbers in a few ways. It obtains them, primarily from cryptographic libraries, though its `csprng_bytes` function, and its associated `git_rand` convenience function:

```text
$ git grep -Fn -e git_rand -e csprng -- '*.c' ':!t/'
builtin/gc.c:1912:      return git_rand(0) % 60;
reftable/stack.c:496:           delay = delay + (delay * git_rand(CSPRNG_BYTES_INSECURE)) / UINT32_MAX + 1;
reftable/stack.c:662:   uint32_t rnd = git_rand(CSPRNG_BYTES_INSECURE);
wrapper.c:482:          if (csprng_bytes(&v, sizeof(v), 0) < 0)
wrapper.c:753:int csprng_bytes(void *buf, size_t len, MAYBE_UNUSED unsigned flags)
wrapper.c:823:uint32_t git_rand(unsigned flags)
wrapper.c:827:  if (csprng_bytes(&result, sizeof(result), flags) < 0)
```

`git_rand` makes a 4-byte integer from data from `csprng_bytes`. Both functions are [declared in `wrapper.h`](https://github.com/git/git/blob/5f8f7081f7761acdf83d0a4c6819fe3d724f01d7/wrapper.h#L138-L149) and [defined in `wrapper.c`](https://github.com/git/git/blob/5f8f7081f7761acdf83d0a4c6819fe3d724f01d7/wrapper.c#L753-L831). In both functions, the parameter is not a seed, but an option controlling how robust the output is required to be.

When Git probes for the ability to create symlinks, it does not use those facilities.

#### Quality of pseudorandom numbers

As its name implies, `csprng_bytes` is primarily a cryptographically secure pseudorandom number generator. It uses several strategies. For at least one of those strategies, it is possible to fall back to less secure output when secure output is not available.

Whether this will be done depends on whether the `CSPRNG_BYTES_INSECURE` flag is passed. Currently it is the only recognized flag, so calls to `csprng_bytes`--or `git_rand`, which passes the flag along--pass either `0` or `CSPRNG_BYTES_INSECURE`.

Even when `CSPRNG_BYTES_INSECURE` is passed, such as when choosing delays in `reftable/stack.c`, I believe the pseudorandom data are well distributed and unlikely to cause a thundering herd condition or other clash-related problems.

Separately from that flag, if `git` is compiled without any of the cryptographic libraries it prefers to use for generating pseudorandom numbers, it will attempt to read from `/dev/urandom`. If this were somehow to happen during early boot, then the data could be predictable on some systems, but I *think* would not be the same across runs, as discussed above in "Expected behavior."

The [`RtlGenRandom`](https://learn.microsoft.com/en-us/windows/win32/api/ntsecapi/nf-ntsecapi-rtlgenrandom) case (`HAVE_RTLGENRANDOM` in `csprng_bytes`) applies to Windows. To the best of my knowledge, it is always available to `git` on Windows, but I have not investigated this.

#### The symlink probe

This is for comparison to case (2) presented above in the "Current behavior" section.

Git probes to see if it is able to create a symbolic link [like this](https://github.com/git/git/blob/5f8f7081f7761acdf83d0a4c6819fe3d724f01d7/setup.c#L2386-L2395), creating a regular file in a collision-resistant way and then deleting it and attempting to make a symlink of the same name:

```c
/* Check if symlink is supported in the work tree */
path = git_path_buf(&buf, "tXXXXXX");
if (!close(xmkstemp(path)) &&
	!unlink(path) &&
	!symlink("testing", path) &&
	!lstat(path, &st1) &&
	S_ISLNK(st1.st_mode))
	unlink(path); /* good */
else
	git_config_set("core.symlinks", "false");
```

That uses the `xmkstemp` function, defined [in `wrapper.c`](https://github.com/git/git/blob/5f8f7081f7761acdf83d0a4c6819fe3d724f01d7/wrapper.c#L422-L442). `xmkstemp` delegates to `mkstemp` (which is usually part of libc). `xmkstemp` checks for and bails out with a useful error message, but it does not customize how the temporary file is created, nor how its name is chosen.

An `mkstemp` implementation usually uses a PRNG, though sometimes not a very high quality one. However, it also performs an operation that is only expected to fail due to a collision, and it retries a number of times with different names when there is a collision.

Some other `mkstemp`-related functions that do not delegate to the system `mkstemp` function are defined in `wrapper.c`, in both the upstream Git repository and in Git for Windows. In particular, `git_mkstemp_mode` and `git_mkstemps_mode` accept a mode with which to create the temporary file. These functions are implemented fully in `wrapper.c`. They use `csprng_bytes` to fill in the `X` placeholders in their `pattern` arguments.

I believe it is only on Windows that the `mkstemp` implementation is part of Git itself, where it is [defined in `mingw.c`](https://github.com/git-for-windows/git/blob/0c796d3013a57e8cc894c152f0200107226e5dd1/compat/mingw.c#L1441-L1444). It delegates to `git_mkstemp_mode`, passing a mode of 0600.

### Steps to reproduce 🕹

This is hard to reproduce intentionally, and to my knowledge only (2) has been observed so far. This is as reported in #1789. In that question, I had thought this was a Windows-specific problem, but I no longer expect that to be so. For completeness, the CI output shown there is:

```text
        FAIL [   0.018s] gix-worktree-state-tests::worktree state::checkout::overwriting_files_and_lone_directories_works
──── STDOUT:             gix-worktree-state-tests::worktree state::checkout::overwriting_files_and_lone_directories_works

running 1 test
test state::checkout::overwriting_files_and_lone_directories_works ... FAILED

failures:

failures:
    state::checkout::overwriting_files_and_lone_directories_works

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 14 filtered out; finished in 0.00s

──── STDERR:             gix-worktree-state-tests::worktree state::checkout::overwriting_files_and_lone_directories_works
thread 'state::checkout::overwriting_files_and_lone_directories_works' panicked at gix-worktree-state\tests\state\checkout.rs:182:9:
The probe must detect to be able to generate symlinks
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

That assertion was introduced in d715e4a (#1652):

https://github.com/GitoxideLabs/gitoxide/blob/ffb73b5f69dbe86ff88f1c473af65f368a6bcbe5/gix-worktree-state/tests/state/checkout.rs#L182

So if the problem occurred before that when running the test suite, on CI or otherwise, it would either have caused a different (possibly less obvious) error, or not caused an error in the tests.

The most important case is (1) and it might be good to try to reproduce it intentionally. Maybe an approach with `hyperfine` could attempt this. I am not sure.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Pseudorandom numbers are sometimes the same across processes #1816

Current behavior 😯

Background

The bug

The cause

Expected behavior 🤔

Possible solutions

Why not just mix in the PID?

We can use `getrandom` for the seed

Can we directly read `/dev/urandom` instead on Unix-like systems?

Why we shouldn't just mix in some weird thing that happens to vary

Is it a bug in `fastrand` that its seeds clash across processes?

Git behavior

The `csprng_bytes` and `git_rand` functions

Quality of pseudorandom numbers

The symlink probe

Steps to reproduce 🕹

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

	let rand = fastrand::usize(..);
	let link_path = root.join(format!("__file_link{rand}"));

	let mut ports: Vec<_> = (9419u16..9419 + 100).collect();
	fastrand::shuffle(&mut ports);
	let addr_at = \|port\| std::net::SocketAddr::from(([127, 0, 0, 1], port));

Uh oh!

Pseudorandom numbers are sometimes the same across processes #1816

Description

Current behavior 😯

Background

The bug

The cause

Expected behavior 🤔

Possible solutions

Why not just mix in the PID?

We can use getrandom for the seed

Can we directly read /dev/urandom instead on Unix-like systems?

Why we shouldn't just mix in some weird thing that happens to vary

Is it a bug in fastrand that its seeds clash across processes?

Git behavior

The csprng_bytes and git_rand functions

Quality of pseudorandom numbers

The symlink probe

Steps to reproduce 🕹

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

We can use `getrandom` for the seed

Can we directly read `/dev/urandom` instead on Unix-like systems?

Is it a bug in `fastrand` that its seeds clash across processes?

The `csprng_bytes` and `git_rand` functions