Skip to content

choose_multiple should return Result or Option #1620

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
serhii-havrylov opened this issue Apr 1, 2025 · 12 comments · May be fixed by #1632
Open

choose_multiple should return Result or Option #1620

serhii-havrylov opened this issue Apr 1, 2025 · 12 comments · May be fixed by #1632
Labels
X-bug Type: bug report

Comments

@serhii-havrylov
Copy link

serhii-havrylov commented Apr 1, 2025

I mistakenly assumed that choose_multiple functions like Python's random.choices, which performs sampling with replacement. This led to unexpected behaviour in the following Rust code:

let sample =  5..20.choose_multiple(&mut rng, my_vec.len())

Here, my_vec contains significantly more than 15 elements, but sample ends up with only 15 elements. This behaviour can be surprising and may introduce bugs.

Since choose_multiple performs sampling without replacement, it would make more sense for it to return a Result or Option to return error or None when the requested number of elements cannot be selected without replacement.

@serhii-havrylov serhii-havrylov added the X-bug Type: bug report label Apr 1, 2025
@dhardy
Copy link
Member

dhardy commented Apr 1, 2025

You are talking about IteratorRandom::choose_multiple? The documentation states:

The length of the returned vector equals amount unless the iterator contains insufficient elements, in which case it equals the number of elements available.

Hence your suggestion would be a breaking change to both the API and the specification. I am not convinced that your motivation is sufficient.

Also, while it is common in Python (and Matlab and many other scripting languages) to specify a length or matrix dimension in order to generate multiple samples, this is not common practice in languages like Rust where it is possible to sequentially generate multiple samples without (significant) overhead (especially when using a distribution object like Uniform or Choose).

@dhardy dhardy closed this as completed Apr 1, 2025
@dhardy
Copy link
Member

dhardy commented Apr 1, 2025

Note: I don't consider the name choose_multiple ideal, but don't see a good alternative that doesn't add significant length.

@serhii-havrylov
Copy link
Author

serhii-havrylov commented Apr 1, 2025

I am not convinced that your motivation is sufficient.

Fair enough, but it would have been nice to have the compiler to encourage me to think about this, rather than simply hoping the user's read the documentation. I'm fairly certain that there's a significant overlap between people using the rand crate and those coming to Rust from a Python background. As a result, the current behaviour could potentially be bug-prone.

@dhardy
Copy link
Member

dhardy commented Apr 1, 2025

Fair point.

Perhaps it should be renamed to choose_without_replacement.

Given that we also have variants of this method with mut, fill and weighted suffixes, the above name is too long though.

@dhardy dhardy reopened this Apr 1, 2025
@serhii-havrylov
Copy link
Author

serhii-havrylov commented Apr 1, 2025

Perhaps methods like choose_multiple_distinct or choose_multiple_unique would be somewhat shorter.

On a tangential note, why are these methods for sampling without replacement named with choose? There's already rand::seq::index::sample, which also performs sampling without replacement. Using sample as the method name would be more consistent with Python's naming conventions and might reduce confusion for people coming from python.

@vks
Copy link
Collaborator

vks commented Apr 2, 2025

I support renaming choose_* to sample_*, which seems clearer to me. I agree that "choose" suggests sampling with replacement.

I'm also not convinced we should return a result or option, because returning None seems redundant to returning an empty Vec.

@dhardy
Copy link
Member

dhardy commented Apr 7, 2025

I agree that "choose" suggests sampling with replacement.

But seq.choose_multiple(rng, 1) is mostly equivalent to seq.choose(rng): neither modifies seq.

We could go with an abbreviated variant of @serhii-havrylov's suggestion: seq.choose_distinct(rng, amount). Or seq.sample_distinct(rng, amount).

@dhardy
Copy link
Member

dhardy commented Apr 10, 2025

So, we have:

  • fn IteratorRandom::choose (one elt, consumes iterator)
  • fn IteratorRandom::choose_stable (variant of above whose result is unaffected by Iterator::size_hint)
  • fn IteratorRandom::choose_multiple (multiple elts, obviously distinct)
  • fn IteratorRandom::choose_multiple_fill (variant of above using an output buffer)
  • fn IndexedRandom::choose (one elt, does not modify sequence)
  • fn IndexedRandom::choose_multiple (multiple distinct elts)
  • fn IndexedRandom::choose_multiple_array (variant of above returning an array, for uses requiring a small fixed number of results)
  • fn IndexedRandom::choose_weighted (one elt, biased)
  • fn IndexedRandom::choose_multiple_weighted (multiple distinct elts, biased)

For iterators, the word "choose" intuitively makes sense to me: the RNG chooses one or multiple elements to keep and discards the rest.

For sequences, the concept carries over well enough for single elements but is less clear for multiple elements. The word "sample" is perhaps more appropriate given that the operation does not modify the input sequence.

I suggest some modifications (to IndexedRandom methods only): rename choosesample, rename existing usages of multipledistinct; add sample_iter and adjust sample_weighted:

  • fn IndexedRandom::sample(&self, rng: &mut R) -> Option<&Self::Output>
  • fn IndexedRandom::sample_iter(&self, rng: &mut R) -> SliceUniformIter<'_, Self, Self::Output>
  • fn IndexedRandom::sample_distinct(&self, rng: &mut R, amount: usize) -> SliceIndexIter<'_, Self, Self::Output>
  • fn IndexedRandom::sample_distinct_array(&self, rng: &mut R) -> Option<[Self::Output; N]>
  • fn IndexedRandom::sample_weighted(&self, rng: &mut R, weight: F) -> Result<SliceWeightedIter<'_, Self, Self::Output>, WeightError>
  • fn IndexedRandom::sample_distinct_weighted(&self, rng: &mut R, weight: F) -> Result<SliceIndexIter<'_, Self, Self::Output>, WeightError>

@serhii-havrylov
Copy link
Author

I just wanted to add my two cents as someone transitioning to Rust from Python. While I agree that choose makes sense for iterators - since you're making a choice to either keep or discard an element one at a time - it can still be a potential pitfall for Python users due to the "false friend" issue. In Python, choices is named that way because you're making repeated "choices", where each choice can potentially be the same as a previous one, highlighting the possibility of repetition.

Additionally, when writing the actual code, it can sometimes be immediately unclear from the code whether you're working with the iterator or the collection itself. This ambiguity can lead to confusion, especially for those familiar with Python's approach.

@dhardy
Copy link
Member

dhardy commented Apr 10, 2025

Thanks for the feedback. So you're saying that it might make sense to:

  • Use choose for an iterator and for multiple distinct element sampling
  • Use choices for sampling multiple elements with replacement

In that case we get:

  • fn IndexedRandom::choose (one elt)
  • fn IndexedRandom::choices (multiple with replacement, likely as an iterator of arbitrary length)
  • fn IndexedRandom::weighted_choices
  • fn IndexedRandom::choose_distinct or distinct_choices (multiple without replacement)
  • fn IndexedRandom::choose_weighted_distinct or distinct_weighted_choices (multiple without replacement)

That's a little different from the above but maybe more consistent with iterators (which still have choose, choose_multiple, ...).

Edit: added fn distinct_choices as an alternative.

@serhii-havrylov
Copy link
Author

serhii-havrylov commented Apr 10, 2025

I guess it's reasonable to stick with just one verb, "choose," rather than using two verbs like "choose" and "sample." You could use choose_distinct in contexts where sample is typically used in Python, but the downside is that choose_distinct is a bit longer than simply using sample. On the other hand, choose_distinct more clearly conveys that the sampling is done without replacement, compared to sample. However, my original suggestion was to use sample for iterators as well to emphasize, particularly for those with a Python background, that it implies sampling without replacement.

@dhardy
Copy link
Member

dhardy commented Apr 19, 2025

If we're talking grammar — 'choice' is a noun (or sometimes an adjective), not a verb. So it might make more sense to use choose_many / choose_many_weighted rather than copy Python.

We already have fn Distribution::sample to yield a single sample from some fixed distribution. We could give iterators and slices a sample method, perhaps? sample_uniform makes more sense conceptually, but is rather pedantic (@vks?).

So maybe we could do this:

  • IteratorRandom::choose(self, rng) is kept
  • IteratorRandom::choose_multiple(self, rng, amount)sample(self, rng, amount)
  • IndexedRandom::choose(self, rng) is kept
  • IndexedRandom::choose_multiple(self, rng, amount)sample(self, rng, amount)
  • IndexedRandom::choose_many(self, rng, amount) is added (sampling with replacement)
  • Also IteratorRandom::sample_fill, IndexedRandom::sample_array, sample_weighted and maybe IndexedRandom::choose_many_weighted

@dhardy dhardy linked a pull request Apr 19, 2025 that will close this issue
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
X-bug Type: bug report
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants