Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NamePermutationIterator #143

Merged
merged 8 commits into from
Jul 31, 2023
Merged

NamePermutationIterator #143

merged 8 commits into from
Jul 31, 2023

Conversation

ToBinio
Copy link
Contributor

@ToBinio ToBinio commented Jul 31, 2023

This PR adds the in #75 suggested NamesIter

open Questions:

  • how to handle too many -_?
    In case of more than 32 separators, the programs would just crash with an attempt to multiply with overflow.
    I don't think handling it as a Result<> would be reasonable as this case is not realistic and using Result<> would make the API worse (at least in my opinion)
  • should the iter return &str?

note:
I am not really happy with the code but every attempt of improving it just made it worse...

@Byron
Copy link
Collaborator

Byron commented Jul 31, 2023

Great work, thank you!

I have adjusted the structure and streamlined the tests a little.

While playing with the tests I noticed that it prefers _ over -. The goal here would be to optimize the output of the iterator to have the highest possible probability of producing a name that actually exists. So if _ would be most common, that would be the way to go but what if they are not?

I happen to have a database available with all available 121k crate names. The tally of - and _ is as follows:

  • - = 75260
  • _ = 31172

Thus - should be preferred when creating the names - even though going through 32k permutations currently takes 2ms, very little compared to the time that is burned when accessing the index. This makes finding good names fast paramount, so I inverted the logic to prefer -.

To understand better of what we are dealing with in terms of max-permutations, I counted the highest amount of - OR _ of all 121k crates, with the result being 8. Thus I reduced the total amount of possible permutations and made sure we fail gracefully.

Please find the original data here if you want to poke at it yourself (importing to SQLite while asking ChatGPT for queries works neatly): crate.csv.zip

Further, I think the first name produced should always be the input itself - I gave it a shot, please take a look. Maybe you have a smarter way of doing it - this way is quite acceptable performance wise as it makes the iterator 10% slower, and that's in the cards after making it nearly twice as fast.

should the iter return &str?

It's can't be an iterator anymore then, as lending iterators aren't supported with Rust's current borrow checker, unfortunately. Otherwise, definitely, but without that I think it's preferable to have it allocate for better usability.

The alternative would be to hand in &mut String, but I think it's not worth the reduced usability - 1ms for 32k permutations is good enough given that making that many queries to the index will cost many many seconds.

@Byron Byron merged commit abe5d70 into frewsxcv:master Jul 31, 2023
@Byron
Copy link
Collaborator

Byron commented Jul 31, 2023

If you want, you could contribute an example for Names that shows how to use it to implement a fuzzy crate_ lookup while limiting the amount of lookups to some number, just to show how much control the caller now has.

@ToBinio
Copy link
Contributor Author

ToBinio commented Jul 31, 2023

So if _ would be most common, that would be the way to go but what if they are not?

I thought about that and then quickly forgot it again...
and thanks for the interesting data

If you want, you could contribute an example for Names that shows how to use it to implement a fuzzy crate_ lookup while limiting the amount of lookups to some number, just to show how much control the caller now has.

do you mean simply adding a .take()?

another possible optimation could be to prefer all - and _, so it firstly returns the input then all _, and third all - and only then loop through all special combinations.
this should be doable by going through the solutions backwards.
starting at 0 then jumping back to max_count
but this would once again play with the -_ preference

@Byron
Copy link
Collaborator

Byron commented Jul 31, 2023

do you mean simply adding a .take()?

Yes, that would do, but that's just one of many ideas to indicate the flexibility that comes with it. Also to clarify, I meant that the contribution could either be a doc-test or maybe even an adjustment of one of the existing examples as long as there is user input.

another possible optimation could be to prefer all - and _, so it firstly returns the input then all _, and third all - and only then loop through all special combinations.

Definitely! That would be even better to have as these would be the most likely anyway. Ideally, we would then have the sequence input, all-hyphens, all_underscores, …all remaining permutations…, with built-in deduplication. I'd prefer a simple implementation over the most optimized one.


Thinking about it, you could also implement the count() method so that it computes the count in advance and avoids actual iteration. This should be documented so people know that they can easily implement their own limits prior to hitting them without paying for actually generating names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants