diffArrays RangeError: Maximum call stack size exceeded #72
Comments
Also, ref: danger/danger-js#1019
I've implemented iterative/imperative versions of the Levenshtein algorithm (not here specifically), and AFAIK there's no technical / computer-science-y reason not to.

I'm not sure that's the right solution though. In a way, I'm punting the problem of what to do with really long arrays down to the user's (default?) choice of stack size. That seems rude, but IDK; what do you think it should do? If given two arrays of 100K+ elements, should it really embark on a potentially O(n²) algorithm? (Which, in the 100K case, implies up to 10 billion operations!)

Realistically, I think your custom diff is the "right" solution in your case: you had really long arrays, realized the default behavior of the library blew up in that case, and so you overrode that default behavior. For some users, the "right" solution might be to increase their engine's stack size (as in #10).

(Kinda just thinking out loud here...) What are we really trying to optimize? Maybe we say we want the smallest possible patch size. But what if that's potentially going to max out the CPU for an unbounded amount of time? There's a trade-off between how smart the library can be on whatever data you throw at it and it running down rabbit holes on pathological inputs. Where to draw the line? E.g.,

tl;dr: you bring up a good point, but it's not as simple as unfolding a recursive implementation. (Though that might be an incremental improvement; I'll have to think it over.)

P.S. I think your

P.P.S. I was initially worried you were running into the potential regression described in #15 (comment), which was recently released in v4, but I'm pretty sure that's not the case here, since your issue is blowing out the stack (more similar to #10), not just degraded performance on v4 vs. v3.
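For readers wondering what "iterative" looks like in practice, here is a minimal sketch of an edit-distance computation that uses loops and two rolling rows instead of recursion. The name `editDistance` is hypothetical and this is not the library's code. It only returns the distance; recovering the actual sequence of operations would need the full table (or back-pointers), which costs O(n·m) memory. It also does nothing about the O(n·m) running time discussed above; it only removes the dependence on call-stack depth.

```javascript
// Iterative edit distance between two arrays (hypothetical helper, for
// illustration only; not the library's implementation).
function editDistance(a, b) {
  // Before each outer iteration i, prev[j] is the distance between the
  // first i - 1 elements of a and the first j elements of b.
  let prev = new Array(b.length + 1);
  let curr = new Array(b.length + 1);
  for (let j = 0; j <= b.length; j++) prev[j] = j; // j insertions from empty

  for (let i = 1; i <= a.length; i++) {
    curr[0] = i; // i deletions to reach the empty prefix of b
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr[j] = Math.min(substitution, deletion, insertion);
    }
    // Swap the rows so curr becomes prev for the next iteration.
    [prev, curr] = [curr, prev];
  }
  return prev[b.length];
}
```

Because there is no recursion, the input length is limited by time and memory rather than by the engine's stack size.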
I think there are two different concerns we should separate out here 🙂 I see them as:
Because there is a simple – and backwards compatible – way for this library to keep the same exact behavior, but mitigate the worst-case risk for the end-user, the answer to concern 1 seems fairly straightforward to me! Concern 2 comes with a lot more (very fun) optimization and use case questions that are definitely deserving of a standalone RFC 😅 (Happy to tag team drafting some proposals)

P.S. Yup! Good catch.
Hello! First off, thank you for a great library.
For reference, a version of this issue has been discussed previously here with no resolution: #10 (comment)
In my application I'm comparing objects that may contain very large arrays – sometimes up to 20-40k strings. Because this implementation's Levenshtein distance algorithm is recursive, it is guaranteed to blow out the stack every time it encounters one of these large objects.
A recursive algorithm may be more elegant, but when we can't know the data that may be coming through the pipes, it's a footgun waiting to happen.
I'm able to work around it by providing a custom diff function to abort operating on large arrays like so:
Not ideal since these arrays often have a lot of overlap! Have you experimented any more with an iterative version of Levenshtein for this repo?
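The original snippet did not survive in this copy of the thread. As a rough sketch of the kind of guard described above (not the poster's actual code), a custom diff could replace oversized arrays wholesale and defer to the default behavior otherwise. The `(input, output, ptr)` signature, the "return `undefined` to fall back" convention, the name `largeArrayGuard`, and the threshold value are all assumptions here.

```javascript
// Assumed threshold; tune it for your data and your engine's stack size.
const MAX_ARRAY_LENGTH = 1000;

// Hypothetical custom diff hook. The (input, output, ptr) signature and the
// "return undefined to fall back to the default diff" convention are
// assumptions about how the library calls a custom diff function.
function largeArrayGuard(input, output, ptr) {
  const bothArrays = Array.isArray(input) && Array.isArray(output);
  if (bothArrays && Math.max(input.length, output.length) > MAX_ARRAY_LENGTH) {
    // Skip element-wise diffing and replace the whole array in one operation.
    return [{ op: 'replace', path: String(ptr), value: output }];
  }
  // Implicitly returns undefined so the default diff handles everything else.
}
```

The cost is the one noted above: even when two large arrays mostly overlap, the patch carries the entire new array.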