Support negation of entire pattern (not look-ahead/-behind) #441
Comments
If anything, I think this should really be done with an option like |
I think (?!) is a PCRE expression that never matches anything at all. Repurposing it with a different meaning seems like a mistake. Making up an RE2-specific syntax also seems like a mistake. Our answer from the beginning has been that if you want to incorporate boolean formulas on regexps such as NOT and AND, the place to do that is in a higher-level API above the regexp API. Also the pattern given for matching any text that does not contain "word" is incorrect, further driving home that the best way to write this is |
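As a rough illustration of what "a higher-level API above the regexp API" could look like, here is a minimal sketch. It uses Python's standard `re` module as a stand-in for RE2, and the `Filter` class, its `negate` field, and `accepts_all` are hypothetical names invented for this example, not part of any RE2 binding.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """A pattern plus a separate negate flag (hypothetical helper, not an RE2 API)."""
    pattern: str
    negate: bool = False

    def accepts(self, text: str) -> bool:
        # Plain positive search; the negation is applied outside the regex.
        found = re.search(self.pattern, text) is not None
        return found != self.negate


def accepts_all(filters, text):
    """AND of several filters, again composed above the regex layer."""
    return all(f.accepts(text) for f in filters)


lines = ["a word here", "nothing to see", "swordfish"]
no_word = Filter(r"word", negate=True)
kept = [s for s in lines if no_word.accepts(s)]
```

The pattern itself stays an ordinary positive pattern, so NOT and AND never have to be expressed in regex syntax. Note that `"swordfish"` is rejected here because it contains `word` as a substring; whether that is wanted depends on the pattern the caller supplies.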
Additionally, there would probably be API oddities to figure out. For example, what happens when you ask for submatches for one of these negated patterns? Or what happens when you ask for the match offsets? There are perhaps answers to those questions, but they're likely to be a bit ham-fisted I think. I overall agree that this should be handled at the application level. |
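To make the submatch and offset questions concrete, the sketch below shows one arbitrary set of answers a negated pattern could give. It is written against Python's `re` module rather than RE2, and `NegatedPattern` with its three methods is a hypothetical wrapper made up for illustration; the choices it makes (whole-input span, all groups unset) are assumptions, not agreed semantics.

```python
import re
from typing import Optional, Tuple


class NegatedPattern:
    """Hypothetical wrapper that 'matches' when the inner pattern does not."""

    def __init__(self, pattern: str):
        self._inner = re.compile(pattern)

    def is_match(self, text: str) -> bool:
        return self._inner.search(text) is None

    def span(self, text: str) -> Optional[Tuple[int, int]]:
        # Assumption: a successful negation reports the whole input,
        # because the inner pattern supplies no offsets to report.
        return (0, len(text)) if self.is_match(text) else None

    def groups(self, text: str) -> Optional[Tuple[Optional[str], ...]]:
        # Assumption: every capture group is unset on a negated match,
        # since the inner pattern never matched and captured nothing.
        if not self.is_match(text):
            return None
        return (None,) * self._inner.groups


p = NegatedPattern(r"(w)(ord)")
```

With this wrapper, `p.groups("abc")` yields a tuple of two `None` values even though the overall match succeeded, which is exactly the case a caller would have to be prepared for.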
I'm perfectly aware that the example pattern is not the way negation should be done, but I can't put Matching groups or offsets are not relevant. If one wants negation, the result would be the same as for a positive pattern that failed to match, hence the result would contain neither. |
And in the case when the pattern matches...? You're only considering the non-match case. |
Because negation implies a boolean operation, e.g. in a filter. If the result was positive, the match result would be the entire subject string with no subgroups, because the subgroups in the pattern would not have been found. |
That's exactly the problem... Today, you can assume that if You can't wave this away because you have to define the behavior of a negated regex on these other APIs. You can't just say "this only applies to a boolean match" because regex objects have more operations defined on them than boolean matches. There are more issues lurking here I think. That is to say, I don't think this feature request is fully specified. |
FWIW, Sulzmann and van Steenhoven addressed this issue of capturing groups within negation in A Flexible and Efficient ML Lexer Tool based on Extended Regular Expression Submatching section 4.1:
I had wondered whether RE2 could optimise away any capturing groups of a negated regular expression, but this code would prevent that. Moreover, like @BurntSushi said, the caller would need to be prepared for the regular expression to match, but with null capturing groups. Note that "change the caller to handle this" is strictly worse than "change the caller to handle the negation at the application level" because not doing the former breaks an entire class of existing usage. At this point, I'm not sure that there's any satisfactory path forwards. P.S. Hyperscan has supported logical combinations for the last five years. It's worth studying the design and implementation of that feature if you wish to pursue this matter further. Especially because I wouldn't be keen to plumb negation through the internals of |
I didn't. I said the match would be the input string, i.e. only the 0th group and no subgroups, not null groups.
Indeed, but if the pattern were If it's as difficult to achieve as @junyer outlines, then that's a pity. Having |
This doesn't address the assumption that callers can make today about all possible regex patterns. It fails to hold with this new negation operator. Bottom line is that this feature request needs far more detail to be fully specified. And I agree that this is not easy to implement and likely has more as-yet-unseen difficulties beyond what has been pointed out so far.
You are though, because in some of the regex engines (like the pikevm) the 0th group is treated no differently than the other groups.
The limits are exactly the point of using an engine like RE2 in the first place. |
Let go of the assumption that matching groups should appear in the result. The result would be that of a positive pattern, e.g.
Which I acknowledged already. There is such a thing as too limiting. I'm seeing RE2 in all sorts of places now, crippling simple tasks. |
You quoted me out of context unfortunately. It's not just my assumption. |
Why should any caller be able to look for parentheses in a pattern, without actually parsing it, then assume all possible results will contain subgroups? None of what you've said attests that these assumptions are correct. Patterns like It's like a person asking "Can you look in this barrel to see if there are no apples?" and having the searcher's answer be "I cannot tell you because I don't know exactly where to tell you they aren't." |
(This may be out of context.) E.g.:
It was originally proposed in Tanaka Akira's paper [PDF] (which was written in Japanese). |
Like I said back in August, RE2 won't unilaterally add syntax. Navigating the Go proposal review process may not be a fruitful exercise, but the discussion here has run its course and, as such, I assure you that continuing will not be a fruitful exercise. :) |
Many services, including Google's own, allow input of an RE2 pattern to perform some task, but only as a positive match. They frequently have no application-level provision to negate the pattern, and aren't interested in feature change requests. A pattern sequence could be provided to achieve this, without the danger assertions pose, by specifying one that would otherwise cause a syntax error. Beginning a pattern with
(?!)
could work as it is similar to PCRE's negative look-ahead assertion grouping but contains nothing. The matching process could consume the sequence as a flag, perform normal, linear matching with the remainder of the pattern, then apply the flag to negate the final result. Apache does something similar with the
!
pattern prefix, but this prefix would not be backwards-compatible with RE2. As it is now, the only way to do a negative match, when the user has no control over entire pattern negation, is to write monstrosities like this:
This example prevents matching
word
but still suffers from the Scunthorpe problem.
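The prefix-as-flag semantics proposed in this issue (consume a leading `(?!)`, match the remainder normally, then negate the boolean result) can be sketched outside the regex engine. This is only an illustration using Python's `re` module, not RE2, and `compile_with_negation` is a hypothetical helper name. The prefix is intercepted as a plain string before compilation, so the engine never sees it.

```python
import re

NEGATE_PREFIX = "(?!)"  # the flag sequence proposed in the issue text


def compile_with_negation(pattern: str):
    """Return a boolean tester; a leading (?!) flips the result (sketch only)."""
    negate = pattern.startswith(NEGATE_PREFIX)
    if negate:
        # Consume the flag so only the remainder reaches the regex engine.
        pattern = pattern[len(NEGATE_PREFIX):]
    compiled = re.compile(pattern)

    def test(text: str) -> bool:
        found = compiled.search(text) is not None
        return found != negate

    return test


has_word = compile_with_negation("word")
lacks_word = compile_with_negation("(?!)word")
```

This version deliberately exposes only a boolean test. It says nothing about what submatches or offsets a negated match should report, which is the part the maintainers consider unspecified.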