-
Notifications
You must be signed in to change notification settings - Fork 86
Improved compatibility for named capturing groups for ECMA mode #96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it'd be good to incorporate tests related to this change from https://github.com/tc39/test262/tree/master/test/built-ins/RegExp/named-groups
There's a lot of basic stuff in there that's already covered and can be skipped, but lots of the unicode and duplicate name stuff seems to hit lots of good edge cases.
@@ -853,6 +853,89 @@ func TestECMANamedGroup(t *testing.T) { | |||
} | |||
} | |||
|
|||
func TestECMAGroupNameUnicode(t *testing.T) { | |||
t.Run("unicode-escape", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is an ECMA-script only change can we get an inverse test (i.e. make sure this errors without the ECMAScript flag)?
regexp_test.go
Outdated
}) | ||
|
||
t.Run("duplicate-name", func(t *testing.T) { | ||
_, err := Compile(`(?<a>a)(?<a>a)`, ECMAScript) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets make sure the non-ECMAScript behavior is unchanged
…nges do not affect non-ECMAScript mode.
I agree it would be good, unfortunately it's no trivial task, it would mean going through them one by one and manually translating each... |
FWIW, these tests are part of goja tests, and most of them pass (the ones not explicitly excluded). |
@dop251 the goja tests pass with the current regexp2 code or after this PR? Can we include more of the test262 regexp tests in goja with this PR to test it? Just doing a pass through I think it'd be good to add these test patterns to regexp2 with their expected groups/names checks:
Specifically looking at repeated names in alternations, backreferences with dupe names, and valid and invalid funky group names. |
Repeated names in alternations is a relatively new feature (https://github.com/tc39/proposal-duplicate-named-capturing-groups) and is not implemented in this PR. If you know a way to do this, please let me know. As for the valid and invalid characters, I'll have a look an try to add some of the tests. |
Repeated names in alternations works today in |
The problem is it works regardless of whether they are in different alternatives or not, i.e. The ideal case of course would be implementing this feature correctly, however I couldn't find an easy way to do that. Moreover, if I just remove the restriction the group numbering will break (see https://github.com/dlclark/regexp2/pull/96/files#diff-c4ec48a56dcc27cb25a6d76ceaef555b17e75c7c398c5e31b73b29c462b3d908R973) |
I believe though that this example should not work even after the proposal is accepted, because there So I think there's these problems:
I was thinking that if we are restricted by backward compatibility, we could create two new flags for ECMAScript, e.g.:
const (
IgnoreCase RegexOptions = 0x0001 // "i"
Multiline = 0x0002 // "m"
ExplicitCapture = 0x0004 // "n"
Compiled = 0x0008 // "c"
Singleline = 0x0010 // "s"
IgnorePatternWhitespace = 0x0020 // "x"
RightToLeft = 0x0040 // "r"
Debug = 0x0080 // "d"
ECMAScriptBasic = 0x0100 // "e"
RE2 = 0x0200 // RE2 compat mode
Unicode = 0x0400 // "u"
ECMAScriptExtended = 0x0800
ECMAScript = ECMAScriptBasic
) Then we would have more freedom to make changes for |
I agree with the concept of a new Option for a different type of EcmaScript that is documented something along the lines of: “stricter adherence to the EcmaScript regex standard, including removal of features that regexp2 supports. Prioritizes EcmaScript standard adherence over compatibility changes and future changes may break backward compatibility if they are in line with the EcmaScript standard.” That way developers can opt-in to future possible breaking changes like this. We would also need to document the differences between these two EcmaScript-style modes. EcmaScriptBasic and EcmaScriptStrict? |
I think this will just cause confusion. There is only one ECMAScript standard and making sure that all previous regexes continue to work should not be an absolute goal, in my opinion. If they used to work due to a bug or a missing feature I think it's ok if they break. People can always keep an older version of regexp2 until they fix their regexes. |
That would be great 👍🏻
I thought about using "strict" as well but then I realized it will name-clash with https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Strict_mode, and I don't believe strict mode in JS changes anything regexp-related, so it would be a bit confusing. |
It will also create confusion as to which version should which change go into. For example, group names starting with numbers were not allowed in ECMAScript from the very start. Should it then go into ECMAScriptBasic or ECMAScriptExtended? If the latter, then ECMAScriptBasic isn't actually ECMAScript-conformant. |
The reason for the separate Option isn’t because of upcoming EcmaScript changes. It’s purely about change management of the regexp2 package. The current EcmaScript option was not designed to be fully compatible with EcmaScript, it is more compatible based on specific behaviors (some char classes, octal parsing, and some back reference stuff), but it’s not compliant. It’s a mish-mash of features from EcmaScript regex and C# regex and was inherited from the original C# version of the engine. A flag that means “strict adherence to EcmaScript” is a new concept that I’m willing to add, but I’m not willing to re-purpose the existing flag to mean this. So in this example group names not allowing numbers would go into Strict mode. Also, I agree that naming it Strict could be confusing in this context—maybe something like EcmaScriptCompliant is more accurate for the focus of the new mode anyway. |
Hi,
This PR introduces a few changes to improve compatibility with the current ECMAScript specification in terms of handling of named capture groups:
(.)(?<a>.)(.)
,a
will be the group number 2).The changes only affect ECMAScript mode. i'm not 100% sure about the correctness of the group numbering changes, so I would appreciate a careful review.
Thanks!