Skip to content

Improved compatibility for named capturing groups for ECMA mode #96

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dop251
Copy link
Contributor

@dop251 dop251 commented Apr 22, 2025

Hi,

This PR introduces a few changes to improve compatibility with the current ECMAScript specification in terms of handling of named capture groups:

  • The name of the group must be identifier-like, inlcuding Unicode characters (https://tc39.es/ecma262/#sec-static-semantics-capturinggroupname).
  • The group name may contain Unicode escaped characters (i.e. \uXXXX or \u{....}).
  • Named groups are assigned numbers in the order of appearance from left to right (i.e. in (.)(?<a>.)(.), a will be the group number 2).
  • Duplicate group names are not allowed.
  • Unnamed groups are not automatically given names that are decimal representations of the group numbers (i.e. their group.Name will be empty).

The changes only affect ECMAScript mode. i'm not 100% sure about the correctness of the group numbering changes, so I would appreciate a careful review.

Thanks!

Copy link
Owner

@dlclark dlclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it'd be good to incorporate tests related to this change from https://github.com/tc39/test262/tree/master/test/built-ins/RegExp/named-groups

There's a lot of basic stuff in there that's already covered and can be skipped, but lots of the unicode and duplicate name stuff seems to hit lots of good edge cases.

@@ -853,6 +853,89 @@ func TestECMANamedGroup(t *testing.T) {
}
}

func TestECMAGroupNameUnicode(t *testing.T) {
t.Run("unicode-escape", func(t *testing.T) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is an ECMA-script only change can we get an inverse test (i.e. make sure this errors without the ECMAScript flag)?

regexp_test.go Outdated
})

t.Run("duplicate-name", func(t *testing.T) {
_, err := Compile(`(?<a>a)(?<a>a)`, ECMAScript)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets make sure the non-ECMAScript behavior is unchanged

@dop251
Copy link
Contributor Author

dop251 commented Apr 23, 2025

I agree it would be good, unfortunately it's no trivial task, it would mean going through them one by one and manually translating each...

@dop251
Copy link
Contributor Author

dop251 commented Apr 23, 2025

FWIW, these tests are part of goja tests, and most of them pass (the ones not explicitly excluded).

@dlclark
Copy link
Owner

dlclark commented Apr 24, 2025

@dop251 the goja tests pass with the current regexp2 code or after this PR? Can we include more of the test262 regexp tests in goja with this PR to test it?

Just doing a pass through I think it'd be good to add these test patterns to regexp2 with their expected groups/names checks:

assert.compareArray(/(?<x>b)|(?<x>a)/.exec("bab"), ["b", "b", undefined]);
assert.compareArray(/(?:(?<x>a)|(?<x>b))\k<x>/.exec("aa"), ["aa", "a", undefined]);
assert.compareArray(["a", "a"], "bab".match(/(?<$>a)/));
assert.compareArray(["a", "a"], "bab".match(/(?<_>a)/));
assert.throws(SyntaxError, function() {
    return new RegExp("(?<🦊>fox)", "u");
});

Specifically looking at repeated names in alternations, backreferences with dupe names, and valid and invalid funky group names.

@dop251
Copy link
Contributor Author

dop251 commented Apr 25, 2025

Repeated names in alternations is a relatively new feature (https://github.com/tc39/proposal-duplicate-named-capturing-groups) and is not implemented in this PR. If you know a way to do this, please let me know.

As for the valid and invalid characters, I'll have a look an try to add some of the tests.

@dlclark
Copy link
Owner

dlclark commented Apr 25, 2025

Repeated names in alternations works today in regexp2 (with and without the EcmaScript option), so I want to preserve backward compatibility with existing code and patterns out there.

@dop251
Copy link
Contributor Author

dop251 commented Apr 25, 2025

The problem is it works regardless of whether they are in different alternatives or not, i.e. (?<a>a)(?<a>a) works, but it shouldn't until the proposal I've mentioned is accepted. In the tc39 tests it's marked with a specific feature flag (regexp-duplicate-named-groups).

The ideal case of course would be implementing this feature correctly, however I couldn't find an easy way to do that.

Moreover, if I just remove the restriction the group numbering will break (see https://github.com/dlclark/regexp2/pull/96/files#diff-c4ec48a56dcc27cb25a6d76ceaef555b17e75c7c398c5e31b73b29c462b3d908R973)

@federicotdn
Copy link
Contributor

(?a)(?a) works, but it shouldn't until the proposal I've mentioned is accepted.

I believe though that this example should not work even after the proposal is accepted, because there a is declared twice in the same alternative.

So I think there's these problems:

  • (?<a>a)(?<a>a) works now in regexp2 w/ ECMAScript flag, but it shouldn't, even after the proposal is accepted.
  • (?<a>a)|(?<a>a) works now in regexp2 w/ ECMAScript flag, but it shouldn't.
  • (?<a>a)|(?<a>a) will need to work in regexp2 w/ECMASCript flag, once the proposal is accepted.

I was thinking that if we are restricted by backward compatibility, we could create two new flags for ECMAScript, e.g.:

  • ECMAScriptBasic
  • ECMAScriptExtended

ECMAScriptExtended would include everything from Basic.
We would then create an alias for ECMAScriptBasic called just ECMAScript, to maintain older code working:

const (
	IgnoreCase              RegexOptions = 0x0001 // "i"
	Multiline                            = 0x0002 // "m"
	ExplicitCapture                      = 0x0004 // "n"
	Compiled                             = 0x0008 // "c"
	Singleline                           = 0x0010 // "s"
	IgnorePatternWhitespace              = 0x0020 // "x"
	RightToLeft                          = 0x0040 // "r"
	Debug                                = 0x0080 // "d"
	ECMAScriptBasic                      = 0x0100 // "e"
	RE2                                  = 0x0200 // RE2 compat mode
	Unicode                              = 0x0400 // "u"
	ECMAScriptExtended                   = 0x0800
	ECMAScript                           = ECMAScriptBasic
)

Then we would have more freedom to make changes for ECMAScriptExtended.

@dlclark
Copy link
Owner

dlclark commented Apr 29, 2025

I agree with the concept of a new Option for a different type of EcmaScript that is documented something along the lines of: “stricter adherence to the EcmaScript regex standard, including removal of features that regexp2 supports. Prioritizes EcmaScript standard adherence over compatibility changes and future changes may break backward compatibility if they are in line with the EcmaScript standard.”

That way developers can opt-in to future possible breaking changes like this.

We would also need to document the differences between these two EcmaScript-style modes.

EcmaScriptBasic and EcmaScriptStrict?

@dop251
Copy link
Contributor Author

dop251 commented Apr 29, 2025

I think this will just cause confusion. There is only one ECMAScript standard and making sure that all previous regexes continue to work should not be an absolute goal, in my opinion. If they used to work due to a bug or a missing feature I think it's ok if they break. People can always keep an older version of regexp2 until they fix their regexes.

@federicotdn
Copy link
Contributor

I agree with the concept of a new Option for a different type of EcmaScript that is documented something along the lines of: “stricter adherence to the EcmaScript regex standard, including removal of features that regexp2 supports. Prioritizes EcmaScript standard adherence over compatibility changes and future changes may break backward compatibility if they are in line with the EcmaScript standard.”

That would be great 👍🏻

EcmaScriptBasic and EcmaScriptStrict?

I thought about using "strict" as well but then I realized it will name-clash with https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Strict_mode, and I don't believe strict mode in JS changes anything regexp-related, so it would be a bit confusing.

@dop251
Copy link
Contributor Author

dop251 commented Apr 29, 2025

It will also create confusion as to which version should which change go into. For example, group names starting with numbers were not allowed in ECMAScript from the very start. Should it then go into ECMAScriptBasic or ECMAScriptExtended? If the latter, then ECMAScriptBasic isn't actually ECMAScript-conformant.

@dlclark
Copy link
Owner

dlclark commented Apr 29, 2025

The reason for the separate Option isn’t because of upcoming EcmaScript changes. It’s purely about change management of the regexp2 package.

The current EcmaScript option was not designed to be fully compatible with EcmaScript, it is more compatible based on specific behaviors (some char classes, octal parsing, and some back reference stuff), but it’s not compliant. It’s a mish-mash of features from EcmaScript regex and C# regex and was inherited from the original C# version of the engine. A flag that means “strict adherence to EcmaScript” is a new concept that I’m willing to add, but I’m not willing to re-purpose the existing flag to mean this.

So in this example group names not allowing numbers would go into Strict mode.

Also, I agree that naming it Strict could be confusing in this context—maybe something like EcmaScriptCompliant is more accurate for the focus of the new mode anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants