Improved compatibility for named capturing groups for ECMA mode #96

dop251 · 2025-04-22T22:58:31Z

Hi,

This PR introduces a few changes to improve compatibility with the current ECMAScript specification in terms of handling of named capture groups:

The name of the group must be identifier-like, inlcuding Unicode characters (https://tc39.es/ecma262/#sec-static-semantics-capturinggroupname).
The group name may contain Unicode escaped characters (i.e. \uXXXX or \u{....}).
Named groups are assigned numbers in the order of appearance from left to right (i.e. in (.)(?<a>.)(.), a will be the group number 2).
Duplicate group names are not allowed.
Unnamed groups are not automatically given names that are decimal representations of the group numbers (i.e. their group.Name will be empty).

The changes only affect ECMAScript mode. i'm not 100% sure about the correctness of the group numbering changes, so I would appreciate a careful review.

Thanks!

dlclark

I wonder if it'd be good to incorporate tests related to this change from https://github.com/tc39/test262/tree/master/test/built-ins/RegExp/named-groups

There's a lot of basic stuff in there that's already covered and can be skipped, but lots of the unicode and duplicate name stuff seems to hit lots of good edge cases.

dlclark · 2025-04-23T02:07:34Z

regexp_test.go

@@ -853,6 +853,89 @@ func TestECMANamedGroup(t *testing.T) {
 	}
 }

+func TestECMAGroupNameUnicode(t *testing.T) {
+	t.Run("unicode-escape", func(t *testing.T) {


since this is an ECMA-script only change can we get an inverse test (i.e. make sure this errors without the ECMAScript flag)?

dlclark · 2025-04-23T02:08:37Z

regexp_test.go

+	})
+
+	t.Run("duplicate-name", func(t *testing.T) {
+		_, err := Compile(`(?<a>a)(?<a>a)`, ECMAScript)


Lets make sure the non-ECMAScript behavior is unchanged

…nges do not affect non-ECMAScript mode.

dop251 · 2025-04-23T20:14:06Z

I agree it would be good, unfortunately it's no trivial task, it would mean going through them one by one and manually translating each...

dop251 · 2025-04-23T20:15:33Z

FWIW, these tests are part of goja tests, and most of them pass (the ones not explicitly excluded).

dlclark · 2025-04-24T02:44:07Z

@dop251 the goja tests pass with the current regexp2 code or after this PR? Can we include more of the test262 regexp tests in goja with this PR to test it?

Just doing a pass through I think it'd be good to add these test patterns to regexp2 with their expected groups/names checks:

assert.compareArray(/(?<x>b)|(?<x>a)/.exec("bab"), ["b", "b", undefined]);
assert.compareArray(/(?:(?<x>a)|(?<x>b))\k<x>/.exec("aa"), ["aa", "a", undefined]);
assert.compareArray(["a", "a"], "bab".match(/(?<$>a)/));
assert.compareArray(["a", "a"], "bab".match(/(?<_>a)/));
assert.throws(SyntaxError, function() {
    return new RegExp("(?<🦊>fox)", "u");
});

Specifically looking at repeated names in alternations, backreferences with dupe names, and valid and invalid funky group names.

dop251 · 2025-04-25T21:04:39Z

Repeated names in alternations is a relatively new feature (https://github.com/tc39/proposal-duplicate-named-capturing-groups) and is not implemented in this PR. If you know a way to do this, please let me know.

As for the valid and invalid characters, I'll have a look an try to add some of the tests.

dlclark · 2025-04-25T22:30:17Z

Repeated names in alternations works today in regexp2 (with and without the EcmaScript option), so I want to preserve backward compatibility with existing code and patterns out there.

dop251 · 2025-04-25T23:18:35Z

The problem is it works regardless of whether they are in different alternatives or not, i.e. (?<a>a)(?<a>a) works, but it shouldn't until the proposal I've mentioned is accepted. In the tc39 tests it's marked with a specific feature flag (regexp-duplicate-named-groups).

The ideal case of course would be implementing this feature correctly, however I couldn't find an easy way to do that.

Moreover, if I just remove the restriction the group numbering will break (see https://github.com/dlclark/regexp2/pull/96/files#diff-c4ec48a56dcc27cb25a6d76ceaef555b17e75c7c398c5e31b73b29c462b3d908R973)

federicotdn · 2025-04-29T10:15:54Z

(?a)(?a) works, but it shouldn't until the proposal I've mentioned is accepted.

I believe though that this example should not work even after the proposal is accepted, because there a is declared twice in the same alternative.

So I think there's these problems:

(?<a>a)(?<a>a) works now in regexp2 w/ ECMAScript flag, but it shouldn't, even after the proposal is accepted.
(?<a>a)|(?<a>a) works now in regexp2 w/ ECMAScript flag, but it shouldn't.
(?<a>a)|(?<a>a) will need to work in regexp2 w/ECMASCript flag, once the proposal is accepted.

I was thinking that if we are restricted by backward compatibility, we could create two new flags for ECMAScript, e.g.:

ECMAScriptBasic
ECMAScriptExtended

ECMAScriptExtended would include everything from Basic.
We would then create an alias for ECMAScriptBasic called just ECMAScript, to maintain older code working:

const (
	IgnoreCase              RegexOptions = 0x0001 // "i"
	Multiline                            = 0x0002 // "m"
	ExplicitCapture                      = 0x0004 // "n"
	Compiled                             = 0x0008 // "c"
	Singleline                           = 0x0010 // "s"
	IgnorePatternWhitespace              = 0x0020 // "x"
	RightToLeft                          = 0x0040 // "r"
	Debug                                = 0x0080 // "d"
	ECMAScriptBasic                      = 0x0100 // "e"
	RE2                                  = 0x0200 // RE2 compat mode
	Unicode                              = 0x0400 // "u"
	ECMAScriptExtended                   = 0x0800
	ECMAScript                           = ECMAScriptBasic
)

Then we would have more freedom to make changes for ECMAScriptExtended.

dlclark · 2025-04-29T16:50:59Z

I agree with the concept of a new Option for a different type of EcmaScript that is documented something along the lines of: “stricter adherence to the EcmaScript regex standard, including removal of features that regexp2 supports. Prioritizes EcmaScript standard adherence over compatibility changes and future changes may break backward compatibility if they are in line with the EcmaScript standard.”

That way developers can opt-in to future possible breaking changes like this.

We would also need to document the differences between these two EcmaScript-style modes.

EcmaScriptBasic and EcmaScriptStrict?

dop251 · 2025-04-29T18:10:08Z

I think this will just cause confusion. There is only one ECMAScript standard and making sure that all previous regexes continue to work should not be an absolute goal, in my opinion. If they used to work due to a bug or a missing feature I think it's ok if they break. People can always keep an older version of regexp2 until they fix their regexes.

federicotdn · 2025-04-29T18:10:40Z

I agree with the concept of a new Option for a different type of EcmaScript that is documented something along the lines of: “stricter adherence to the EcmaScript regex standard, including removal of features that regexp2 supports. Prioritizes EcmaScript standard adherence over compatibility changes and future changes may break backward compatibility if they are in line with the EcmaScript standard.”

That would be great 👍🏻

EcmaScriptBasic and EcmaScriptStrict?

I thought about using "strict" as well but then I realized it will name-clash with https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Strict_mode, and I don't believe strict mode in JS changes anything regexp-related, so it would be a bit confusing.

dop251 · 2025-04-29T18:17:10Z

It will also create confusion as to which version should which change go into. For example, group names starting with numbers were not allowed in ECMAScript from the very start. Should it then go into ECMAScriptBasic or ECMAScriptExtended? If the latter, then ECMAScriptBasic isn't actually ECMAScript-conformant.

dlclark · 2025-04-29T19:27:21Z

The reason for the separate Option isn’t because of upcoming EcmaScript changes. It’s purely about change management of the regexp2 package.

The current EcmaScript option was not designed to be fully compatible with EcmaScript, it is more compatible based on specific behaviors (some char classes, octal parsing, and some back reference stuff), but it’s not compliant. It’s a mish-mash of features from EcmaScript regex and C# regex and was inherited from the original C# version of the engine. A flag that means “strict adherence to EcmaScript” is a new concept that I’m willing to add, but I’m not willing to re-purpose the existing flag to mean this.

So in this example group names not allowing numbers would go into Strict mode.

Also, I agree that naming it Strict could be confusing in this context—maybe something like EcmaScriptCompliant is more accurate for the focus of the new mode anyway.

Improved compatibility for named capture groups for ECMA mode

fb16ed8

dlclark requested changes Apr 23, 2025

View reviewed changes

Do not name group 0 in ECMAScript mode, added tests to ensure the cha…

93eb3e7

…nges do not affect non-ECMAScript mode.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved compatibility for named capturing groups for ECMA mode #96

Improved compatibility for named capturing groups for ECMA mode #96

dop251 commented Apr 22, 2025

dlclark left a comment •

edited

Loading

dlclark Apr 23, 2025

dlclark Apr 23, 2025

dop251 commented Apr 23, 2025

dop251 commented Apr 23, 2025

dlclark commented Apr 24, 2025 •

edited

Loading

dop251 commented Apr 25, 2025

dlclark commented Apr 25, 2025

dop251 commented Apr 25, 2025

federicotdn commented Apr 29, 2025

dlclark commented Apr 29, 2025 •

edited

Loading

dop251 commented Apr 29, 2025

federicotdn commented Apr 29, 2025

dop251 commented Apr 29, 2025

dlclark commented Apr 29, 2025

Improved compatibility for named capturing groups for ECMA mode #96

Are you sure you want to change the base?

Improved compatibility for named capturing groups for ECMA mode #96

Conversation

dop251 commented Apr 22, 2025

dlclark left a comment • edited Loading

Choose a reason for hiding this comment

dlclark Apr 23, 2025

Choose a reason for hiding this comment

dlclark Apr 23, 2025

Choose a reason for hiding this comment

dop251 commented Apr 23, 2025

dop251 commented Apr 23, 2025

dlclark commented Apr 24, 2025 • edited Loading

dop251 commented Apr 25, 2025

dlclark commented Apr 25, 2025

dop251 commented Apr 25, 2025

federicotdn commented Apr 29, 2025

dlclark commented Apr 29, 2025 • edited Loading

dop251 commented Apr 29, 2025

federicotdn commented Apr 29, 2025

dop251 commented Apr 29, 2025

dlclark commented Apr 29, 2025

dlclark left a comment •

edited

Loading

dlclark commented Apr 24, 2025 •

edited

Loading

dlclark commented Apr 29, 2025 •

edited

Loading