Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editorial: nested surrogate pairs? #969

Open
jmdyck opened this issue Aug 6, 2017 · 7 comments · May be fixed by #1727
Open

Editorial: nested surrogate pairs? #969

jmdyck opened this issue Aug 6, 2017 · 7 comments · May be fixed by #1727

Comments

@jmdyck
Copy link
Collaborator

jmdyck commented Aug 6, 2017

In the 'Patterns' clause, the Syntax section contains the paragraph:

Each \u TrailSurrogate for which the choice of associated u LeadSurrogate is ambiguous shall be associated with the nearest possible u LeadSurrogate that would otherwise have no corresponding \u TrailSurrogate.

This strikes me as very odd. The wording is exactly parallel to that for resolving the dangling else problem, suggesting that surrogate pairs nest somehow. I.e., given

\u LeadSurrogate \u LeadSurrogate \u TrailSurrogate \u TrailSurrogate

the last TrailSurrogate should be 'associated' with the first LeadSurrogate. But what would that even mean? (Note that the relevant semantics make no mention of "associated" or "corresponding" Surrogates.)

Now, granted, the grammar is formally ambiguous here (Alternative[+U] derives \u LeadSurrogate \u TrailSurrogate in two distinct ways), and so requires some disambiguation. But rather than the quoted paragraph, I'd prefer one of these approaches:

  • If some part of the source text matches \u LeadSurrogate \u TrailSurrogate, it must be parsed as a single Atom rather than two.
  • In cases of ambiguity, RegExpUnicodeEscapeSequence's first RHS is preferred over its second and third.
  • Change the second RHS to [+U] u LeadSurrogate [lookahead != \u TrailSurrogate]

(One objection to the last approach is that TrailSurrogate is not a terminal symbol, and so this is not a 'legal' lookahead sequence. However, this doesn't bother me much: we could enlarge the definition of lookahead sequences if we wanted.)

@ljharb
Copy link
Member

ljharb commented Jan 3, 2020

@mathiasbynens
Copy link
Member

mathiasbynens commented Jan 3, 2020

I agree the current paragraph is oddly phrased. I don't understand why it's written like that. Maybe @NorbertLindenberg or @allenwb remember?

If we're gonna change this, your first suggested approach seems simplest IMHO:

If some part of the source text matches \u LeadSurrogate \u TrailSurrogate, it must be parsed as a single Atom rather than two.

@jmdyck
Copy link
Collaborator Author

jmdyck commented Jan 3, 2020

Note that PR #1727 also addresses this issue.

@NorbertLindenberg
Copy link

I didn’t write this – it first showed up in the spec draft of 2015-04-03. The change was preceded by a discussion on “lonely surrogates and unicode regexps” starting here:
https://mail.mozilla.org/pipermail/es-discuss/2015-January/041281.html

@allenwb
Copy link
Member

allenwb commented Jan 3, 2020

I wrote it because without it (or something like it) the grammar is ambiguous. The reason the phrasing parallels the language used for "dangling else" clause is because at the grammar engineering level this is the same kind of ambiguity and copying that language seemed like a safe choice to make given this was happening in April 2015 immediately before release of the final ES6 draft. There wasn't a lot of time to study alternatives.

Note that the disambiguation really only concerns the first three alternative of

RegExpUnicodeEscapeSequence ::
u LeadSurrogate \u TrailSurrogate
u LeadSurrogate
u TailSurrogate
...

because that is where the ambiguity exists. The disambiguation statement does not lead to recognizing "nested surrogate pairs" in situations such as such as:

\u LeadSurrogate \u LeadSurrogate \u TrailSurrogate \u TrailSurrogate

  1. In processing the first lead surrogate there is no ambiguity between choosing between alternative 1 and alternative 2 because because the lead surrogate is not immediately followed by a tail surrogate. So, alternative 1 is matched.
  2. In processing the second surrogate there is an ambiguity. Either alternative 1 or alternative 2 could be used. So, the disambiguation rule applies and forces use of alternative 1, consuming the first tail surrogate.
  3. Finally, the second tail surrogate is processed unambiguously matched using alternative 3.

So, no "nesting" is recognized.

But, I agree that the current wording of the disambiguation statement isn't ideal for this situation. My suggestion for a replacement (I excluded the grammar parameters but the actual spec text should include them):

For ambiguous inputs where either the rule:
RegExpUnicodeEscapeSequence :: u LeadSurrogate \u TrailSurrogate
or the rule:
RegExpUnicodeEscapeSequence :: u LeadSurrogate
could be recognized the first rule always is the one that is recognized.

@gabezal
Copy link

gabezal commented Oct 13, 2020

There's another problem with that paragraph: it makes it sound like there can be any number of non-surrogate, and even non-\u characters between the lead and trail surrogates, which I don't think should be allowed.

@allenwb
Copy link
Member

allenwb commented Oct 13, 2020

Attached is the email thread that led to the addition of the paragraph in question.
Note that it had nothing to do with nesting of surrogate pairs. Its only intent was to disambiguate the grammar.

Re ES6 grammar abnormalities.pdf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
6 participants