Editorial: Formally disambiguate the non-Annex-B grammar #1727

gibson042 · 2019-10-08T01:11:31Z

@waldemarhorwat recently expressed objections to moving the Annex B regular expression syntax into the main grammar because of its order-dependent productions. However, I found an example of the same already in the main grammar, for surrogate pairs in codepoint-based "Unicode" regular expressions. This PR inserts a lookahead assertion to correct that ambiguity, and also applies the same treatment to replace some NumericLiteral prose with a formal assertion.

Closes #969.

jmdyck · 2019-10-08T17:35:52Z

spec.html

-          [+U] `u` TrailSurrogate
-          [+U] `u` NonSurrogate
+          [+U] RegExpUnicodeSurrogatePair
+          [+U] [lookahead &notin; RegExpUnicodeSurrogatePair] `u` Hex4Digits


RegExpUnicodeSurrogatePair generates a set of terminal sequences, each of length 11. This doesn't fit with the current definition of lookahead-restrictions.
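To make the length claim concrete, here is a minimal sketch that approximates the terminal sequences RegExpUnicodeSurrogatePair can generate (`u` LeadSurrogate `\u` TrailSurrogate) with an ordinary regular expression. The name `SURROGATE_PAIR` and the sample string are illustrative assumptions, not part of the spec or the PR.

```javascript
// Approximation (assumption): LeadSurrogate is four hex digits in D800..DBFF,
// TrailSurrogate is four hex digits in DC00..DFFF.
const SURROGATE_PAIR =
  /^u[dD][89abAB][0-9a-fA-F]{2}\\u[dD][c-fC-F][0-9a-fA-F]{2}$/;

// The source text that follows the initial backslash of \uD834\uDF06.
const sample = "uD834\\uDF06";

console.log(SURROGATE_PAIR.test(sample)); // true
console.log([...sample].length);          // 11
```

Every string accepted by this pattern has the same shape: `u`, four hex digits, `\u`, four hex digits, for 1 + 4 + 2 + 4 = 11 code points.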

bakkot · 2019-10-08T17:42:17Z

spec.html

-          [+U] `u` TrailSurrogate
-          [+U] `u` NonSurrogate
+          [+U] RegExpUnicodeSurrogatePair
+          [+U] [lookahead &notin; RegExpUnicodeSurrogatePair] `u` Hex4Digits


lookahead ∉ RegExpUnicodeSurrogatePair seems a little suspicious to me, since RegExpUnicodeSurrogatePair describes a set of sequences of terminals rather than just a set of terminals. As far as I know the only other place where there are sequences of length greater than 1 in a lookahead-restriction-set is the async function restriction in ExpressionStatement, and there the sequence is of length precisely two (and also it's written out more explicitly).

Could this instead be written as

[+U] `u` LeadSurrogate [lookahead ≠ `\u` TrailSurrogate]
[+U] `u` LeadSurrogate `\u` TrailSurrogate
[+U] `u` TrailSurrogate
[+U] `u` NonSurrogate

? That feels clearer to me, if it's equivalent. (And if it's not, I'm confused.)

That is invalid (see below), but even with refactoring would still be a lookahead of six code points. And it's not more clear to me, but I would be willing to switch to it if there's consensus.

Currently, in the ES grammars, a lookahead-constraint either:
(a) occurs at the end of a right-hand-side, or
(b) occurs before a nonterminal, where that nonterminal derives phrases that begin with the disallowed sequences (i.e. the constraint is 'scoped' to that nonterminal).

So the right-hand-side:

[lookahead ∉ RegExpUnicodeSurrogatePair] `u` Hex4Digits

would be quite unusual, in that the lookahead-constraint has to "look through" the terminal `u` and the nonterminal Hex4Digits, and then look past them to a potential `u` Hex4Digits following. For this reason, I prefer @bakkot's suggestion:

LeadSurrogate [lookahead != `\u` TrailSurrogate]

as it eliminates the "look through" and winds up with a fairly standard end-of-RHS constraint. (But yes, the nature of its lookahead-sequence would require tweaking 5.1.5 Grammar Notation.)

Also, I think I'd prefer it to come after the Lead+Trail right-hand-side:

[+U] `u` LeadSurrogate `\u` TrailSurrogate
[+U] `u` LeadSurrogate [lookahead != `\u` TrailSurrogate]

(The other thing I like about this solution is that just those two lines make it really obvious why the lookahead-constraint is needed.)
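As a rough illustration of what the end-of-RHS constraint asks of a recognizer, the following sketch hand-parses the four-hex-digit alternatives of the `[+U]` escape. The function names (`parseUnicodeEscapeU`, `readHex4`) and the returned object shape are hypothetical, and the `u{…}` form is deliberately left out; this is not how any particular engine implements it.

```javascript
const HEX4 = /^[0-9A-Fa-f]{4}/;

// Reads four hex digits at `pos`, returning their numeric value or null.
function readHex4(src, pos) {
  const m = HEX4.exec(src.slice(pos));
  return m ? parseInt(m[0], 16) : null;
}

const isLead = (v) => v >= 0xd800 && v <= 0xdbff;
const isTrail = (v) => v >= 0xdc00 && v <= 0xdfff;

// Hypothetical recognizer: `src` starts at the `u` that follows a backslash.
function parseUnicodeEscapeU(src, pos = 0) {
  if (src[pos] !== "u") return null;
  const first = readHex4(src, pos + 1);
  if (first === null) return null;

  if (isLead(first) && src.startsWith("\\u", pos + 5)) {
    // Lookahead over the next six code points: `\u` plus four hex digits.
    const second = readHex4(src, pos + 7);
    if (second !== null && isTrail(second)) {
      // `u` LeadSurrogate `\u` TrailSurrogate: combine into one code point.
      const codePoint = (first - 0xd800) * 0x400 + (second - 0xdc00) + 0x10000;
      return { codePoint, length: 11 };
    }
  }
  // Lone LeadSurrogate (lookahead failed), TrailSurrogate, or NonSurrogate.
  return { codePoint: first, length: 5 };
}

console.log(parseUnicodeEscapeU("uD834\\uDF06")); // { codePoint: 119558, length: 11 }
console.log(parseUnicodeEscapeU("uD834\\u0041")); // { codePoint: 55348, length: 5 }
```

The lone-LeadSurrogate result is only reachable after the six-code-point check fails, which is the behavior the `[lookahead != …]` annotation is meant to state formally.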

gibson042 · 2019-10-08T19:15:41Z

There are currently three flavors of negative lookahead assertions, all defined in Grammar Notation (emphasis mine):

[lookahead ∉ set], in which "set can be written as a comma separated list of one or two element terminal sequences enclosed in curly brackets"
[lookahead ∉ set], in which—"for convenience"—set can be "written as a nonterminal, in which case it represents the set of all terminals to which that nonterminal could expand"
[lookahead ≠ terminal]

So `u` LeadSurrogate [lookahead ≠ `\u` TrailSurrogate] would not be valid, because the ≠ notation is only allowed with a single terminal input element.

But addressing the more substantive point, I agree with @jmdyck that the intent is to bound lookahead, in particular limiting it to two terminals (corresponding with an LR(2) grammar). However, that is already not the case, for two reasons.

One of them has to do with the preservation of LineTerminator input elements, such that a strict reading of section 5.1.2 requires unbounded lookahead for [lookahead ≠ `let [`] (because `let` and `[` could be separated by an arbitrary number of LineTerminator-replaced MultiLineComment sequences), and even a loose reading in which consecutive LineTerminator elements were collapsed would still require a lookahead of up to three elements (`let`, LineTerminator, `[`).

The other reason why lookahead is not actually bounded at two or even three terminals is the very section that I am changing. \uD834\uDF06 in a Unicode regular expression must be parsed as a single U+1D306 TETRAGRAM FOR CENTRE code point expanded from RegExpUnicodeEscapeSequence_U :: `u` LeadSurrogate `\u` TrailSurrogate, rather than as a U+D834 code point expanded from RegExpUnicodeEscapeSequence_U :: `u` LeadSurrogate followed by a U+DF06 code point expanded from RegExpUnicodeEscapeSequence_U :: `u` TrailSurrogate. The content enforcing that restriction is currently prose rather than formal lookahead semantics, even though an actual implementation is not permitted to recognize a RegExpUnicodeEscapeSequence_U :: `u` LeadSurrogate expansion without confirming that the following six code points of source text do not match `\u` TrailSurrogate, which is exactly equivalent to a lookahead assertion!
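The observable effect of that parsing requirement can be checked in any engine that supports the `u` flag. The variable names below are illustrative only.

```javascript
// U+1D306 TETRAGRAM FOR CENTRE, stored as the surrogate pair D834 DF06.
const astral = "\u{1D306}";

// With the `u` flag, \uD834\uDF06 in the pattern is one code point.
const unicodePair = /^\uD834\uDF06$/u;
console.log(unicodePair.test(astral)); // true

// With the `u` flag, a lone lead surrogate escape does not match
// the first half of a surrogate pair in the input.
const unicodeLoneLead = /^\uD834/u;
console.log(unicodeLoneLead.test(astral)); // false

// Without the `u` flag, matching is per code unit, so the same escape
// matches the first code unit of the pair.
const legacyLoneLead = /^\uD834/;
console.log(legacyLoneLead.test(astral)); // true
```

If the pattern `\uD834\uDF06` were instead parsed as two separate surrogate escapes in Unicode mode, the first test would fail for the same reason the second one does.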

It is worth noting, but not necessarily compelling, that this applies to the Regular Expression grammar rather than to the syntactic grammar. Nevertheless, I believe that these particulars should be formally expressed where possible, even if that means admitting that ECMAScript is not as friendly to parse as it might otherwise appear to be (:roll_eyes:). I am willing to update the Grammar Notation section accordingly if that is the consensus. But an alternative does exist to expressing this requirement with a lookahead—introduction of a "longest expansion" rule analogous to the "longest input element" rule for lexical scanning. I'm not sure what exactly that would look like, but would be willing to try something out.

jmdyck · 2019-10-08T20:43:42Z

But an alternative does exist to expressing this requirement with a lookahead—introduction of a "longest expansion" rule analogous to the "longest input element" rule for lexical scanning. I'm not sure what exactly that would look like, but would be willing to try something out.

I have a feeling that would be a bad idea: too much chance of unintended consequences. (Might depend on the exact formulation, though.) Currently, I think using a lookahead-restriction is the best option, along with any necessary changes to the Grammar Notation section.

bakkot · 2019-11-28T18:02:40Z

My own preference, revisiting this, is to relax the length bound on lookahead restrictions from being bounded at 2 to being bounded at 4 or (if 4 does not suffice) 6, ideally with that relaxation scoped specifically to the RegExp grammar.

This might look like

In the RegExp grammar, the set can consist of nonempty sequences of terminals of length at most four. In the Syntactic grammar, the set can consist of nonempty sequences of terminals of length at most two. In the Lexical grammar, the set can consist of sequences of terminals of length exactly one. Lookaheads are not used in the Numeric String grammar. The set can be written as a comma separated list of ~~one or two element~~ terminal sequences enclosed in curly brackets. For convenience, the set can also be written as a nonterminal, in which case it represents the set of all sequences of terminals to which that nonterminal could expand.

(Additions in bold, removals ~~struck through~~.) This also has the advantage of being explicit that the restriction on length of lookaheads is a property of all lookaheads, not just a particular way of writing them.

Thoughts?

jmdyck · 2020-11-10T17:18:47Z

@bakkot: Those wording changes still wouldn't allow `\u` TrailSurrogate as a lookahead-sequence. (I.e., it isn't a comma-separated list of terminal sequences, or a nonterminal.)

gibson042 added 2 commits October 7, 2019 20:17

Editorial: Formalize consumption of surrogate pairs in Unicode regula…

da244ab

…r expressions

Editorial: Replace post-NumericLiteral lookahead prose with formal as…

a222ba6

…sertions

gibson042 added the "editorial change" and "needs consensus" labels Oct 8, 2019

gibson042 mentioned this pull request Oct 8, 2019

Editorial: Reference DecimalDigit rather than duplicating its RHS #1728

Merged

jmdyck reviewed Oct 8, 2019

bakkot reviewed Oct 8, 2019

ljharb requested a review from waldemarhorwat October 8, 2019 20:30

jmdyck mentioned this pull request Jan 3, 2020

Editorial: nested surrogate pairs? #969

Open

bakkot mentioned this pull request Dec 13, 2020

Editorial: reword definition of lookahead restrictions #2254

Merged

ljharb force-pushed the master branch 3 times, most recently from 3d0c24c to 7a79833 Compare June 29, 2021 02:21
