Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editorial: Formally disambiguate the non-Annex-B grammar #1727

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

gibson042
Copy link
Contributor

@gibson042 gibson042 commented Oct 8, 2019

@waldemarhorwat recently expressed objections to moving the Annex B regular expression syntax into the main grammar because of its order-dependent productions. However, I found an example of the same already in the main grammar, for surrogate pairs in codepoint-based "Unicode" regular expressions. This PR inserts a lookahead assertion to correct that ambiguity, and also applies the same treatment to replace some NumericLiteral prose with a formal assertion.

Closes #969.

@gibson042 gibson042 added editorial change needs consensus This needs committee consensus before it can be eligible to be merged. labels Oct 8, 2019
[+U] `u` TrailSurrogate
[+U] `u` NonSurrogate
[+U] RegExpUnicodeSurrogatePair
[+U] [lookahead ∉ RegExpUnicodeSurrogatePair] `u` Hex4Digits
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RegExpUnicodeSurrogatePair generates a set of terminal sequences, each of length 11. This doesn't fit with the current definition of lookahead-restrictions.

[+U] `u` TrailSurrogate
[+U] `u` NonSurrogate
[+U] RegExpUnicodeSurrogatePair
[+U] [lookahead ∉ RegExpUnicodeSurrogatePair] `u` Hex4Digits
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lookahead ∉ RegExpUnicodeSurrogatePair seems a little suspicious to me, since RegExpUnicodeSurrogatePair describes a set of sequences of terminals rather than just a set of terminals. As far as I know the only other place where there are sequences of length greater than 1 in a lookahead-restriction-set is the async function restriction in ExpressionStatement, and there the sequence is of length precisely two (and also it's written out more explicitly).

Could this instead be written as

[+U] `u` LeadSurrogate [lookahead ≠ `\u` TrailSurrogate]
[+U] `u` LeadSurrogate `\u` TrailSurrogate
[+U] `u` TrailSurrogate
[+U] `u` NonSurrogate

? That feels clearer to me, if it's equivalent. (And if it's not, I'm confused.)

Copy link
Contributor Author

@gibson042 gibson042 Oct 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is invalid (see below), but even with refactoring would still be a lookahead of six code points. And it's not more clear to me, but I would be willing to switch to it if there's consensus.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, in the ES grammars, a lookahead-constraint either:
(a) occurs at the end of a right-hand-side, or
(b) occurs before a nonterminal, where that nonterminal derives phrases that begin with the disallowed sequences (i.e. the constraint is 'scoped' to that nonterminal).

So the right-hand-side:

[lookahead ∉ RegExpUnicodeSurrogatePair] `u` Hex4Digits

would be quite unusual, in that the lookahead-constraint has to "look through" the terminal u and nonterminal HexDigits, and then look past them to a potential u Hex4Digits following. For this reason, I prefer @bakkot's suggestion:

LeadSurrogate [lookahead != `\u` TrailSurrogate]

as it eliminates the "look through" and winds up with a fairly standard end-of-RHS constraint. (But yes, the nature of its lookahead-sequence would require tweaking 5.1.5 Grammar Notation.)

Also, I think I'd prefer it to come after the Lead+Trail right-hand-side:

[+U] `u` LeadSurrogate `\u` TrailSurrogate
[+U] `u` LeadSurrogate [lookahead != `\u` TrailSurrogate]

(The other thing I like about this solution is that just those two lines make it really obvious why the lookahead-constraint is needed.)

@gibson042
Copy link
Contributor Author

gibson042 commented Oct 8, 2019

There are currently three flavors of negative lookahead assertions, all defined in Grammar Notation (emphasis mine):

  • [lookahead ∉ set], in which "set can be written as a comma separated list of one or two element terminal sequences enclosed in curly brackets"
  • [lookahead ∉ set], in which—"for convenience"—set can be "written as a nonterminal, in which case it represents the set of all terminals to which that nonterminal could expand"
  • [lookahead ≠ terminal]

So `u` LeadSurrogate [lookahead ≠ `\u` TrailSurrogate] would not be valid, because the ≠ notation is only allowed with a single terminal input element.

But addressing the more substantive point, I agree with @jmdyck that the intent is to bound lookahead, in particular limiting it to two terminals (corresponding with an LR(2) grammar). However, that is already not the case for two reasons. One of them has to do with the preservation of
LineTerminator input elements, such that a strict reading of section 5.1.2 requires unbounded lookahead for [lookahead ≠ `let [`] (because let and [ could be separated by an arbitrary amount of LineTerminator-replaced MultiLineComment sequences), and even a loose reading in which consecutive LineTerminator elements were collapsed would still require a lookahead of up to three elements (let, LineTerminator, [). And the other reason why lookahead is not actually bounded at two or even three terminals is the very section that I am changing... \uD834\uDF06 in a Unicode regular expression must be parsed as a single U+1D306 TETRAGRAM FOR CENTRE code point expanded from RegExpUnicodeEscapeSequence_U :: `u` LeadSurrogate `\u` TrailSurrogate rather than as a U+D834 code point expanded from RegExpUnicodeEscapeSequence_U :: `u` LeadSurrogate followed by a U+DF06 code point expanded from RegExpUnicodeEscapeSequence_U :: `u` TrailSurrogate, and the content enforcing that restriction is currently prose rather than formal lookahead semantics, even though an actual implementation is not permitted to recognize a RegExpUnicodeEscapeSequence_U :: `u` LeadSurrogate expansion without confirming that the following six code points of source text do not match `\u` TrailSurrogate—which is exactly equivalent to a lookahead assertion!

It is worth noting, but not necessarily compelling, that this applies to the Regular Expression grammar rather than to the syntactic grammar. Nevertheless, I believe that these particulars should be formally expressed where possible, even if that means admitting that ECMAScript is not as friendly to parse as it might otherwise appear to be (:roll_eyes:). I am willing to update the Grammar Notation section accordingly if that is the consensus. But an alternative does exist to expressing this requirement with a lookahead—introduction of a "longest expansion" rule analogous to the "longest input element" rule for lexical scanning. I'm not sure what exactly that would look like, but would be willing to try something out.

@jmdyck
Copy link
Collaborator

jmdyck commented Oct 8, 2019

But an alternative does exist to expressing this requirement with a lookahead—introduction of a "longest expansion" rule analogous to the "longest input element" rule for lexical scanning. I'm not sure what exactly that would look like, but would be willing to try something out.

I have a feeling that would be a bad idea: too much chance of unintended consequences. (Might depend on the exact formulation, though.) Currently, I think using a lookahead-restriction is the best option, along with any necessary changes to the Grammar Notation section.

@bakkot
Copy link
Contributor

bakkot commented Nov 28, 2019

My own preference, revisiting this, is to relax the length bound on lookahead restrictions from being bounded at 2 to being bounded at 4 or (if 4 does not suffice) 6, ideally with that relaxation scoped specifically to the RegExp grammar.

This might look like

In the RegExp grammar, the set can consist of nonempty sequences of terminals of length at most four. In the Syntactic grammar, the set can consist of nonempty sequences of terminals of length at most two. In the Lexical grammar, the set can consist of sequences of terminals of length exactly one. Lookaheads are not used in the Numeric String grammar. The set can be written as a comma separated list of one or two element terminal sequences enclosed in curly brackets. For convenience, the set can also be written as a nonterminal, in which case it represents the set of all sequences of terminals to which that nonterminal could expand.

(Additions in bold, removals ~struck through.) This also has the advantage of being explicit that the restriction on length of lookaheads is a property of all lookaheads, not just a particular way of writing them.

Thoughts?

@jmdyck
Copy link
Collaborator

jmdyck commented Nov 10, 2020

@bakkot: Those wording changes still wouldn't allow \u TrailSurrogate as a lookahead-sequence. (I.e., it isn't a comma-separated list of terminal sequences, or a nonterminal.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
editorial change needs consensus This needs committee consensus before it can be eligible to be merged.
3 participants