Editorial: Eliminate order-disambiguation from Annex B Pattern-grammar #2445
Analysis
B.1.4 says:
That is, if we consider the B.1.4 grammar under [+U], it should duplicate the 22.2.1 grammar under [+U], which we can assume does not have ambiguities. Thus, we only need to look for ambiguities in the B.1.4 grammar under [~U]. This comment will go to each production in B.1.4, and:
I'll consider the productions bottom-up (roughly the reverse of their order in the spec), so that our knowledge builds up as we go.
forms:
There's no overlap, so no ambiguities.
For any given setting of [N], there's only one alternative, so there can't be an ambiguity. Note that SourceCharacter is any Unicode code point, so below, I'll paraphrase SourceCharacterIdentityEscape as "any char except
forms:
Under [~U]:
forms:
There's an ambiguity between alt3 and alt6, but it's exactly the same ambiguity as in EscapeSequence in PR #1867, and so will be handled by the approach there (i.e., by modifying LegacyOctalEscapeSequence to carve out the
The remaining ambiguities are between alt7 and everything else (except alt2).
Consider alt1 (ControlEscape). Clearly, if the current input begins with [fnrtv], this alternative will match, and alt7 will never get a chance. So we can resolve this ambiguity simply by excluding [fnrtv] from IdentityEscape. We could do this by appending "but not one of
Similarly, if the current input begins with [0-8], it's always the case that either alt3 or alt6 will match (though this isn't obvious), so we can resolve this ambiguity by excluding [0-8] from IdentityEscape.
However, this approach doesn't work for alt4 and alt5. If the current input begins with
So IdentityEscape must still be free to recognize
Note that HexEscapeSequence derives a finite set of three-character sequences, and RegExpUnicodeEscapeSequence derives a rather large but still finite set of sequences of various lengths, so this is still within the definition of lookahead-constraints in 5.1.5 Grammar Notation.
Note also that, under [+U], these lookahead-constraints are satisfied automatically.
A quick summary for below: CharacterEscape derives forms that start with any char except
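To make the finiteness point concrete, here is a minimal Python sketch that enumerates every terminal sequence HexEscapeSequence derives (`x` followed by two hex digits) and treats a lookahead-constraint as plain set membership. The exact constraint used in the PR is not visible here, so the shape `[lookahead ∉ HexEscapeSequence]` and the helper name `lookahead_excluded` are assumptions made for illustration.

```python
from itertools import product

# Hex digits as the grammar allows them: 0-9, a-f, A-F (22 characters).
HEX_DIGITS = "0123456789abcdefABCDEF"

# Every terminal sequence derived by HexEscapeSequence: 'x' + two hex digits.
HEX_ESCAPES = {"x" + a + b for a, b in product(HEX_DIGITS, repeat=2)}

def lookahead_excluded(text, pos=0):
    """Hypothetical helper: True if the input at pos begins with a member of
    the finite set, i.e. an assumed [lookahead not-in HexEscapeSequence]
    constraint would reject this position."""
    return text[pos:pos + 3] in HEX_ESCAPES
```

Because the set is finite (22 × 22 sequences), the check is a bounded comparison against the next three characters. An input such as `xZ1` is not excluded, which is the case where IdentityEscape still has to recognize the `x`.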
forms:
(Note that, for alt4, CharacterClassEscape only includes UnicodeProperty stuff under [+U], which we can ignore.)
alt1, alt3, alt4 are disjoint, so the ambiguities all involve alt5 (CharacterEscape).
alt3 and alt5 are disjoint, because alt3 only derives
alt5 has ambiguities with alt1 and alt4, which we could resolve with:
CharacterEscape appears in RHS of both ClassEscape and AtomEscape, so we can't push this exclusion into the definition of CharacterEscape unless the two uses agree. But looking ahead, we see that AtomEscape also has a CharacterClassEscape alt followed by a CharacterEscape alt, so they do agree on that exclusion. That is, alt5 here will be:
and we can push the CharacterClassEscape [dswDSW] exclusion down into CharacterEscape, and thence into IdentityEscape, and thence into SourceCharacterIdentityEscape.
A quick summary for below: ClassEscape derives forms that start with any char except
forms:
alt1 can't be But alt2 and alt3 conflict if the current input starts with
This goes slightly outside what 5.1.5 Grammar Notation allows, in that (Alternatively, one could say:
but I think that might be worse.)
forms:
alt2, alt3, alt5 are disjoint, so alt4 is the source of all ambiguities.
alt4 and alt5 are disjoint: the "except
alt3 conflicts with alt4, but that's easy to deal with, and in fact the change to CharacterEscape proposed above under ClassEscape already took care of it.
alt2 also conflicts with alt4, when the current input starts with [1-9]. The resolution is (roughly) to prepend "[lookahead isnt alt2]" to alt4, but there are various ways it could be expressed. I wound up defining
and then tweaking AtomEscape to
(I split the CharacterEscape alternative into [+U] and [~U] versions because the reference to ConstrainedDecimalEscape wouldn't have made sense under [+U].)
Note that, while DecimalEscape derives an infinite set of terminal-sequences, the "but only if" limits it to a finite set, so
A quick summary for below: AtomEscape derives forms that start with any char, and then maybe have more chars.
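The [1-9] conflict can be illustrated with a small Python sketch. It assumes that the "but only if" condition compares the numeric value of the DecimalEscape against the number of capturing groups in the pattern, which is how Annex B words the restriction; the function names and the `group_count` parameter are hypothetical and are not taken from the PR.

```python
import re

def constrained_decimal_escapes(group_count):
    """The finite set of digit sequences that pass the assumed 'but only if'
    test for a pattern with group_count capturing groups."""
    return {str(n) for n in range(1, group_count + 1)}

def classify_decimal_escape(body, group_count):
    """Classify the text after a backslash when it starts with [1-9].

    Returns ("backreference", digits) when the digit run is in the
    constrained set, ("character-escape", None) when the input must fall
    through to CharacterEscape, and None when body does not start with [1-9].
    """
    m = re.match(r"[1-9][0-9]*", body)  # DecimalEscape takes the whole digit run
    if m is None:
        return None
    if m.group() in constrained_decimal_escapes(group_count):
        return ("backreference", m.group())
    return ("character-escape", None)
```

With one capturing group, `classify_decimal_escape("1", 1)` is treated as a backreference, while `classify_decimal_escape("2", 1)` falls through to CharacterEscape. For any fixed `group_count` the constrained set is finite, so a lookahead test against it stays bounded.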
Only one alternative, so no ambiguities.
The three alternatives are disjoint, so no ambiguities.
forms:
alt2 conflicts with alt3 on
so we can resolve this by changing alt3 to:
(similar to the change in ClassAtomNoDash). alt7 conflicts with alt8 on
The problem with this is that InvalidBracedQuantifier derives an infinite set of terminal-sequences (because its DecimalDigits can be arbitrarily long). However, checking whether the lookahead matches InvalidBracedQuantifier doesn't seem unreasonable, so perhaps we can relax 5.1.5's restriction on the constraint-set from finite set to regular set.
A quick summary for below: ExtendedAtom derives forms that start with (any char except
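As a rough illustration of why a regular constraint-set is still easy to check, the following Python sketch tests whether the lookahead matches InvalidBracedQuantifier with a single regular expression. It assumes the three Annex B forms (`{ DecimalDigits }`, `{ DecimalDigits ,}`, `{ DecimalDigits , DecimalDigits }`); the function name is hypothetical.

```python
import re

# Assumed shape of InvalidBracedQuantifier: '{' digits, then optionally
# ',' and optional digits, then '}'. The digit runs are unbounded, so the
# set is infinite but regular.
INVALID_BRACED_QUANTIFIER = re.compile(r"\{[0-9]+(?:,[0-9]*)?\}")

def lookahead_is_invalid_braced_quantifier(text, pos=0):
    """True if the input at pos begins with an InvalidBracedQuantifier."""
    return INVALID_BRACED_QUANTIFIER.match(text, pos) is not None
```

The check needs no lookup table and no bound on the digit count, which is the sense in which relaxing "finite set" to "regular set" keeps the lookahead decidable by a simple scan.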
Alternatives are disjoint, so no ambiguities.
forms:
Alternatives are disjoint, so no ambiguities.
forms:
Consider alt6 and alt7:
There's clearly a shift-reduce conflict after ExtendedAtom, but is there an ambiguity? If the text matching Quantifier is (or starts with) [*+?], there's no ambiguity, because (in this context) that can only be a Quantifier. But if the text matching Quantifier starts with '{', then formally there's an ambiguity, e.g.
or
But any occurrence of InvalidBracedQuantifier is defined (by Early Error rule) to be a Syntax Error, so I'm guessing that this doesn't count as an ambiguity as far as the spec is concerned. Therefore, we don't need to disambiguate between alt6 and alt7.
Similarly, alt4 conflicts with alt5 on QuantifiableAssertion followed by Quantifier (or not), but I'm assuming the spec doesn't consider it an ambiguity.
It looks like alt4 might conflict with alt6/alt7, but no, ExtendedAtom can't start with
Lastly, alt5 conflicts with alt6/alt7 on
to
(We could push the exclusion down into AtomEscape[~U]:
but I'm not sure that would be an improvement.)
(force-pushed to rebase to master + resolve merge conflicts from #2411)
B.1.4 Regular Expressions Patterns says:
This PR eliminates these order-dependencies (mostly by inserting equivalent lookahead-constraints).
(I made this a Draft PR because it isn't in a final (mergeable) state. On the other hand, it is ready for review, at least to the extent of deciding whether to pursue it.)
The basic idea is that an order-disambiguated production such as:
can be transformed into an equivalent "normal" production:
Of course, applied naively, this would be verbose and hard to read. (Some productions have 9 alternatives!) So instead, we only insert lookahead-constraints (or other exclusions) where an ambiguity actually exists.
Also, there's the risk that an alt might be grammatically more complex than we want to have in a lookahead-constraint. In practice, it looks like some are more complex than we currently allow, but perhaps not unreasonably so.
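The equivalence behind this transformation can be sketched on a toy production in Python. Each alternative is modelled as a regular expression for the terminal sequences it derives; the alternative names and patterns are made up for illustration, and the guard shape (alternative k carries "lookahead is not any of the earlier alternatives") is an assumption about the naive transformation, not a quotation from the PR.

```python
import re

# Toy alternatives of a hypothetical production, in spec order.
ALTS = [
    ("ControlLike", r"[fnrtv]"),
    ("HexLike", r"x[0-9a-fA-F]{2}"),
    ("IdentityLike", r"."),
]

def all_matches(text):
    """Every alternative that matches at the start of text (no disambiguation)."""
    return [name for name, pat in ALTS if re.match(pat, text, re.S)]

def ordered_choice(text):
    """Order-disambiguated reading: the first matching alternative wins."""
    matches = all_matches(text)
    return matches[0] if matches else None

def guarded_matches(text):
    """'Normal' reading: alternative k applies only if it matches and the
    lookahead matches none of alternatives 1..k-1."""
    result = []
    for k, (name, pat) in enumerate(ALTS):
        blocked = any(re.match(p, text, re.S) for _, p in ALTS[:k])
        if not blocked and re.match(pat, text, re.S):
            result.append(name)
    return result
```

For input `n`, `all_matches` reports two candidates (an ambiguity), while `guarded_matches` reports only the one that `ordered_choice` picks. Guarding every alternative this way is the verbose form; the guards that matter are the ones where `all_matches` can return more than one name.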