Editorial: Eliminate order-disambiguation from Annex B Pattern-grammar #2445

Draft
wants to merge 1 commit into
base: main

jmdyck
Collaborator

@jmdyck jmdyck commented Jun 27, 2021

B.1.4 Regular Expressions Patterns says:

The syntax of 22.2.1 is modified and extended as follows. These changes introduce ambiguities that are broken by the ordering of grammar productions and by contextual information. When parsing using the following grammar, each alternative is considered only if previous production alternatives do not match.

This PR eliminates these order-dependencies (mostly by inserting equivalent lookahead-constraints).

(I made this a Draft PR because it isn't in a final (mergeable) state. On the other hand, it is ready for review, at least to the extent of deciding whether to pursue it.)


The basic idea is that an order-disambiguated production such as:

lhs ::
    alt1
    alt2
    alt3

can be transformed into an equivalent "normal" production:

lhs ::
    alt1
    [lookahead != alt1] alt2
    [lookahead != alt1] [lookahead != alt2] alt3

Of course, applied naively, this would be verbose and hard to read. (Some productions have 9 alternatives!) So instead, we only insert lookahead-constraints (or other exclusions) where an ambiguity actually exists.

Also, there's the risk that an alt might be grammatically more complex than we want to have in a lookahead-constraint. In practice, it looks like some are more complex than we currently allow, but perhaps not unreasonably so.
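The order-to-lookahead transformation described above can be sketched as a toy model in JavaScript. This is only an illustration under simplifying assumptions: an "alternative" is modelled as a function that matches a prefix of the input, the rest of the input is ignored, and the names `lit`, `orderedChoice`, `guardAlternatives` and `unorderedChoice` are hypothetical, not anything from the spec.

```javascript
// Toy model (assumption): an alternative returns the matched prefix, or null.
const lit = (s) => (input) => (input.startsWith(s) ? s : null);

// Order-disambiguated production: take the first alternative that matches.
function orderedChoice(alts) {
  return (input) => {
    for (const alt of alts) {
      const m = alt(input);
      if (m !== null) return m;
    }
    return null;
  };
}

// "Normal" production: alternative i is guarded by [lookahead != alt_j]
// for every earlier alternative j.
function guardAlternatives(alts) {
  return alts.map((alt, i) => (input) =>
    alts.slice(0, i).some((prev) => prev(input) !== null) ? null : alt(input)
  );
}

// With the guards in place at most one alternative can match,
// so the order in which they are tried no longer matters.
function unorderedChoice(guarded) {
  return (input) => {
    const matches = guarded.map((g) => g(input)).filter((m) => m !== null);
    if (matches.length > 1) throw new Error("ambiguous");
    return matches.length === 1 ? matches[0] : null;
  };
}

const alts = [lit("ab"), lit("a"), lit("abc")];
const ordered = orderedChoice(alts);
// Reverse the guarded alternatives to show that order is irrelevant.
const unordered = unorderedChoice(guardAlternatives(alts).reverse());

for (const input of ["abc", "ab", "a", "b", ""]) {
  console.log(JSON.stringify(input), ordered(input), unordered(input));
}
```

In this model the two recognizers agree on every input, and the third alternative (`abc`) is never selected, which is the kind of redundancy the naive transformation makes visible.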

@jmdyck
Collaborator Author

jmdyck commented Jun 27, 2021

Analysis

B.1.4 says:

This alternative pattern grammar and semantics only changes the syntax and semantics of BMP patterns. The following grammar extensions include productions parameterized with the [U] parameter. However, none of these extensions change the syntax of Unicode patterns recognized when parsing with the [U] parameter present on the goal symbol.

That is, if we consider the B.1.4 grammar under [+U], it should duplicate the 22.2.1 grammar under [+U], which we can assume does not have ambiguities. Thus, we only need to look for ambiguities in the B.1.4 grammar under [~U].

This comment will go to each production in B.1.4, and:

  • look at the sentential forms generated (under [~U]) by each alternative (using an ad-hoc notation that combines BNF and regex),
  • identify the ambiguities (if any), and
  • suggest how they can be eliminated.

I'll consider the productions bottom-up (roughly the reverse of their order in the spec), so that our knowledge builds up as we go.


ClassControlLetter ::
  DecimalDigit
  `_`

forms:

  • alt1: [0-9]
  • alt2: _

There's no overlap, so no ambiguities.


SourceCharacterIdentityEscape[N] ::
  [~N] SourceCharacter but not `c`
  [+N] SourceCharacter but not one of `c` or `k`

For any given setting of [N], there's only one alternative, so there can't be an ambiguity.

Note that SourceCharacter is any Unicode code point, so below I'll paraphrase SourceCharacterIdentityEscape as "any char except c, and except k under [+N]".


IdentityEscape[U, N] ::
  [+U] SyntaxCharacter
  [+U] `/`
  [~U] SourceCharacterIdentityEscape[?N]

forms:

  • alt1: ignore
  • alt2: ignore
  • alt3: any char except c and except k under [+N]

Under [~U]:

  • there's only one alternative, so no ambiguities.
  • IdentityEscape has the same expansion as SourceCharacterIdentityEscape

CharacterEscape[U, N] ::
  ControlEscape
  `c` ControlLetter
  `0` [lookahead <! DecimalDigit]
  HexEscapeSequence
  RegExpUnicodeEscapeSequence[?U]
  [~U] LegacyOctalEscapeSequence
  IdentityEscape[?U, ?N]

forms:

  • alt1: [fnrtv]
  • alt2: c [a-zA-Z]
  • alt3: 0 [...]
  • alt4: x ...
  • alt5: u ...
  • alt6: [0-7] ...
  • alt7: any char except c and except k under [+N]

There's an ambiguity between alt3 and alt6, but it's exactly the same ambiguity as in EscapeSequence in PR #1867, and so will be handled by the approach there (i.e., by modifying LegacyOctalEscapeSequence to carve out the 0 [lookahead <! DecimalDigit] syntax).

The remaining ambiguities are between alt7 and everything else (except alt2).

Consider alt1 (ControlEscape). Clearly, if the current input begins with [fnrtv], this alternative will match, and alt7 will never get a chance. So we can resolve this ambiguity simply by excluding [fnrtv] from IdentityEscape. We could do this by appending "but not one of f or n or ..." to alt7, but because this is the only place (in the Annex B grammar) where IdentityEscape appears on the right-hand side, we can instead incorporate the exclusion into the definition of IdentityEscape (under [~U]). And from there, we can likewise push the exclusion down into the definition of SourceCharacterIdentityEscape.

Similarly, if the current input begins with [0-7], it's always the case that either alt3 or alt6 will match (though this isn't obvious), so we can resolve this ambiguity by excluding [0-7] from IdentityEscape.

However, this approach doesn't work for alt4 and alt5. If the current input begins with x, it might not match alt4 (HexEscapeSequence). And if it begins with u, it might not match alt5 (RegExpUnicodeEscapeSequence). In either case, under Annex B rules, alt7 will then be considered, and it will consume the x or u.

So IdentityEscape must still be free to recognize x and u, but only if the current input doesn't match HexEscapeSequence or RegExpUnicodeEscapeSequence.

  [lookahead <! HexEscapeSequence] [lookahead <! RegExpUnicodeEscapeSequence] IdentityEscape[?U, ?N]

Note that HexEscapeSequence derives a finite set of three-character sequences, and RegExpUnicodeEscapeSequence derives a rather large but still finite set of sequences of various lengths, so this is still within the definition of lookahead-constraints in 5.1.5 Grammar Notation.

Note also that, under [+U], these lookahead-constraints are satisfied automatically.

A quick summary for below: CharacterEscape derives forms that start with any char except k under [+N], and then maybe have more chars.
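As a sanity check of the x/u fallback discussed above, here is a small JavaScript sketch. It assumes an engine that implements the Annex B pattern grammar (as browsers and Node.js do). The helper `compiles` and the object name `characterEscapeChecks` are made up for illustration.

```javascript
// Hypothetical helper: does this pattern source compile with the given flags?
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const characterEscapeChecks = {
  // alt4: `\x41` is a complete HexEscapeSequence.
  hexEscape: /^\x41$/.test("A"),
  // `\x4` is not a HexEscapeSequence, so alt7 (IdentityEscape) consumes `x`.
  hexFallback: /^\x4$/.test("x4"),
  // alt5: `\u0041` is a complete RegExpUnicodeEscapeSequence.
  unicodeEscape: /^\u0041$/.test("A"),
  // `\u004` is not, so alt7 consumes `u` and `004` is matched literally.
  unicodeFallback: /^\u004$/.test("u004"),
  // Under [+U] there is no such fallback: the pattern is rejected.
  rejectedUnderU: !compiles(String.raw`\x4`, "u"),
};

console.log(characterEscapeChecks);
```

Every entry should come out `true` in such an engine. The two fallback cases are the reason IdentityEscape can't simply exclude x and u.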


ClassEscape[U, N] ::
  `b`
  [+U] `-`
  [~U] `c` ClassControlLetter
  CharacterClassEscape[?U]
  CharacterEscape[?U, ?N]

forms:

  • alt1: b
  • alt2: ignore
  • alt3: c [0-9_]
  • alt4: [dswDSW]
  • alt5: (any char except k under [+N]) & maybe more chars

(Note that, for alt4, CharacterClassEscape only includes UnicodeProperty stuff under [+U], which we can ignore.)

alt1, alt3, alt4 are disjoint, so the ambiguities all involve alt5 (CharacterEscape)

alt3 and alt5 are disjoint, because alt3 only derives c [0-9_], and of the forms that alt5 derives that begin with c, there's only c [a-zA-Z].

alt5 has ambiguities with alt1 and alt4, which we could resolve with:

  [lookahead <! {`b`, `d`, `s`, `w`, `D`, `S`, `W`}] CharacterEscape[?U, ?N]

CharacterEscape appears in RHS of both ClassEscape and AtomEscape, so we can't push this exclusion into the definition of CharacterEscape unless the two uses agree. But looking ahead, we see that AtomEscape also has a CharacterClassEscape alt followed by a CharacterEscape alt, so they do agree on that exclusion. That is, alt5 here will be:

  [lookahead != `b`] CharacterEscape[?U, ?N]

and we can push the CharacterClassEscape [dswDSW] exclusion down into CharacterEscape, and thence into IdentityEscape, and thence into SourceCharacterIdentityEscape.

A quick summary for below: ClassEscape derives forms that start with any char except k under [+N], and then maybe have more chars.
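For concreteness, here is how the ClassEscape alternatives look from JavaScript. This is a sketch that assumes an Annex B engine. The object name `classEscapeChecks` is just for illustration, and the control characters follow the "code point modulo 32" rule for `c` escapes.

```javascript
const classEscapeChecks = {
  // alt1: inside a class, `\b` is backspace (U+0008).
  backspace: /^[\b]$/.test("\b"),
  // alt3: `c` ClassControlLetter; "1" is U+0031, and 0x31 % 32 = 0x11.
  digitControl: /^[\c1]$/.test("\x11"),
  // alt3 again; "_" is U+005F, and 0x5F % 32 = 0x1F.
  underscoreControl: /^[\c_]$/.test("\x1F"),
  // alt4: CharacterClassEscape.
  digitClass: /^[\d]$/.test("7"),
  // alt5 via CharacterEscape :: `c` ControlLetter; "J" is U+004A, 0x4A % 32 = 0x0A.
  letterControl: /^[\cJ]$/.test("\n"),
};

console.log(classEscapeChecks);
```

All five should be `true`. Each input here is matched by a different alternative, consistent with alt3 and alt5 being disjoint on forms starting with `c`.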


ClassAtomNoDash[U, N] ::
  SourceCharacter but not one of `\` or `]` or `-`
  `\` ClassEscape[?U, ?N]
  `\` [lookahead == `c`]

forms:

  • alt1: any char except \ or ] or -
  • alt2: \ (any char except k under [+N]) (maybe more chars)
  • alt3: \ [lookahead == c]

alt1 can't be \, so can't conflict with alt2 or alt3.

But alt2 and alt3 conflict if the current input starts with \c. alt3 should only be considered if alt2 fails to match, i.e. if the input doesn't match c ClassControlLetter or c ControlLetter:

  `\` [lookahead == `c`] [lookahead <! `c` ClassControlLetter] [lookahead <! `c` ControlLetter]

This goes slightly outside what 5.1.5 Grammar Notation allows, in that c ClassControlLetter (likewise c ControlLetter) is neither an explicit set of token sequences, nor a single nonterminal. But the extension seems minor, as the phrase still expands to a finite set of token sequences.

(Alternatively, one could say:

  `\` [lookahead == `c`] [lookahead <! ClassEscape[?U, ?N]]

but I think that might be worse.)
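The `\c` conflict in ClassAtomNoDash can be observed from JavaScript. The sketch below assumes an Annex B engine, and `danglingClassC` is a made-up name. When `\c` is not followed by a ClassControlLetter or ControlLetter, alt3 matches only the backslash, and the `c` then becomes an ordinary class member.

```javascript
const danglingClassC = {
  // `[\c]`: alt3 contributes the backslash ...
  backslash: /^[\c]$/.test("\\"),
  // ... and the following `c` is a plain SourceCharacter class atom.
  letterC: /^[\c]$/.test("c"),
  // No control character is produced in this case.
  noControlChar: /^[\c]$/.test("\x03") === false,
  // `!` is neither a ControlLetter nor a ClassControlLetter,
  // so the class contains backslash, `c` and `!`.
  followedByBang: /^[\c!]+$/.test("\\c!"),
};

console.log(danglingClassC);
```

Each entry should be `true`, which is the behaviour the extra `[lookahead <! ...]` constraints on alt3 are meant to preserve.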


AtomEscape[U, N] ::
  [+U] DecimalEscape
  [~U] DecimalEscape [> but only if the CapturingGroupNumber of |DecimalEscape| is &le; _NcapturingParens_]
  CharacterClassEscape[?U]
  CharacterEscape[?U, ?N]
  [+N] `k` GroupName[?U]

forms:

  • alt1: ignore
  • alt2: [1-9][0-9]* [> but only ...]
  • alt3: [dswDSW]
  • alt4: (any char except k under [+N]) && maybe more chars
  • alt5: [+N] k ...

alt2, alt3, alt5 are disjoint, so alt4 is the source of all ambiguities.

alt4 and alt5 are disjoint: the "except k under [+N]" that we've been carrying along since nearly the start finally pays off.

alt3 conflicts with alt4, but that's easy to deal with, and in fact the change to CharacterEscape proposed above under ClassEscape already took care of it.

alt2 also conflicts with alt4, when the current input starts with [1-9]. The resolution is (roughly) to prepend "[lookahead isnt alt2]" to alt4, but there are various ways it could be expressed. I wound up defining

ConstrainedDecimalEscape ::
  DecimalEscape [> but only if the CapturingGroupNumber of |DecimalEscape| is &le; _NcapturingParens_]

and then tweaking AtomEscape to

AtomEscape[U, N] ::
  [+U] DecimalEscape
  [~U] ConstrainedDecimalEscape
  CharacterClassEscape[?U]
  [+U] CharacterEscape[?U, ?N]
  [~U] [lookahead <! ConstrainedDecimalEscape] CharacterEscape[?U, ?N]
  [+N] `k` GroupName[?U]

(I split the CharacterEscape alternative into [+U] and [~U] versions because the reference to ConstrainedDecimalEscape wouldn't have made sense under [+U].)

Note that, while DecimalEscape derives an infinite set of terminal-sequences, the "but only if" limits it to a finite set, so [lookahead <! ConstrainedDecimalEscape] seems to be a valid lookahead-constraint.

A quick summary for below: AtomEscape derives forms that start with any char, and then maybe have more chars.
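The alt2/alt4 and alt4/alt5 interactions in AtomEscape can be seen in the following JavaScript sketch. It assumes an Annex B engine with named-group support; `compiles` and `atomEscapeChecks` are hypothetical names.

```javascript
// Hypothetical helper: does this pattern source compile with the given flags?
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const atomEscapeChecks = {
  // alt2: one capturing group exists, so `\1` is a backreference.
  backreference: /^(a)\1$/.test("aa"),
  // No capturing groups: the "but only if" fails and alt4 takes over,
  // reading `\1` as a LegacyOctalEscapeSequence (U+0001).
  legacyOctal: /^\1$/.test("\x01"),
  // `8` is not an octal digit, so `\8` ends up as an IdentityEscape.
  identityEight: /^\8$/.test("8"),
  // Under [~N] (no named groups in the pattern), `\k` is an IdentityEscape.
  identityK: /^\k$/.test("k"),
  // Under [+N], `\k` must start a named backreference (alt5) ...
  namedBackreference: /^(?<n>a)\k<n>$/.test("aa"),
  // ... and a bare `\k` is a syntax error.
  bareKRejected: !compiles(String.raw`\k(?<n>a)`),
};

console.log(atomEscapeChecks);
```

All entries should be `true`. The `legacyOctal` case is the one that motivates the `[lookahead <! ConstrainedDecimalEscape]` guard.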


ExtendedPatternCharacter ::
  SourceCharacter but not one of `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `|`

Only one alternative, so no ambiguities.


InvalidBracedQuantifier ::
  `{` DecimalDigits[~Sep] `}`
  `{` DecimalDigits[~Sep] `,` `}`
  `{` DecimalDigits[~Sep] `,` DecimalDigits[~Sep] `}`

The three alternatives are disjoint, so no ambiguities.


ExtendedAtom[N] ::
  `.`
  `\` AtomEscape[~U, ?N]
  `\` [lookahead == `c`]
  CharacterClass[~U]
  `(` Disjunction[~U, ?N] `)`
  `(` `?` `:` Disjunction[~U, ?N] `)`
  InvalidBracedQuantifier
  ExtendedPatternCharacter

forms:

  • alt1: .
  • alt2: \ (any char) && (maybe more chars)
  • alt3: \ [lookahead == c]
  • alt4: [ ...
  • alt5: ( [^?] ...
  • alt6: (?: ...
  • alt7: { ...
  • alt8: any char except ^ $ \ . * + ? ( ) [ |

alt2 conflicts with alt3 on \c. AtomEscape derives forms starting with c via

CharacterEscape :: `c` ControlLetter

so we can resolve this by changing alt3 to:

  `\` [lookahead == `c`] [lookahead <! `c` ControlLetter]

(similar to the change in ClassAtomNoDash).

alt7 conflicts with alt8 on {. To resolve, we could change alt8:

  [lookahead <! InvalidBracedQuantifier] ExtendedPatternCharacter

The problem with this is that InvalidBracedQuantifier derives an infinite set of terminal-sequences (because its DecimalDigits can be arbitrarily long). However, checking whether the lookahead matches InvalidBracedQuantifier doesn't seem unreasonable, so perhaps we can relax 5.1.5's restriction on the constraint-set from finite set to regular set.

A quick summary for below: ExtendedAtom derives forms that start with (any char except ^ $ * + ? ) |) && maybe have more chars.
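Here is a JavaScript sketch of the two ExtendedAtom conflicts discussed above (`\c` and `{`). It assumes an Annex B engine; `compiles` and `extendedAtomChecks` are names invented for the example.

```javascript
// Hypothetical helper: does this pattern source compile with the given flags?
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const extendedAtomChecks = {
  // alt2: `\cJ` is CharacterEscape :: `c` ControlLetter (U+000A).
  controlEscape: /^\cJ$/.test("\n"),
  // alt3: a bare `\c` matches a backslash; the `c` is then a pattern character.
  danglingC: /^\c$/.test("\\c"),
  // Outside a class, a digit is not a control letter, so alt3 applies again.
  danglingCDigit: /^\c1$/.test("\\c1"),
  // alt8: a `{` that doesn't start an InvalidBracedQuantifier is literal.
  loneBrace: /^{$/.test("{"),
  unterminatedBrace: /^x{2$/.test("x{2"),
  // alt7: a free-standing `{2}` is an InvalidBracedQuantifier, hence an error.
  bracedRejected: !compiles("{2}"),
};

console.log(extendedAtomChecks);
```

All entries should be `true`. The `loneBrace` and `bracedRejected` cases together show why alt8 needs to know whether the lookahead matches InvalidBracedQuantifier.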


QuantifiableAssertion[N] ::
  `(` `?` `=` Disjunction[~U, ?N] `)`
  `(` `?` `!` Disjunction[~U, ?N] `)`

Alternatives are disjoint, so no ambiguities.


Assertion[U, N] ::
  `^`
  `$`
  `\` `b`
  `\` `B`
  [+U] `(` `?` `=` Disjunction[+U, ?N] `)`
  [+U] `(` `?` `!` Disjunction[+U, ?N] `)`
  [~U] QuantifiableAssertion[?N]
  `(` `?` `<=` Disjunction[?U, ?N] `)`
  `(` `?` `<!` Disjunction[?U, ?N] `)`

forms:

  • alt1: ^
  • alt2: $
  • alt3: \b
  • alt4: \B
  • alt5: ignore
  • alt6: ignore
  • alt7: (?[=!] ...
  • alt8: (?<= ...
  • alt9: (?<! ...

Alternatives are disjoint, so no ambiguities.


Term[U, N] ::
  [+U] Assertion[+U, ?N]
  [+U] Atom[+U, ?N] Quantifier
  [+U] Atom[+U, ?N]
  [~U] QuantifiableAssertion[?N] Quantifier
  [~U] Assertion[~U, ?N]
  [~U] ExtendedAtom[?N] Quantifier
  [~U] ExtendedAtom[?N]

forms:

  • alt1: ignore
  • alt2: ignore
  • alt3: ignore
  • alt4: (?[=!] ...
  • alt5: ^ | $ | \[bB] | (?[=!] ... | (?<[=!] ...
  • alt6: (any char except ^ $ * + ? ) |) ...
  • alt7: (any char except ^ $ * + ? ) |) ...

Consider alt6 and alt7:

  [~U] ExtendedAtom[?N] Quantifier
  [~U] ExtendedAtom[?N]

There's clearly a shift-reduce conflict after ExtendedAtom, but is there an ambiguity? If the text matching Quantifier is (or starts with) [*+?], there's no ambiguity, because (in this context) that can only be a Quantifier. But if the text matching Quantifier starts with '{', then formally there's an ambiguity, e.g. x{2} can be parsed as

Alternative --- Term -+- ExtendedAtom - x
                      |
                      +- Quantifier - { 2 }

or

Alternative -+- Term --- ExtendedAtom - x
             |
             +- Term --- ExtendedAtom --- InvalidBracedQuantifier - { 2 }

But any occurrence of InvalidBracedQuantifier is defined (by Early Error rule) to be a Syntax Error, so I'm guessing that this doesn't count as an ambiguity as far as the spec is concerned. Therefore, we don't need to disambiguate between alt6 and alt7.

Similarly, alt4 conflicts with alt5 on QuantifiableAssertion followed by Quantifier (or not), but I'm assuming the spec doesn't consider it an ambiguity.

It looks like alt4 might conflict with alt6/alt7, but no, ExtendedAtom can't start with (?[=!].
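The observable effect of these Term alternatives can be checked with a JavaScript sketch like the one below. It assumes an Annex B engine with lookbehind support; `compiles` and `termChecks` are hypothetical names.

```javascript
// Hypothetical helper: does this pattern source compile with the given flags?
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const termChecks = {
  // alt6: `x{2}` is ExtendedAtom followed by Quantifier ...
  quantified: /^x{2}$/.test("xx"),
  // ... and is not `x` followed by the literal text `{2}`.
  notLiteral: /^x{2}$/.test("x{2}") === false,
  // The other parse would need an InvalidBracedQuantifier, which is always
  // an error; a braced quantifier with nothing to quantify shows that.
  doubledQuantifierRejected: !compiles("x{2}{3}"),
  // alt4: a lookahead assertion may take a quantifier under [~U] ...
  quantifiedLookahead: compiles("(?=a)*"),
  // ... but not under [+U], and lookbehind is never quantifiable.
  rejectedUnderU: !compiles("(?=a)*", "u"),
  lookbehindRejected: !compiles("(?<=a)*"),
};

console.log(termChecks);
```

All entries should be `true`. None of them distinguishes between the two formal parses of `x{2}`, which fits the guess above that the spec doesn't treat this as an ambiguity.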

Lastly, alt5 conflicts with alt6/alt7 on \[bB]. Since ExtendedAtom's only RHS appearances are here in alt6+alt7, we can resolve the ambiguity by modifying ExtendedAtom's definition to exclude \[bB]. That is, we can change its alt2:

  `\` AtomEscape[~U, ?N]

to

  `\` [lookahead <! {`b`, `B`}] AtomEscape[~U, ?N]

(We could push the exclusion down into AtomEscape[~U]:

  [~U] ... [lookahead <! {`b`, `B`}] CharacterEscape[?U, ?N]

but I'm not sure that would be an improvement.)
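To illustrate why `\b` and `\B` have to be kept out of ExtendedAtom, here is a JavaScript sketch (assuming an Annex B engine; `wordBoundaryChecks` is a made-up name). Outside a class, `\b` and `\B` only ever act as assertions; they are never read as escapes that consume a character.

```javascript
const wordBoundaryChecks = {
  // alt5: `\b` asserts a word boundary (here, between "a" and the end).
  boundaryAssertion: /^a\b$/.test("a"),
  // It never consumes a literal "b" ...
  neverLetterB: /^\b$/.test("b") === false,
  // ... nor a backspace character.
  neverBackspace: /^\b$/.test("\b") === false,
  // `\B` asserts the absence of a boundary; it holds in the empty string.
  notBoundary: /^\B$/.test(""),
  neverLetterUpperB: /^\B$/.test("B") === false,
  // Only inside a class does `\b` denote backspace (ClassEscape :: `b`).
  classBackspace: /^[\b]$/.test("\b"),
};

console.log(wordBoundaryChecks);
```

All entries should be `true`, so excluding `\[bB]` from ExtendedAtom shouldn't change which strings are matched.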

@jmdyck jmdyck force-pushed the annex_B_pattern_ambig branch 3 times, most recently from a86233c to f773ee8 Compare June 28, 2021 14:05
@ljharb ljharb force-pushed the master branch 3 times, most recently from 3d0c24c to 7a79833 Compare June 29, 2021 02:21
@jmdyck
Collaborator Author

jmdyck commented Aug 17, 2021

(force-pushed to rebase to master + resolve merge conflicts from #2411)
