The change set with respect to the previous edition can be seen here. The pre-merge discussion can be seen here. Further discussion can be found in the original Google Doc.
This edition incorporates two proposals:
- « Support properties of strings (a.k.a. “sequence properties”) in Unicode property escapes »
- « RegExp v flag with set notation + properties of strings ».
The latter technically subsumes the former, but these can be seen as two separate additions:
- String properties;
- Set notations.
The biggest impact of this feature is the change in the definition of `CharSet`.
Whereas a character class was previously compiled to a set of characters (hence the name `CharSet`),
it can now be compiled to a set of strings (i.e. sequences of characters), depending on the presence of `v`.
This change could prove problematic, since the operations available on `CharSetElement` differ wildly depending on the flags.
A solution might be to abstract the logic manipulating `CharSet`s in `CompileAtom`, as we did for `Canonicalize`.
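As a rough illustration of this flag-dependent definition, the following TypeScript sketch models `CharSetElement` with a conditional type. The helper names `liftToStrings` and `onlySingleCharacters` are hypothetical (they appear in neither the specification nor the mechanization), and characters are modeled as plain strings for simplicity.

```typescript
// Hypothetical model: a character is stored as a one-character string.
type Character = string;

// Without `v`, an element is one character; with `v`, a sequence of characters.
type CharSetElement<V extends boolean> = V extends true ? Character[] : Character;
type CharSet<V extends boolean> = CharSetElement<V>[];

// Lift a pre-`v` set into the `v` representation (singleton sequences).
function liftToStrings(cs: CharSet<false>): CharSet<true> {
  return cs.map((c) => [c]);
}

// True when every element is a single character, i.e. when the set can be
// handled exactly like a pre-`v` CharSet. Also true for the empty set.
function onlySingleCharacters(cs: CharSet<true>): boolean {
  return cs.every((element) => element.length === 1);
}
```

A check such as `onlySingleCharacters` is the kind of operation that an abstraction layer over `CharSet` would have to expose, since it only makes sense in the `v` representation.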
These new features should require minimal changes. Character classes should be changed to accommodate nesting. Most of the work to implement the new operators can be delegated to operations on sets.
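To show why the new operators reduce to ordinary set operations, here is a minimal sketch in which a class is represented as an array of distinct strings. The representation and the function names are assumptions made for illustration, not the mechanization's actual definitions.

```typescript
// Hypothetical representation: a compiled class is an array of distinct strings.
type StringSet = string[];

// [AB]: union of two classes.
function union(a: StringSet, b: StringSet): StringSet {
  return a.concat(b.filter((s) => a.indexOf(s) === -1));
}

// [A&&B]: intersection of two classes.
function intersection(a: StringSet, b: StringSet): StringSet {
  return a.filter((s) => b.indexOf(s) !== -1);
}

// [A--B]: subtraction of B from A.
function subtraction(a: StringSet, b: StringSet): StringSet {
  return a.filter((s) => b.indexOf(s) === -1);
}
```

Nested classes then only require compiling each operand recursively before applying one of these functions.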
Note that the table numbers have been shifted.
The changes here will cause some irrelevant changes in the subsequent sections, as some variables will be renamed accordingly.
- Safeguard preventing string properties if `v` is not toggled;
- Negated character classes (`[^...]`, `\P{...}`) are rejected.
- New rules for special characters/escapes in `ClassSetCharacter`
New analysis: detects whether a character class might match strings, i.e. whether it might consume more than one character from the input.
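The analysis is purely syntactic, which the following simplified sketch tries to convey. The `ClassNode` type is a hypothetical, heavily reduced syntax tree (no ranges, no `\q{...}` literals), and the rules are a paraphrase of the specification's rather than a faithful transcription.

```typescript
// Hypothetical, reduced syntax tree for the contents of a character class.
type ClassNode =
  | { kind: "char" }             // a single character or a code point property
  | { kind: "stringProperty" }   // a property of strings
  | { kind: "union"; operands: ClassNode[] }
  | { kind: "intersection"; left: ClassNode; right: ClassNode }
  | { kind: "subtraction"; left: ClassNode; right: ClassNode };

// Decides from the syntax alone whether the class might match strings.
function mayContainStrings(node: ClassNode): boolean {
  switch (node.kind) {
    case "char":
      return false;
    case "stringProperty":
      return true;
    case "union":
      return node.operands.some((operand) => mayContainStrings(operand));
    case "intersection":
      return mayContainStrings(node.left) && mayContainStrings(node.right);
    case "subtraction":
      // Only the left operand is inspected; the right one is ignored.
      return mayContainStrings(node.left);
  }
}
```

Because no set is ever computed, subtracting a property of strings from itself is still reported as possibly containing strings.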
- `CaptureRange` and `MatchState` are now said to be records rather than tuples; note that they were already being used (and were mechanized) as if they were records. Note that this makes the diff look bigger than it actually is.
- `CharSetElement` is now defined "dependently" on the `v` flag as either a `Character` (as before) or a `list Character`.
Add `v` flag to record.
Modularize `Disjunction`, `Empty`, `Sequence` matchers into functions (sections 22.2.2.3.[2-4])
- `AllCharacters` is now a function
- `Atom :: CharacterClass` and `AtomEscape :: CharacterClassEscape` become considerably more complex, mostly in order to implement the longest-match semantics of string properties. The behavior being wildly different depending on the flag (which is amplified by the wide difference in the definition of `CharSetElement`) makes these functions candidates for abstraction. Additionally, the two implement the exact same logic. The differences are concentrated in the very first steps (`-` is `Atom :: CharacterClass` and `+` is `AtomEscape :: CharacterClassEscape`):

    + 1. Let cs be CompileToCharSet of CharacterClassEscape with argument rer.
    - 1. Let cc be CompileCharacterClass of CharacterClass with argument rer.
    - 2. Let cs be cc.[[CharSet]].
    + 2. If rer.[[UnicodeSets]] is false, or if every CharSetElement of cs consists of a single character (including if cs is empty), return CharacterSetMatcher(rer, cs, false, direction).
    - 3. If rer.[[UnicodeSets]] is false, or if every CharSetElement of cs consists of a single character (including if cs is empty), return CharacterSetMatcher(rer, cs, cc.[[Invert]], direction).
    - 4. Assert: cc.[[Invert]] is false.
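The longest-match semantics can be pictured with the sketch below, which lists the positions at which a set of strings could end a match, preferring longer strings first. This is only an approximation under stated assumptions: `candidateEnds` is a hypothetical helper, lengths are measured in UTF-16 code units, and matching is forward only, whereas the specification builds matchers with continuations and handles both directions.

```typescript
// Returns the end positions of every string of `cs` that occurs in `input`
// at `start`, ordered from the longest string to the shortest one.
function candidateEnds(input: string, start: number, cs: string[]): number[] {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = cs.slice().sort((a, b) => b.length - a.length);
  const ends: number[] = [];
  for (let i = 0; i < sorted.length; i++) {
    const s = sorted[i];
    if (input.slice(start, start + s.length) === s) {
      ends.push(start + s.length);
    }
  }
  return ends;
}
```

A backtracking matcher would try these positions in order, moving on to a shorter candidate only when the rest of the pattern fails to match after a longer one.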
New assertions at the beginning when `v` is set; they should hold thanks to the new early errors.
Unicode mode check changed to use the new `HasEitherUnicodeFlag` function.
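For reference, the check amounts to a disjunction over two fields of the RegExp record. The sketch below assumes a record reduced to those two fields; the field names mirror the `[[Unicode]]` and `[[UnicodeSets]]` slots, and the TypeScript names themselves are hypothetical.

```typescript
// Hypothetical, reduced RegExp record: only the two flags relevant here.
interface RegExpRecord {
  unicode: boolean;      // the `u` flag
  unicodeSets: boolean;  // the `v` flag
}

// True when the pattern is compiled in Unicode mode through either flag.
function hasEitherUnicodeFlag(rer: RegExpRecord): boolean {
  return rer.unicode || rer.unicodeSets;
}
```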
Use `CharacterComplement` with `v` flag.
- Use `CharacterComplement`.
- Use `MaybeSimpleCaseFolding`.
- Implement new set operators.
- Implement new character classes.
New function.
Unicode mode check changed to use the new `HasEitherUnicodeFlag` function.
New functions.
Add lookup into string properties table.
New function.
Not covered.
- What happens when `v` is enabled but not `u` (yes, it's valid; see this issue)?
- The fact that the `MayContainStrings` analysis is done before compiling the set forces it to be extremely pessimistic, e.g. `[\p{RGI_Emoji}--\p{RGI_Emoji}]` is analyzed as potentially containing strings, despite being empty.
- In positive lookarounds: from `Let y be r's MatchState.` to `Assert: r is a MatchState.`. Makes more sense from a spec perspective, but the previous wording was better for us.
- `AllCharacters` has a special case for `vi` flags; why? Seems like the answer might be in this discussion.