Although the main design goal of this project is to stick as closely as possible to the specification, some divergences were to be expected. We document some of them here.
The mechanization does not include a parser for regexes. One of the main reasons is that the specification provides two different grammars for regexes: the main grammar (22.2.1 Patterns) and the legacy grammar from Annex B.
The situation surrounding these two grammars is a bit foggy:
- The specification states that:

  > The ECMAScript language syntax and semantics defined in this annex are required when the ECMAScript host is a web browser. The content of this annex is normative but optional if the ECMAScript host is not a web browser.
  >
  > Note: This annex describes various legacy features and other characteristics of web browser ECMAScript hosts. All of the language features and behaviours specified in this annex have one or more undesirable characteristics and in the absence of legacy usage would be removed from this specification. However, the usage of these features by large numbers of existing web pages means that web browsers must continue to support them. The specifications in this annex define the requirements for interoperable implementations of these legacy features.
  >
  > These features are not considered part of the core ECMAScript language. Programmers should not use or assume the existence of these features and behaviours when writing new ECMAScript code. ECMAScript implementations are discouraged from implementing these features unless the implementation is part of a web browser or is required to run the same legacy ECMAScript code that web browsers encounter.
- Yet, in our experience, most engines (whether they are web browsers or not, as notably demonstrated by Node.js) accept the legacy syntax.
Hence, it is unclear which grammar should be mechanized.
The datatype representing (ASTs of) regexes in the mechanization is "flat", in the sense that all regex features are represented using that single type. This means that features can be nested without any constraints, which makes it possible to represent any regex generated by either of the two grammars, whereas these grammars sometimes prevent some nestings of features.
Example: the main grammar does not allow the regex `/(?=a)*/`, because quantifiers (such as `*`) cannot be applied to assertions (such as `(?=...)`). Note however that the regex `/(?:(?=a))*/` is allowed.
Put another way: the choice was made not to distinguish between alternatives, terms, assertions, atoms, etc.
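For illustration, a minimal sketch of such a flat datatype might look as follows (hypothetical constructor names; the actual type in the mechanization is richer):

```coq
Require Import Ascii.

(* A "flat" regex AST: every feature lives in the same type, so nothing
   prevents nesting a quantifier directly around a lookahead, unlike in
   the two grammars. *)
Inductive Regex : Type :=
  | Char (c : ascii)                   (* a character literal *)
  | Disjunction (r1 r2 : Regex)        (* r1 | r2 *)
  | Seq (r1 r2 : Regex)                (* r1 r2 *)
  | Star (greedy : bool) (r : Regex)   (* r* / r*? -- applies to any Regex *)
  | Lookahead (r : Regex).             (* (?= r ) *)

(* /(?=a)*/ is representable, even though the main grammar rejects it. *)
Definition lookahead_star : Regex :=
  Star true (Lookahead (Char "a"%char)).
```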
The specification prevents the construction of some regexes for parsing reasons.
Example: a character literal in a regex cannot be one of the `SyntaxCharacter`s; from the specification:
22.2.1 Patterns

```
Atom[UnicodeMode, N] ::
    PatternCharacter
    [...]

SyntaxCharacter :: one of
    ^ $ \ . * + ? ( ) [ ] { } |

PatternCharacter ::
    SourceCharacter but not SyntaxCharacter
```
Since these restrictions exist purely for parsing reasons, they are not enforced in the datatype used to represent regexes.
For similar reasons, the specification sometimes defines variants of a constructor.
One such example is the pair `ClassAtom` and `ClassAtomNoDash`.
Example: from the specification:

22.2.1 Patterns

```
[...]

ClassAtom[UnicodeMode] ::
    -
    ClassAtomNoDash[?UnicodeMode]

ClassAtomNoDash[UnicodeMode] ::
    SourceCharacter but not one of \ or ] or -
    \ ClassEscape[?UnicodeMode]

[...]
```
These are typically collapsed into one single constructor in the mechanization.
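A hedged sketch of what such a collapse could look like, using hypothetical names (the actual definitions in the mechanization differ):

```coq
Require Import Ascii.

Inductive ClassEscape : Type :=
  | Backspace                      (* \b inside a class *)
  | CharacterEsc (c : ascii).      (* other single-character escapes *)

(* ClassAtom and ClassAtomNoDash collapsed into one type; the grammar's
   "but not one of \ or ] or -" side condition is not enforced here. *)
Inductive ClassAtom : Type :=
  | SourceCharacter (c : ascii)    (* covers '-' and ClassAtomNoDash alike *)
  | ClassEsc (e : ClassEscape).    (* \ ClassEscape *)
```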
The specification relies on operations which could technically fail. The most obvious examples are:

- assertions, e.g. `Assert: 0 ≤ index ≤ the number of elements in Input.`;
- list accesses, e.g. `Let ch be the character Input[index].`.
Assertions are assumed to hold, as stated by the specification itself:

> A step that begins with “Assert:” asserts an invariant condition of its algorithm. Such assertions are used to make explicit algorithmic invariants that would otherwise be implicit. Such assertions add no additional semantic requirements and hence need not be checked by an implementation. They are used simply to clarify algorithms.
List accesses (and other fallible operations) can be interpreted in the same way, i.e. their usage in the specification is an assertion that they do succeed.
Rather than positing their success, the mechanization accounts for potential failure by using the error monad. Proofs are then conducted to show formally that these assumptions are indeed correct.
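A minimal sketch of this error-monad approach, with hypothetical names (the actual `Result` type and its combinators in the mechanization differ):

```coq
(* An error monad: failure is reflected in the type rather than assumed
   away. *)
Inductive Result (A : Type) : Type :=
  | Success (a : A)
  | AssertionFailed.
Arguments Success {A} a.
Arguments AssertionFailed {A}.

(* A fallible list access, replacing the spec's "Let ch be Input[index]". *)
Fixpoint nth_elem {A : Type} (l : list A) (i : nat) : Result A :=
  match l, i with
  | nil, _ => AssertionFailed
  | cons x _, O => Success x
  | cons _ tl, S i' => nth_elem tl i'
  end.
```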
The specification contains a few assertions which ensure that numbers can be safely represented in JavaScript.
Some examples:

- `It is a Syntax Error if CountLeftCapturingParensWithin(Pattern) ≥ 2^32 - 1.`
- `Assert: n < 2^32 - 1.`
These checks are currently not mechanized, as unbounded integers are used in Coq, OCaml and JavaScript.
Some functions in the specification, `RepeatMatcher` being the most infamous one, are generally recursive, i.e. their recursion is not structural.
Such definitions would be rejected by Coq, as it cannot syntactically ensure their termination.
We modified such definitions into fuel-based functions.
This transformation allows Coq to ensure their termination,
since the function is then structurally recursive on the fuel (which is represented using a `nat`).
Intuitively, the equivalence of the two functions (before and after transformation) relies on being able to compute a sufficient amount of fuel prior to the first call, and on the fact that such a fuel amount exists to begin with.
In the case of `RepeatMatcher`, the function is always called with `min + length input + 1` as initial fuel, and the matcher invariant proves that this amount is sufficient.
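As a toy illustration of the transformation (hypothetical names; `RepeatMatcher` itself is more involved):

```coq
(* Fuel-based version of a toy loop. Coq accepts the definition because
   the recursion is structural on the fuel argument. *)
Fixpoint countdown (fuel : nat) (n : nat) : option nat :=
  match fuel with
  | O => None          (* out of fuel: ruled out by choosing enough fuel *)
  | S fuel' =>
      match n with
      | O => Some O
      | S n' => countdown fuel' n'
      end
  end.

(* Calling with fuel = n + 1 always suffices for this toy example,
   mirroring RepeatMatcher's initial fuel of min + length input + 1. *)
Definition run (n : nat) : option nat := countdown (n + 1) n.
```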
Most other examples are frontend functions (see `spec/Frontend.v`).
An experimental generic combinator for fuel-based recursion was used to implement some of the frontend functions (see `utils/Return.v`); one might re-implement all fuel-based functions using that combinator.
Note 1: `Return` was designed to ease the implementation of functions which

- are recursive on a fuel argument;
- use imperative `return` statements to preemptively return from the function.

Functions which do not use preemptive returns might benefit from a simpler version which only implements the first functionality.
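A purely illustrative sketch of such a combinator, which is not the actual `utils/Return.v` API:

```coq
(* A fuel-based iterator whose body can either continue with an updated
   state or preemptively return a final result. *)
Inductive Step (A B : Type) : Type :=
  | Continue (a : A)   (* keep iterating with an updated state *)
  | Return (b : B).    (* emulate an imperative early return *)
Arguments Continue {A B} a.
Arguments Return {A B} b.

Fixpoint iterate {A B : Type}
    (body : A -> Step A B) (fuel : nat) (init : A) : option B :=
  match fuel with
  | O => None
  | S fuel' =>
      match body init with
      | Return b => Some b
      | Continue a => iterate body fuel' a
      end
  end.
```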
Note 2: One exception is `compileToCharSet_ClassAtom`, which is instead implemented using a two-layered definition (see its definition in `spec/Semantics.v`). If further experiments with `Return` are conclusive, one might want to change it to use a fuel-based solution as well.
The mechanization leaves some operations abstract. All of these operations are bundled in the module parameter of a functor, which provides the missing types and operations. The following sub-sections explain what is left abstract, and the motivation for leaving it abstract.
Most operations manipulating characters (e.g. `numericValue`, `toUpperCase`, `simpleCaseFold`, ...) are left abstract.
This is motivated by the two following observations:

- these operations are typically not defined in the specification, which instead relies on the Unicode Standard to define them;
- the specification does not make any assumption about their behavior (other than their safety and termination).
As such, the mechanization of these functions was considered out of the scope of this work.
The specification technically defines two different engines (three since the 2024 edition, which added the Unicode-sets mode of the `v` flag): one "regular" engine, and a Unicode one.
In the specification, Unicode mode is implemented by delegating some low-level operations (e.g. character canonicalization, string decoding) to mode-specific functions, using a dynamic dispatch of the shape

```
if flags.unicode then do_unicode () else do_utf ()
```
These operations overlap with the operations we would typically leave abstract, so the two modes are instead implemented as two different instantiations of the aforementioned functor; the dispatch above then simply becomes

```
Parameters.do ()
```
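A hedged sketch of this functor-based setup, with hypothetical module names:

```coq
(* The abstract operations, bundled as a module type; the actual module
   type in the mechanization is much larger. *)
Module Type EngineParameters.
  Parameter Character : Type.
  Parameter canonicalize : Character -> Character.
End EngineParameters.

Module Engine (P : EngineParameters).
  (* The matching semantics, defined once in terms of P; calls such as
     P.canonicalize need no mode dispatch. *)
End Engine.

(* Given two implementations of EngineParameters, one per mode:
     Module Utf16Engine := Engine (Utf16Parameters).
     Module UnicodeEngine := Engine (UnicodeParameters). *)
```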
Example: from the specification:
22.2.2.7.3 Canonicalize ( rer, ch )
The abstract operation Canonicalize takes arguments rer (a RegExp Record) and ch (a character) and returns a character. It performs the following steps when called:
1. If rer.[[Unicode]] is true and rer.[[IgnoreCase]] is true, then
   a. If the file CaseFolding.txt of the Unicode Character Database provides a simple or common case folding mapping for ch, return the result of applying that mapping to ch.
   b. Return ch.
2. If rer.[[IgnoreCase]] is false, return ch.
3. Assert: ch is a UTF-16 code unit.
4. Let cp be the code point whose numeric value is the numeric value of ch.
5. Let u be the result of toUppercase(« cp »), according to the Unicode Default Case Conversion algorithm.
6. Let uStr be CodePointsToString(u).
7. If the length of uStr ≠ 1, return ch.
8. Let cu be uStr's single code unit element.
9. If the numeric value of ch ≥ 128 and the numeric value of cu < 128, return ch.
10. Return cu.
This is abstracted into a `canonicalize` parameter, which implements either the `if`-branch of step 1, or steps 2 through 10.
Some of the functions which were left abstract (see previous sub-section) themselves rely on other functions from the specification.
Some of these other functions were mechanized (`CodePointAt`, `AdvanceStringIndex`, `GetStringIndex`), and are extracted so that OCaml implementations can use them to implement the aforementioned abstract functions.
The axiomatization of the functions left abstract is minimal, in the sense that it includes only those axioms which are necessary to ensure the safety and termination of the specification: these functions should most likely satisfy other properties to be considered correct.
For instance, `canonicalize` has no applicable axiom, which would allow for characters such that `canonicalize c1 = c2 /\ canonicalize c2 = c1` (with `c1 <> c2`), which would yield some surprising behaviors. In fact, `canonicalize` should most likely be idempotent: `forall c, canonicalize c = canonicalize (canonicalize c)`.
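A minimal sketch of how such a law could be stated (illustrative only; `Character` and `canonicalize` stand for the abstract parameters):

```coq
Section CanonicalizeLaws.
  Variable Character : Type.
  Variable canonicalize : Character -> Character.

  (* The idempotence law canonicalize should most likely satisfy;
     currently not part of the axiomatization. *)
  Definition canonicalize_idempotent : Prop :=
    forall c, canonicalize c = canonicalize (canonicalize c).
End CanonicalizeLaws.
```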
The specification uses mutable arrays in one specific instance: to store the captures (indices), as part of `MatchState`. Every time a group is captured, this array is updated in place.
One could mechanize this by augmenting the `Result` monad to also emulate a notion of mutable state.
This would however be cumbersome, and the way arrays are mutated allows for a simpler solution: arrays are always copied right before being mutated, and the (local) copy is the array being mutated.
This means that, rather than using a monad-based solution, the mechanization uses immutable arrays: a new array shadowing the previous one is created at each mutation, which correctly emulates the specification since the mutated array is local.
Example: from the specification:
22.2.2.3.1 RepeatMatcher ( m, min, max, greedy, x, c, parenIndex, parenCount )
[...]
3. Let cap be a copy of x.[[Captures]].
4. For each integer k in the inclusive interval from parenIndex + 1 to parenIndex + parenCount, set cap[k] to undefined.
[...]
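A minimal sketch of this shadowing pattern, using a hypothetical functional update on lists:

```coq
(* Functional "mutation" of a captures list: each update returns a fresh
   list which shadows the previous one, faithfully emulating the spec's
   copy-then-mutate-in-place pattern. *)
Fixpoint set_nth {A : Type} (l : list A) (i : nat) (v : A) : list A :=
  match l, i with
  | nil, _ => nil
  | cons _ tl, O => cons v tl
  | cons x tl, S i' => cons x (set_nth tl i' v)
  end.
```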
Some frontend functions (e.g. `exec`, `test`, `matchAll`, ...) are mechanized in `spec/Frontend.v`.
Unlike the core matching algorithm, which is described using pseudocode that is completely independent of JavaScript, the description of these functions relies more heavily on JavaScript features and semantics.
The mechanization hence does not follow the pseudocode as closely, and some operations (e.g. casting the untyped inputs into the correct type) are not mechanized, since they do not really make sense in Gallina, and modelling them was out of scope.
To see the kind of operations which are omitted, take a look at the wrapper for Test262, which has to implement these missing functionalities.
A potentially interesting direction would be to integrate these functions into a mechanization of JavaScript lacking regex support, and to call this mechanization from there.