Skip to content

Latest commit

 

History

History
243 lines (181 loc) · 13.3 KB

Differences.md

File metadata and controls

243 lines (181 loc) · 13.3 KB

Differences with the ECMAScript specification, omissions, and other implementation choices

Despite the main design choices of this project being to stick as closely as possible to the spec, some divergences were to be expected. We document some of them here.

Parsing and representation of regexes

Parsing

The mechanization does not include a parser for regexes. One of the main reasons being that the specification provides two different grammars for regexes:

  • The main one is defined in section 22.2.1;
  • The "legacy" one is defined in section B.1.2.

The situation surrounding these two grammars is a bit foggy:

  • The specification states that

    The ECMAScript language syntax and semantics defined in this annex are required when the ECMAScript host is a web browser. The content of this annex is normative but optional if the ECMAScript host is not a web browser.

    Note This annex describes various legacy features and other characteristics of web browser ECMAScript hosts. All of the language features and behaviours specified in this annex have one or more undesirable characteristics and in the absence of legacy usage would be removed from this specification. However, the usage of these features by large numbers of existing web pages means that web browsers must continue to support them. The specifications in this annex define the requirements for interoperable implementations of these legacy features.

    These features are not considered part of the core ECMAScript language. Programmers should not use or assume the existence of these features and behaviours when writing new ECMAScript code. ECMAScript implementations are discouraged from implementing these features unless the implementation is part of a web browser or is required to run the same legacy ECMAScript code that web browsers encounter.

  • Yet, from our experience, most engines (whether they are web browsers or not, as notably demonstrated by Node.js) accept the legacy syntax.

Hence, it is unclear which grammar should be mechanized.

The Regex type

The datatype representing (ASTs of) regexes in the mechanization is "flat", in the sense that all regexes features are represented using that single type. This means that features can be nested without any constraints, which should allow to represent any regex generated by any of the two grammars, whereas these grammars sometimes prevent some nesting of features.

Example: the main grammar does not allow the following regex

/(?=a)*/

because quantifiers (such as *) cannot be applied to assertions (such as (?=...)). Note however that the following regex is allowed

/(?:(?=a))*/

Put another way: the choice was made not to distinguish between alternatives, terms, assertions, atoms, etc.

Omission of certain syntactic restrictions

The specification prevents the construction of some regexes for parsing reasons.

Example: a character literal in a regex cannot be one of the "SyntaxCharacter"; from the spec

22.2.1 Patterns

Atom[UnicodeMode, N] ::
  PatternCharacter
  [...]
SyntaxCharacter :: one of
  ^ $ \ . * + ? ( ) [ ] { } |
PatternCharacter ::
  SourceCharacter but not SyntaxCharacter

Since these restrictions are there purely for parsing reasons, these are not enforced in the datatype used to represent regexes.

For similar reasons, the specification sometimes defines variants of a constructor. Such an example are ClassAtom and ClassAtomNoDash.

Example: from the specification

22.2.1 Patterns

[...]
ClassAtom[UnicodeMode] ::
  -
  ClassAtomNoDash[?UnicodeMode]

ClassAtomNoDash[UnicodeMode] ::
  SourceCharacter but not one of \ or ] or -
  \ ClassEscape[?UnicodeMode]
[...]

These are typically collapsed into one single constructor in the mechanization.

Potential failures

The specification relies on operations which could technically fail. The most obvious examples being

  1. Assertions, e.g. Assert: 0 ≤ index ≤ the number of elements in Input.;
  2. List accesses, e.g. Let ch be the character Input[index]..

Assertions are assumed to hold, as stated by the specification itself

A step that begins with “Assert:” asserts an invariant condition of its algorithm. Such assertions are used to make explicit algorithmic invariants that would otherwise be implicit. Such assertions add no additional semantic requirements and hence need not be checked by an implementation. They are used simply to clarify algorithms.

List accesses (and other fallible operations) can be interpreted in the same way, i.e. their usage in the specification is an assertion that they do succeed.

Rather than positing their success, the mechanization accounts for potential failure by using the error monad. Proofs are then conducted to show formally that these assumptions are indeed correct.

"Numeric checks"

The specification contains a few assertions which ensure that numbers can be safely represented in JavaScript.

Some examples:

It is a Syntax Error if CountLeftCapturingParensWithin(Pattern) ≥ 2^32 - 1.

Assert: n < 2^32 - 1.

These checks are currently not mechanized, as unbound integers are used in Coq, OCaml and JavaScript.

Non-structurally recursive functions

Some functions in the specification, RepeatMatcher being the the most infamous one, are generally recursive. Such definitions would be rejected by Coq, as it cannot syntactically ensure their termination.

We modified such definitions into fuel-based functions. This transformation allows Coq to ensure their termination, since the function is then structurally recursive on the fuel (which is represented using a nat).

Intuitively, the equivalence of the two functions (before and after transformation) relies on being able to compute a sufficient amount of fuel prior to the first call, and on the fact that such a fuel amount exists to begin with.

In the case of RepeatMatcher, the function is always called with min + length input + 1 as initial fuel, and the matcher invariant proves that this amount is sufficient.

Most other examples are frontend functions (see spec/Frontend.v).

An experimental generic combinator for fuel-based recursion was used to implement some of the frontend functions (see utils/Return.v); one might re-implement all fuel-based functions using that combinator.

Note 1 Return was designed to ease implementation of functions which

  1. Are recursive on a fuel argument;
  2. Use imperative return statements to preemptively return from the function.

Functions which do not use preemptive returns might benefit from an simpler version which only implements the first functionality.

Note 2 One exception is compileToCharSet_ClassAtom, which is instead implemented using a two-layered definition (see its definition in spec/Semantics.v). If further experiments with Return are conclusive, one might want to change it to use a fuel-based solution as well.

Abstraction layer

The mechanization leaves some operations abstract. All of these operations are bundled with a functor, whose module parameter provides the missing types and operations. The following sub-sections will explain what is left abstract, and the motivation for leaving these abstract.

Character manipulation

Most operations manipulating characters (e.g. numericValue, toUpperCase, simpleCaseFold, ...) are left abstract. This is motivated by the two following observations:

  1. These are typically not defined in the specification: the specification instead relies on the Unicode Standard doing so;
  2. The specification does not make any assumption about their behavior (other than their safety and termination).

As such, the mechanization of these functions was considered out of the scope of this work.

Engine modes

The specification technically defines 2 different engines (3 since the 2024 edition): one "regular" engine, and a unicode one.

In the specification, unicode mode is implemented by delegating some low-level operations (e.g. character canonicalization, string decoding) to different functions, one for each mode, e.g.

if flags.unicode then do_unicode () else do_utf ()

These operations overlap with the operations we would typically leave abstract, so the two modes are instead implemented as two different instantiations of the aforementioned functor:

Parameters.do ()

Example: from the specification:

22.2.2.7.3 Canonicalize ( rer, ch )

The abstract operation Canonicalize takes arguments rer (a RegExp Record) and ch (a character) and returns a character. It performs the following steps when called:

  1. If rer.[[Unicode]] is true and rer.[[IgnoreCase]] is true, then
    1. If the file CaseFolding.txt of the Unicode Character Database provides a simple or common case folding mapping for ch, return the result of applying that mapping to ch.
    2. Return ch.
  2. If rer.[[IgnoreCase]] is false, return ch.
  3. Assert: ch is a UTF-16 code unit.
  4. Let cp be the code point whose numeric value is the numeric value of ch.
  5. Let u be the result of toUppercase(« cp »), according to the Unicode Default Case Conversion algorithm.
  6. Let uStr be CodePointsToString(u).
  7. If the length of uStr ≠ 1, return ch.
  8. Let cu be uStr's single code unit element.
  9. If the numeric value of ch ≥ 128 and the numeric value of cu < 128, return ch.
  10. Return cu.

This is abstracted into a canonicalize parameter, which either implements the if-branch of step 1, or steps 2 through 10.

"Fragments" from the specification

Some of the functions which were left abstract (see previous sub-section) themselves rely on other functions from the specification. Some of these other functions were mechanized (CodePointAt, AdvanceStringIndex, GetStringIndex), and are extracted so that OCaml implementations can use them to implement the aforementioned abstract functions.

Axioms

The axiomatization of functions left abstract is minimal, in the sense that only axioms which are necessary to ensure the safety and termination of the specification: these functions should most likely satisfy other properties to be considered correct.

For instance, canonicalize has no applicable axiom, which would allow for characters such that

canonicalize c1 = c2 /\ canonicalize c2 = c1

which would yield some surprising behaviors. In fact, canonicalize should most likely be idempotent:

forall c, canonicalize c = canonicalize (canonicalize c)

Extraction

Mutable arrays

The specification uses mutable arrays in one specific instance: to store the captured (indices), as part of MatchState. Every time a group is captured, this array is updated in place. One could mechanize this by augmenting the Result monad to also emulate a notion of mutable state.

This could however be cumbersome, and the way arrays are mutated allow for a simpler solution: arrays are always copied right before being mutated, and the (local) copy is the array being mutated.

This means that, rather than using a monad-based solution, the mechanization uses immutable arrays, and a new array shadowing the previous one is generated at each mutation, which correctly emulates the specification since the mutated array is local.

Example: from the specification:

22.2.2.3.1 RepeatMatcher ( m, min, max, greedy, x, c, parenIndex, parenCount )

[...] 3. Let cap be a copy of x.[[Captures]]. 4. For each integer k in the inclusive interval from parenIndex + 1 to parenIndex + parenCount, set cap[k] to undefined. [...]

Frontend functions

Some frontend functions (e.g. exec, test, matchAll, ...) are mechanized in the spec/Frontend.v. Unlike the core matching algorithm which is described using some pseudocode which is completely independent from JavaScript, the description of these functions relies more heavily on JavaScript features and semantics.

The mechanization hence doesn't follow the pseudocode as closely, and some operations (e.g. casting the untyped inputs into the correct type) are not mechanized, since they don't really make sense in Gallina, and modelling these was out of the scope.

To see the kind of operations which are omitted, take a look at the wrapper for Test262, which has to implement these missing functionalities.

A potentially interesting direction would be to integrate these functions in a mechanization of JavaScript lacking regex support, and call this mechanization from there.