feat(stdlib): Regular Expressions #680

Merged
merged 6 commits into from
Sep 14, 2021

peblair
Member

@peblair peblair commented May 30, 2021

This pull request adds a new module to the standard library for working with regular expressions. The flavor of regular expressions added here is adapted from Racket, providing a full Perl-like regex syntax for users to work with.

Regular Expressions in Grain

// @param regexString: String - The regular expression to compile
// @returns Result<RegularExpression>
export let make = (regexString: String)

Compiles the given pattern string into a regular expression object.
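
As a rough illustration of the `Result` return type, the sketch below compiles one well-formed and one malformed pattern. It assumes the module is importable as "regex" and uses the `import` syntax Grain had around the time of this PR; the `describe` helper and both pattern strings are made up for the example.

```grain
import Regex from "regex"

// Hypothetical helper: report whether a pattern string compiles.
let describe = (pattern) => {
  match (Regex.make(pattern)) {
    Ok(_) => print("compiled: " ++ pattern),
    Err(_) => print("rejected: " ++ pattern),
  }
}

describe("ca+[at]")   // well-formed pattern
describe("(unclosed") // unbalanced parenthesis, expected to be rejected
```

Matching on `Ok` and `Err` keeps the sketch independent of the exact error type, which the signature above does not spell out.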

For an overview of the theory of regular expressions in general, readers are referred
to "Mastering Regular Expressions" by Friedl, or numerous alternative resources online.

Regular expressions are a combination of normal and special characters. A normal
character in a pattern will match a one-character string containing that character.
Moreover, if there are two regular expressions A and B, they can be concatenated
into a regular expression AB. If a string p matches A and q matches B,
then pq will match AB.

The special character sequences are as follows:

  • . - Matches any character, except for a newline in multi-line mode
  • ^ - Matches the beginning of the input, or after a newline (\n) in multi-line mode
  • $ - Matches the end of the input, or before a newline (\n) in multi-line mode
  • «re»* - Matches «re» zero or more times
  • «re»+ - Matches «re» one or more times
  • «re»? - Matches «re» zero or one times
  • «re»{«n»} - Matches «re» exactly «n» times
  • «re»{«n»,} - Matches «re» «n» or more times
  • «re»{,«m»} - Matches «re» zero to «m» times
  • «re»{«n»,«m»} - Matches «re» between «n» and «m» times
  • «re»{} - Matches «re» zero or more times
  • [«rng»] - Matches any character in «rng» (see below)
  • [^«rng»] - Matches any character not in «rng» (see below)
  • \«n» - Matches the latest match for group «n» (one-indexed)
  • \b - Matches the boundary of \w* (\w defined below, under "basic classes")
  • \B - Matches where \b does not
  • \p{«property»} - Matches any character with Unicode property «property» (see below)
  • \P{«property»} - Matches any character without Unicode property «property» (see below)
  • («re») - Matches «re», storing the result in a group
  • (?:«re») - Matches «re» without storing the result in a group
  • (?«mode»:«re») - Matches «re» with the mode settings specified by «mode», using the following syntax:
    • «mode»i - The same as «mode», but with case-insensitivity enabled (temporarily not supported until Char Unicode Data and Conversions #661 is resolved)
    • «mode»-i - The same as «mode», but with case-insensitivity disabled (the default)
    • «mode»m / «mode»-s - The same as «mode», but with multi-line mode enabled
    • «mode»-m / «mode»s - The same as «mode», but with multi-line mode disabled (the default)
    • An empty string, which will not change any mode settings
  • (?«tst»«re1»|«re2») - Will match «re1» if «tst», otherwise will match «re2». The following options are available for «tst»
    • («n») - Will be true if group «n» has a match
    • (?=«re») - Will be true if «re» matches the next sequence
    • (?!«re») - Will be true if «re» does not match the next sequence
    • (?<=«re») - Will be true if «re» matches the preceding sequence
    • (?<!«re») - Will be true if «re» does not match the preceding sequence
  • (?«tst»«re») - Equivalent to (?«tst»«re»|)
  • Finally, basic classes (defined below) can also appear outside of character ranges.
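
To make a few of these forms concrete, here is a small sketch combining anchors, a capture group, a bounded repetition, and a backreference. It assumes the module is importable as "regex" and that `Result.unwrap` is available from the "result" module; the pattern and inputs are illustrative only.

```grain
import Regex from "regex"
import Result from "result"

// One or more "a" (captured as group 1), two or three "b",
// then the same text that group 1 matched (\1), anchored at both ends.
// The backslash is doubled because the pattern lives in a Grain string literal.
let rx = Result.unwrap(Regex.make("^(a+)b{2,3}\\1$"))

print(Regex.isMatch(rx, "aabbaa")) // group 1 is "aa", and "aa" follows the "bb"
print(Regex.isMatch(rx, "aabba"))  // trailing run is shorter than group 1

// (?:...) groups without capturing, so it does not create a numbered group.
let grouped = Result.unwrap(Regex.make("^(?:ab)+$"))
print(Regex.isMatch(grouped, "ababab"))
```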

Character ranges (referred to as «rng» above) have the following syntax:

  • «c» - Matches the character «c» exactly
  • «c1»-«c2» - Matches any character with a character code between the character code for «c1» and the code for «c2»

These forms can be repeated any number of times, which will construct a range of their union. That is, [ba-c] and [a-c] are equivalent ranges.
Additionally, there are the following special cases:

  • A ] as the first character of the range will match a ]
  • A - as the first or last character of the range will match a -
  • A ^ in any position other than the first position will match a ^
  • \«c», where «c» is a non-alphabetic character, will match «c»

Furthermore, ranges can include character classes, which are predefined commonly-used
sets of characters. There are two "flavors" of these: basic classes and POSIX classes.
Both are provided for ease of use and to maximize compatibility with other regular
expression engines, so feel free to use whichever is most convenient.

The basic classes are as follows:

  • \d - Matches 0-9
  • \D - Matches characters not in \d
  • \w - Matches a-z, A-Z, 0-9, and _
  • \W - Matches characters not in \w
  • \s - Matches space, tab, formfeed, and return
  • \S - Matches characters not in \s

The POSIX classes are as follows:

  • [:alpha:] - Matches a-z and A-Z
  • [:upper:] - Matches A-Z
  • [:lower:] - Matches a-z
  • [:digit:] - Matches 0-9
  • [:xdigit:] - Matches 0-9, a-f, and A-F
  • [:alnum:] - Matches a-z, A-Z, and 0-9
  • [:word:] - Matches a-z, A-Z, 0-9, and _
  • [:blank:] - Matches space and tab
  • [:space:] - Matches space, tab, newline, formfeed, and return
  • [:graph:] - Matches all ASCII characters which use ink when printed
  • [:print:] - Matches space, tab, and all ASCII ink users
  • [:cntrl:] - Contains all characters with code points < 32
  • [:ascii:] - Contains all ASCII characters
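
The following sketch shows a POSIX class used inside a range next to the equivalent explicit range, plus the `[ba-c]` / `[a-c]` equivalence mentioned above. It assumes the "regex" and "result" modules are importable as shown; the inputs are made up.

```grain
import Regex from "regex"
import Result from "result"

// The same "hex digits only" check, written with a POSIX class
// and with explicit character ranges.
let posix = Result.unwrap(Regex.make("^[[:xdigit:]]+$"))
let manual = Result.unwrap(Regex.make("^[0-9a-fA-F]+$"))

print(Regex.isMatch(posix, "CAFE42"))
print(Regex.isMatch(manual, "CAFE42"))
print(Regex.isMatch(posix, "xyz"))

// [ba-c] and [a-c] describe the same set of characters.
let r1 = Result.unwrap(Regex.make("^[ba-c]+$"))
let r2 = Result.unwrap(Regex.make("^[a-c]+$"))
print(Regex.isMatch(r1, "cab") == Regex.isMatch(r2, "cab"))
```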

Finally, the following is the list of supported Unicode properties. Note that
until #661 is resolved, matches using \p and \P are disabled,
as Grain is not currently able to determine membership in these character classes.
These class codes come from this portion of the Unicode standard:
https://www.unicode.org/reports/tr44/#General_Category_Values

  • Ll - Letter, lowercase
  • Lu - Letter, uppercase
  • Lt - Letter, titlecase
  • Lm - Letter, modifier
  • L& - Union of Ll, Lu, Lt, and Lm
  • Lo - Letter, other
  • L - Union of L& and Lo
  • Nd - Number, decimal digit
  • Nl - Number, letter
  • No - Number, other
  • N - Union of Nd, Nl, and No
  • Ps - Punctuation, open
  • Pe - Punctuation, close
  • Pi - Punctuation, initial quote
  • Pf - Punctuation, final quote
  • Pc - Punctuation, connector
  • Pd - Punctuation, dash
  • Po - Punctuation, other
  • P - Union of Ps, Pe, Pi, Pf, Pc, Pd, and Po
  • Mn - Mark, non-spacing
  • Mc - Mark, spacing combining
  • Me - Mark, enclosing
  • M - Union of Mn, Mc, and Me
  • Sc - Symbol, currency
  • Sk - Symbol, modifier
  • Sm - Symbol, math
  • So - Symbol, other
  • S - Union of Sc, Sk, Sm, and So
  • Zl - Separator, line
  • Zp - Separator, paragraph
  • Zs - Separator, space
  • Z - Union of Zl, Zp, and Zs
  • Cc - Other, control
  • Cf - Other, format
  • Cs - Other, surrogate
  • Cn - Other, not assigned
  • Co - Other, private use
  • C - Union of Cc, Cf, Cs, Cn, and Co
  • . - Union of all Unicode categories

Public API (for working with regular expressions)

/*
 The user-facing object which contains the results
 of a regular expression match.
 */
export record MatchResult {
  // Returns the contents of the given match group, raising
  // an exception if that group was not matched.
  group: Number -> String,
  // Returns the contents of the given group
  groupOptional: Number -> Option<String>,
  // Returns the position of the given match group, raising
  // an exception if that group was not matched.
  groupPosition: Number -> (Number, Number),
  // Returns the position of the given group
  groupPositionOptional: Number -> Option<(Number, Number)>,
  // Returns the number of defined groups in this match object (includes group 0)
  numGroups: Number,
  // Returns the contents of all groups matched in this match object
  allGroups: () -> Array<Option<String>>,
  // Returns the positions of all groups matched in this match object
  allGroupPositions: () -> Array<Option<(Number, Number)>>,
}

// Returns true if the given regular expression has a match in the given string
// @param rx: RegularExpression - The regular expression to search for
// @param string: String - The string to search within
// @returns Bool
export let isMatch = (rx: RegularExpression, string: String)

// Returns true if the given regular expression has a match in the given string within the given start/end offsets
// @param rx: RegularExpression - The regular expression to search for
// @param string: String - The string to search within
// @param start: Number - The start offset to search within
// @param end: Number - The end offset to search within
// @returns Bool
export let isMatchRange = (rx: RegularExpression, string: String, start: Number, end: Number)

// Returns the first match for the given regular expression contained within the given string
// @param rx: RegularExpression - The regular expression to search for
// @param string: String - The string to search within
// @returns Option<MatchResult>
export let find = (rx: RegularExpression, string: String)

// Returns the first match for the given regular expression contained within the given string,
// within the given start/end range.
// @param rx: RegularExpression - The regular expression to search for
// @param string: String - The string to search within
// @param start: Number - The start offset to search within
// @param end: Number - The end offset to search within
// @returns Option<MatchResult>
export let findRange = (rx: RegularExpression, string: String, start: Number, end: Number)

// Returns all matches for the given regular expression contained within the given string
// @param rx: RegularExpression - The regular expression to search for
// @param string: String - The string to search within
// @returns List<MatchResult>
export let findAll = (rx: RegularExpression, string: String)

// Returns all matches for the given regular expression contained within the given string,
// within the given start/end range.
// @param rx: RegularExpression - The regular expression to search for
// @param string: String - The string to search within
// @param start: Number - The start offset to search within
// @param end: Number - The end offset to search within
// @returns List<MatchResult>
export let findAllRange = (rx: RegularExpression, string: String, start: Number, end: Number)

// Replaces the first match for the given regular expression contained within the given string with the given replacement.
// Replacement strings support the following syntax:
// - `&` - Replaced with the text of the matching portion of input (e.g. for `(foo)`, the search string `foo bar`, and the replacement `baz &`, the result will be `baz foo bar`)
// - `\n` / `\nn` (where `n` is a digit) - Replaced with the text of group `nn`
// - `\$` - Does nothing (this exists to support replacement strings such as `\4\$0`, which will place the contents of group 4 prior to a zero)
// - `\&` / `\\` - Literal `&` and `\`, respectively
// - Any other character will be placed as-is in the replaced output.
//
// @param rx: RegularExpression - The regular expression to search for
// @param toSearch: String - The string to search within
// @param replacement: String - The string to replace matches with
// @returns String
export let replace = (rx: RegularExpression, toSearch: String, replacement: String)

// Replaces all matches for the given regular expression contained within the given string with the given replacement.
// See `replace` for replacement string syntax.
//
// @param rx: RegularExpression - The regular expression to search for
// @param toSearch: String - The string to search within
// @param replacement: String - The string to replace matches with
// @returns String
export let replaceAll = (rx: RegularExpression, toSearch: String, replacement: String)
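
A minimal usage sketch of the search functions listed above, assuming the module is importable as "regex" and that `Result.unwrap` and `List.length` come from the "result" and "list" modules. It only touches `numGroups`, since the other `MatchResult` field names were still under discussion in this thread.

```grain
import Regex from "regex"
import Result from "result"
import List from "list"

let digits = Result.unwrap(Regex.make("\\d+"))
let input = "a1 b22 c333"

// isMatch: is there a match anywhere in the string?
print(Regex.isMatch(digits, input))

// find: the first match only, wrapped in an Option.
match (Regex.find(digits, input)) {
  // The pattern has no capture groups, so only group 0 is counted.
  Some(m) => print(m.numGroups),
  None => print("no match"),
}

// findAll: every match, returned as a List.
print(List.length(Regex.findAll(digits, input)))
```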

Other Changes

(crossed-out changes have been delegated to linked separate PRs)

Closes #247.

@peblair peblair added the stdlib label May 30, 2021
@peblair peblair requested a review from a team May 30, 2021 19:56
@ospencer
Member

ospencer commented Jun 1, 2021

Awesome work, and as you can imagine I love this API :D

Haven't dove super deep into the code yet, but here's how I want to rename stuff (largely the names don't need regex in them because they're in the regex lib, but some other thoughts too):

export record MatchResult {
  group: Number -> String,
  groupOpt: Number -> Option<String>,
  groupPosition: Number -> (Number, Number),
  groupPositionOpt: Number -> Option<(Number, Number)>,
  numGroups: Number,
  allGroups: () -> Array<Option<String>>, // <-- also why is this Array<Option<String>> instead of Array<String>?
  allGroupPositions: () -> Array<Option<(Number, Number)>>, // Same question here
}

export let isMatch = (rx: RegularExpression, string: String) // `test` is also a good name here
export let isMatchRange = (rx: RegularExpression, string: String, start: Number, end: Number) // or `testRange`
export let match = (rx: RegularExpression, string: String)
export let matchRange = (rx: RegularExpression, string: String, start: Number, end: Number)
export let matchAll = (rx: RegularExpression, string: String)
export let matchAllRange = (rx: RegularExpression, string: String, start: Number, end: Number)
export let replace = (rx: RegularExpression, toSearch: String, replacement: String)
export let replaceAll = (rx: RegularExpression, toSearch: String, replacement: String)

And then the last thing would be renaming makeRegex -> make or something like compile 👍

@peblair
Member Author

peblair commented Jun 1, 2021

After discussing with @ospencer (we can't use match as a function name, unfortunately), I've pushed this API:

export record MatchResult {
  group: Number -> String,
  groupOpt: Number -> Option<String>,
  groupPosition: Number -> (Number, Number),
  groupPositionOpt: Number -> Option<(Number, Number)>,
  numGroups: Number,
  allGroups: () -> Array<Option<String>>, // <-- also why is this Array<Option<String>> instead of Array<String>?
  allGroupPositions: () -> Array<Option<(Number, Number)>>, // Same question here
}

export let isMatch = (rx: RegularExpression, string: String)
export let isMatchRange = (rx: RegularExpression, string: String, start: Number, end: Number)
export let find = (rx: RegularExpression, string: String)
export let findRange = (rx: RegularExpression, string: String, start: Number, end: Number)
export let findAll = (rx: RegularExpression, string: String)
export let findAllRange = (rx: RegularExpression, string: String, start: Number, end: Number)
export let replace = (rx: RegularExpression, toSearch: String, replacement: String)
export let replaceAll = (rx: RegularExpression, toSearch: String, replacement: String)

I will update the main post to reflect these changes.

@marcusroberts
Member

Wow! Great work here!

@ospencer ospencer changed the title feat(stdlib)!: Regular Expressions feat(stdlib): Regular Expressions Jun 1, 2021
@phated
Member

phated commented Jun 20, 2021

So, I haven't gotten to look through the implementation yet, but the first thought I had while looking over the tests was "what is the actual public API of this library?" - I appreciate that you have added that overview in the body of the PR, but as someone who usually jumps to tests to see the usage, it feels weird that our tests need to be written with helpers. I believe that indicates that the API might be too complex for ergonomic usage.

As for the public API, I believe that we shouldn't have functions in a MatchResult that throw exceptions. We can just provide functions that return Option and users can use Option.unwrap if they know for a fact that those will always be Some.

What are (Number, Number) for the groups positions? Is it group number and string position?

Why does the MatchResult return Arrays but the RegExp exported methods return Lists?

I'm also wondering why we return an Option<MatchResult>? It seems like this record could always be provided, even if nothing matched, maybe it would need to have a field that indicated the success status of the match? I'm not sure.

I'm not sure I like the replace syntax. It feels like too much of a deviation from the languages people are going to be coming from.

I'll probably have more, but I need to sign off right now.

@peblair
Member Author

peblair commented Jun 26, 2021

@phated:

As for the public API, I believe that we shouldn't have functions in a MatchResult that throw exceptions. We can just provide functions that return Option and users can use Option.unwrap if they know for a fact that those will always be Some.

Deferring to @ospencer on this one.

What are (Number, Number) for the groups positions? Is it group number and string position?

They are the position offsets (start and end offset). Capture groups are not always fixed-size, so we need to return both.
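
A small sketch of what that looks like in practice, assuming the "regex" and "result" modules are importable as shown. The pattern and input are made up, and the exact offsets printed depend on the engine's offset convention, so none are claimed here.

```grain
import Regex from "regex"
import Result from "result"

// Two capture groups of variable width.
let rx = Result.unwrap(Regex.make("(b+)(c*)"))

match (Regex.find(rx, "aabbbcc")) {
  Some(m) => {
    // Group 0 (the whole match) plus the two capture groups.
    print(m.numGroups)
    // One Option<(start, end)> entry per group.
    print(m.allGroupPositions())
  },
  None => print("no match"),
}
```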

Why does the MatchResult return Arrays but the RegExp exported methods return Lists?

This behavior was taken from Racket, and I think it makes sense. When you say "give me all of the matches in this string", that is an operation which returns zero+ matches, and users will want to simply iterate over that list. If you say "give me all of the groups in this match", an array makes more sense because (a) you know precisely the number of groups which are defined in the regex and (b) you are more likely to want to do random access on that list than iterate over it. An array is more conducive to that scenario.
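
The two access patterns can be sketched as follows. This assumes the "regex", "result", and "list" modules are importable as shown, and that index 0 of `allGroups()` holds group 0 (consistent with `numGroups` counting group 0, but not confirmed in this thread).

```grain
import Regex from "regex"
import Result from "result"
import List from "list"

let rx = Result.unwrap(Regex.make("(\\w+)=(\\d+)"))

// findAll returns a List: an unknown number of matches, consumed front to back.
let matches = Regex.findAll(rx, "a=1, bb=22")
print(List.length(matches))

match (matches) {
  [first, ..._] => {
    // allGroups returns an Array: the group count is fixed by the pattern,
    // so indexing straight into a known group is the natural access pattern.
    let groups = first.allGroups()
    print(groups[1]) // key group (assumes index 0 is the whole match)
    print(groups[2]) // value group
  },
  [] => print("no matches"),
}
```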

I'm also wondering why we return an Option<MatchResult>? It seems like this record could always be provided, even if nothing matched, maybe it would need to have a field that indicated the success status of the match? I'm not sure.

I suppose it could, but I feel like it is nicer style-wise to use Options here, since that allows users to use pattern-matching to determine if the match was successful. FWIW, this is what Haskell does, and OCaml sort of does the same thing (except they throw a Not_found exception instead of using an Option).

I'm not sure I like the replace syntax. It feels like too much of a deviation from the languages people are going to be coming from.

Can you elaborate more on this? My impression is that \n is pretty standard across many languages, while the other pieces of syntax are relatively niche enough that I am not sure what the equivalents are in other languages.


I personally feel that, while the API is a bit verbose, it leans into the Option and Result types to the appropriate extent (they exist for use cases like the ones that appear here).

@ospencer
Member

As for the public API, I believe that we shouldn't have functions in a MatchResult that throw exceptions. We can just provide functions that return Option and users can use Option.unwrap if they know for a fact that those will always be Some.

Deferring to @ospencer on this one.

In this case, I thought exceptions over options here were more ergonomic since you always knew what groups would match (unless you were compiling user-defined regexes) and it'd be annoying to have to unwrap this option. That gave us both the exception and non-exception versions. I did forget that groups can be optional, i.e. (.*(gg)?), and in this case sometimes asking for group 2 here would lead to an exception. With that in mind, I suppose it makes sense to simplify to only having the optional version.
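
The optional-group case can be sketched like this, assuming the "regex" and "result" modules are importable as shown. The pattern is a simplified stand-in for the one mentioned above, chosen so that the overall match succeeds while the capture group takes no part in it.

```grain
import Regex from "regex"
import Result from "result"

// The (b) group is optional, so the pattern can match without it.
let rx = Result.unwrap(Regex.make("a(b)?c"))

match (Regex.find(rx, "ac")) {
  Some(m) => {
    // Counts group 0 and the (b) group, whether or not (b) took part.
    print(m.numGroups)
    // The entry for (b) is expected to be None for this input,
    // which is the situation where an exception-throwing accessor would fail.
    print(m.allGroups())
  },
  None => print("no match"),
}
```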

@ospencer
Member

I'm not sure I like the replace syntax. It feels like too much of a deviation from the languages people are going to be coming from.

Can you elaborate more on this? My impression is that \n is pretty standard across many languages, while the other pieces of syntax are relatively niche enough that I am not sure what the equivalents are in other languages.

Given that Perl has the world's premier regex engine (lmfao) I think @phated has a point here. Perl (and JS) use $ instead of \, so replacement strings look like $2, $1 instead of \2, \1. Python and Ruby, however, use the \ form. I guess the difference is that Ruby has single-quoted strings that let you write '\1' without issue, and Python has raw strings like r'\1', versus in Grain your only choice is to always escape them: "\\1". I have mixed feelings about it: on the one hand, I like \ better since it's 1-1 with backreferences. On the other hand, a lot of folks will come from JS land and Perl regex is very popular, plus there wouldn't be any need to escape. 🤔

@peblair peblair force-pushed the regexp branch 3 times, most recently from a45e027 to 4e562f1 Compare June 30, 2021 19:32
@peblair
Member Author

peblair commented Jun 30, 2021

@ospencer @phated I've updated the module to contain JS-like replace syntax and to use the proper Graindoc syntax. Here is the updated doc for replace:

replace : (RegularExpression, String, String) -> String

Replaces the first match for the given regular expression contained within the given string with the given replacement.
Replacement strings support the following syntax:

  • $& - Replaced with the text of the matching portion of input (e.g. for (foo), the search string foo bar, and the replacement baz $&, the result will be baz foo bar)
  • $n / $nn (where n is a digit) - Replaced with the text of group nn
  • $$ - Replaced with a literal $
  • $. - Does nothing (this exists to support replacement strings such as $4$.0, which will place the contents of group 4 prior to a zero)
  • $` - Replaced with the text of the string prior to the matched subspan of text
  • $' - Replaced with the text of the string after the matched subspan of text
  • Any other character will be placed as-is in the replaced output.

Parameters:

param | type | description
rx | RegularExpression | The regular expression to search for
toSearch | String | The string to search within
replacement | String | The string to replace matches with

Returns:

type | description
String | The replacement string with the appropriate replacement, if any
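
A short sketch of the `$`-based replacement syntax, assuming the "regex" and "result" modules are importable as shown. The first call is the example from the doc text above; the other inputs are made up.

```grain
import Regex from "regex"
import Result from "result"

let rx = Result.unwrap(Regex.make("(foo)"))

// $& stands for the whole match (the example from the doc above).
print(Regex.replace(rx, "foo bar", "baz $&"))

// $1 refers to group 1, and $$ produces a literal dollar sign.
print(Regex.replace(rx, "foo bar", "[$1] costs $$5"))

// replaceAll applies the same substitution to every match.
print(Regex.replaceAll(rx, "foo foo", "<$1>"))
```

Because `$` has no special meaning in a Grain string literal, none of these replacement strings need the double escaping that the earlier `\`-based syntax required.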

Member

@ospencer ospencer left a comment

I have a couple of tiny things, but other than those I think this is solid enough to go in.

Member

@ospencer ospencer left a comment

/^(?=approved\!)/

@ospencer
Member

ospencer commented Jul 8, 2021

Though it does irk me to see those warnings in the test output. Could you send a PR that just disables that warning for now until we resolve that issue? I think it's #190 but I'm not sure.

Member

@phated phated left a comment

Holy cow, batman! Most of my comments are surface things because I barely followed the regexp parser/matcher logic.

I think there are still a few things we need to bikeshed (and I really need to review the tests still)

Member

@marcusroberts marcusroberts left a comment

LGTM!

@peblair peblair merged commit 9601e16 into main Sep 14, 2021
@peblair peblair deleted the regexp branch September 14, 2021 23:10
@github-actions github-actions bot mentioned this pull request May 31, 2022
Successfully merging this pull request may close these issues.

Stdlib: RegExp
4 participants