feat(stdlib): Regular Expressions #680
Awesome work, and as you can imagine I love this API :D Haven't dove super deep into the code yet, but here's how I want to rename stuff (largely the names don't need
And then the last thing would be renaming
After discussing with @ospencer (we can't use
I will update the main post to reflect these changes.
Wow! Great work here!
7e63a36 to b7c684f
So, I haven't gotten to look through the implementation yet, but the first thought I had while looking over the tests was "what is the actual public API of this library?" - I appreciate that you have added that overview in the body of the PR, but as someone who usually jumps to tests to see the usage, it feels weird that our tests need to be written with helpers. I believe that indicates that the API might be too complex for ergonomic usage.

As for the public API, I believe that we shouldn't have functions in a

What are

Why does the

I'm also wondering why we return an

I'm not sure I like the replace syntax. It feels like too much of a deviation from the languages people are going to be coming from.

I'll probably have more, but I need to sign off right now.
Deferring to @ospencer on this one.
They are the position offsets (start and end offset). Capture groups are not always fixed-size, so we need to return both.
This behavior was taken from Racket, and I think it makes sense. When you say "give me all of the matches in this string", that is an operation which returns zero+ matches, and users will want to simply iterate over that list. If you say "give me all of the groups in this match", an array makes more sense because (a) you know precisely the number of groups which are defined in the regex and (b) you are more likely to want to do random access on that list than iterate over it. An array is more conducive to that scenario.
I suppose it could, but I feel like it is nicer style-wise to use
Can you elaborate more on this? My impression is that I personally feel that, while the API is a bit verbose, it leans into the
In this case, I thought exceptions over options here were more ergonomic since you always knew what groups would match (unless you were compiling user-defined regexes) and it'd be annoying to have to unwrap this option. That gave us both the exception and non-exception versions. I did forget that groups can be optional, i.e.
Given that Perl has the world's premier regex engine (lmfao) I think @phated has a point here. Perl (and JS) use
a45e027 to 4e562f1
@ospencer @phated I've updated the module to contain JS-like replace syntax and to use the proper Graindoc syntax. Here is the updated doc for
Replaces the first match for the given regular expression contained within the given string with the given replacement.
Parameters:
Returns:
I have a couple of tiny things, but other than those I think this is solid enough to go in.
/^(?=approved\!)/
Though it does irk me to see those warnings in the test output. Could you send a PR that just disables that warning for now until we resolve that issue? I think it's #190 but I'm not sure.
Holy cow, batman! Most of my comments are surface things because I barely followed the regexp parser/matcher logic.
I think there are still a few things we need to bikeshed (and I really need to review the tests still)
472fbb8 to e11220a
LGTM!
This pull request adds a new module to the standard library for working with regular expressions. The flavor of regular expressions added here is adapted from Racket, providing a full Perl-like regex syntax for users to work with.
Regular Expressions in Grain
Compiles the given pattern string into a regular expression object.
For an overview of the theory of regular expressions in general, readers are referred
to "Mastering Regular Expressions" by Friedl, or numerous alternative resources online.
Regular expressions are a combination of normal and special characters. A normal
character in a pattern will match a one-character string containing that character.
Moreover, if there are two regular expressions `A` and `B`, they can be concatenated
into a regular expression `AB`. If a string `p` matches `A` and `q` matches `B`,
then `pq` will match `AB`.

The special character sequences are as follows:
- `.` - Matches any character, except for a newline in multi-line mode
- `^` - Matches the beginning of the input, or after a newline (`\n`) in multi-line mode
- `$` - Matches the end of the input, or before a newline (`\n`) in multi-line mode
- `«re»*` - Matches `«re»` zero or more times
- `«re»+` - Matches `«re»` one or more times
- `«re»?` - Matches `«re»` zero or one times
- `«re»{«n»}` - Matches `«re»` exactly `«n»` times
- `«re»{«n»,}` - Matches `«re»` `«n»` or more times
- `«re»{,«m»}` - Matches `«re»` zero to `«m»` times
- `«re»{«n»,«m»}` - Matches `«re»` between `«n»` and `«m»` times
- `«re»{}` - Matches `«re»` zero or more times
- `[«rng»]` - Matches any character in `«rng»` (see below)
- `[^«rng»]` - Matches any character not in `«rng»` (see below)
- `\«n»` - Matches the latest match for group `«n»` (one-indexed)
- `\b` - Matches the boundary of `\w*` (`\w` defined below, under "basic classes")
- `\B` - Matches where `\b` does not
- `\p{«property»}` - Matches any character with Unicode property `«property»` (see below)
- `\P{«property»}` - Matches any character without Unicode property `«property»` (see below)
- `(«re»)` - Matches `«re»`, storing the result in a group
- `(?:«re»)` - Matches `«re»` without storing the result in a group
- `(?«mode»:«re»)` - Matches `«re»` with the mode settings specified by `«mode»` using the following syntax:
  - `«mode»i` - The same as `«mode»`, but with case-insensitivity enabled (temporarily not supported until #661, "Char Unicode Data and Conversions", is resolved)
  - `«mode»-i` - The same as `«mode»`, but with case-insensitivity disabled (the default)
  - `«mode»m` / `«mode»-s` - The same as `«mode»`, but with multi-line mode enabled
  - `«mode»-m` / `«mode»s` - The same as `«mode»`, but with multi-line mode disabled (the default)
- `(?«tst»«re1»|«re2»)` - Will match `«re1»` if `«tst»`, otherwise will match `«re2»`. The following options are available for `«tst»`:
  - `(«n»)` - Will be true if group `«n»` has a match
  - `(?=«re»)` - Will be true if `«re»` matches the next sequence
  - `(?!«re»)` - Will be true if `«re»` does not match the next sequence
  - `(?<=«re»)` - Will be true if `«re»` matches the preceding sequence
  - `(?<!«re»)` - Will be true if `«re»` does not match the preceding sequence
- `(?«tst»«re»)` - Equivalent to `(?«tst»«re»|)`
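To make a few of the sequences above concrete, here is a small sketch using Python's built-in `re` module, which happens to accept the same Perl-style spelling for bounded repetition, backreferences, lookaround, and group-based conditionals. This is only an illustration of the pattern syntax: it is not the Grain API added in this PR, the helper `full` is a made-up name, and the two engines may well differ on other constructs (for example mode flags, `{}`, POSIX classes, and `\p`).

```python
import re

def full(pattern: str, text: str) -> bool:
    """True if the whole of text matches pattern (Python's re, not Grain's engine)."""
    return re.fullmatch(pattern, text) is not None

# «re»{«n»,«m»}: between n and m repetitions.
bounded = [full(r"a{2,3}", s) for s in ("a", "aa", "aaa", "aaaa")]

# \«n»: backreference to the latest match of group n (one-indexed).
doubled = (full(r"(ab|cd)\1", "cdcd"), full(r"(ab|cd)\1", "abcd"))

# (?=«re») and (?!«re»): lookahead tests consume no input.
ahead = re.match(r"foo(?=bar)", "foobar")
not_ahead = re.match(r"foo(?!bar)", "foobar")

# (?<=«re»): lookbehind test on the preceding sequence.
behind = re.search(r"(?<=\$)\d+", "cost: $42")

# (?(«n»)«re1»): conditional on whether group n has a match.
# With no «re2», the "otherwise" branch matches the empty string.
bracketed = [full(r"(<)?\w+(?(1)>)", s) for s in ("<abc>", "abc", "<abc", "abc>")]

print(bounded, doubled, ahead.group(0), not_ahead, behind.group(0), bracketed)
```

The conditional example only accepts a closing `>` when the optional opening `<` group actually matched, which is the behaviour the `(«n»)` test describes.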
Character ranges (referred to as `«rng»` above) have the following syntax:

- `«c»` - Matches the character `«c»` exactly
- `«c1»-«c2»` - Matches any character with a character code between the character code for `«c1»` and the code for `«c2»`

These forms can be repeated any number of times, which will construct a range of their union. That is, `[ba-c]` and `[a-c]` are equivalent ranges.

Additionally, there are the following special cases:

- `]` as the first character of the range will match a `]`
- `-` as the first or last character of the range will match a `-`
- `^` in any position other than the first position will match a `^`
- `\«c»`, where `«c»` is a non-alphabetic character, will match `«c»`
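The union semantics and the special cases can be sketched as a tiny expander that turns the body of a range into the set of characters it matches. This is a hypothetical illustration written in Python, not the parser from this PR: `expand_range` is an invented name, it expects the text between the brackets without a leading negating `^`, and it ignores character classes entirely.

```python
def expand_range(spec: str) -> set:
    """Expand the body of a bracket range into the set of characters it matches.

    Sketch only: handles single characters, c1-c2 spans, and \\c escapes.
    """
    chars = set()
    i = 0
    while i < len(spec):
        c = spec[i]
        if c == "\\" and i + 1 < len(spec) and not spec[i + 1].isalpha():
            # \c with a non-alphabetic c matches c itself.
            chars.add(spec[i + 1])
            i += 2
        elif i + 2 < len(spec) and spec[i + 1] == "-":
            # c1-c2: every character whose code lies between the two codes.
            lo, hi = ord(c), ord(spec[i + 2])
            chars.update(chr(code) for code in range(lo, hi + 1))
            i += 3
        else:
            # A lone character, including "]" in first position and
            # "-" in first or last position.
            chars.add(c)
            i += 1
    return chars

print(expand_range("ba-c") == expand_range("a-c"))
print(sorted(expand_range("a-")), sorted(expand_range("]x")))
```

Because the result is a set, repeating or overlapping forms simply merge, which is why `[ba-c]` and `[a-c]` describe the same range.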
Furthermore, ranges can include character classes, which are predefined commonly-used
sets of characters. There are two "flavors" of these: basic classes and POSIX classes.
Both are provided for ease of use and to maximize compatibility with other regular
expression engines, so feel free to use whichever is most convenient.
The basic classes are as follows:
- `\d` - Matches `0-9`
- `\D` - Matches characters not in `\d`
- `\w` - Matches `a-z`, `A-Z`, `0-9`, and `_`
- `\W` - Matches characters not in `\w`
- `\s` - Matches space, tab, formfeed, and return
- `\S` - Matches characters not in `\s`
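Read as plain character sets, the basic classes look roughly like the following Python sketch. The sets are transcribed from the list above rather than taken from the implementation, and `matches_class` is a hypothetical helper used only to show that each upper-case class is the complement of its lower-case counterpart.

```python
import string

# Character sets transcribed from the list above (ASCII only).
DIGIT = set(string.digits)                        # \d
WORD = set(string.ascii_letters) | DIGIT | {"_"}  # \w
SPACE = {" ", "\t", "\f", "\r"}                   # \s, as listed above

def matches_class(cls: str, ch: str) -> bool:
    """Return True if the single character ch belongs to basic class cls.

    cls is the letter after the backslash: d, D, w, W, s, or S.
    """
    table = {"d": DIGIT, "w": WORD, "s": SPACE}
    members = table[cls.lower()]
    # Upper-case names (\D, \W, \S) are the complements of the lower-case ones.
    return (ch not in members) if cls.isupper() else (ch in members)

print(matches_class("w", "_"), matches_class("W", "-"), matches_class("D", "7"))
```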
The POSIX classes are as follows:
- `[:alpha:]` - Matches `a-z` and `A-Z`
- `[:upper:]` - Matches `A-Z`
- `[:lower:]` - Matches `a-z`
- `[:digit:]` - Matches `0-9`
- `[:xdigit:]` - Matches `0-9`, `a-f`, and `A-F`
- `[:alnum:]` - Matches `a-z`, `A-Z`, and `0-9`
- `[:word:]` - Matches `a-z`, `A-Z`, `0-9`, and `_`
- `[:blank:]` - Matches space and tab
- `[:space:]` - Matches space, tab, newline, formfeed, and return
- `[:graph:]` - Matches all ASCII characters which use ink when printed
- `[:print:]` - Matches space, tab, and all ASCII ink users
- `[:cntrl:]` - Contains all characters with code points < 32
- `[:ascii:]` - Contains all ASCII characters

Finally, the following is the list of supported Unicode properties. Note that
until #661 is resolved, matches using `\p` and `\P` are disabled,
as Grain is not currently able to determine membership in these character classes.
These class codes come from this portion of the Unicode standard:
https://www.unicode.org/reports/tr44/#General_Category_Values
- `Ll` - Letter, lowercase
- `Lu` - Letter, uppercase
- `Lt` - Letter, titlecase
- `Lm` - Letter, modifier
- `L&` - Union of `Ll`, `Lu`, `Lt`, and `Lm`
- `Lo` - Letter, other
- `L` - Union of `L&` and `Lo`
- `Nd` - Number, decimal digit
- `Nl` - Number, letter
- `No` - Number, other
- `N` - Union of `Nd`, `Nl`, and `No`
- `Ps` - Punctuation, open
- `Pe` - Punctuation, close
- `Pi` - Punctuation, initial quote
- `Pf` - Punctuation, final quote
- `Pc` - Punctuation, connector
- `Pd` - Punctuation, dash
- `Po` - Punctuation, other
- `P` - Union of `Ps`, `Pe`, `Pi`, `Pf`, `Pc`, `Pd`, and `Po`
- `Mn` - Mark, non-spacing
- `Mc` - Mark, spacing combining
- `Me` - Mark, enclosing
- `M` - Union of `Mn`, `Mc`, and `Me`
- `Sc` - Symbol, currency
- `Sk` - Symbol, modifier
- `Sm` - Symbol, math
- `So` - Symbol, other
- `S` - Union of `Sc`, `Sk`, `Sm`, and `So`
- `Zl` - Separator, line
- `Zp` - Separator, paragraph
- `Zs` - Separator, space
- `Z` - Union of `Zl`, `Zp`, and `Zs`
- `Cc` - Other, control
- `Cf` - Other, format
- `Cs` - Other, surrogate
- `Cn` - Other, not assigned
- `Co` - Other, private use
- `C` - Union of `Cc`, `Cf`, `Cs`, `Cn`, and `Co`
- `.` - Union of all Unicode categories

Public API (for working with regular expressions)
Other Changes
(crossed-out changes have been delegated to linked separate PRs)
- ~~Mutually recursive `record` and `enum` types are now supported, and can be defined by separating data definitions with a comma~~ (feat: Support mutually recursive data definitions #725)
- ~~`String.join` has been added to the `String` module, which allows the concatenation of a list of strings~~ (feat(stdlib): Add List.join and Array.join functions #722)
- ~~`String.charAt` has been added to the `String` module, which allows the retrieval of a character in a string by index~~ (feat(stdlib): Add String.chatAt function #721)
- ~~A bug in record printing has been fixed which caused the closing brace of a nested record to have incorrect indentation~~ (fix(stdlib): Correctly indent nested record braces when printing #724)
- `min` and `max` functions have been added to `Pervasives`, which compute the minimum and maximum of two numbers, respectively (obsoleted by concurrent work on `Number` library)
- ~~`infinity` and `nan` constants have been added to `Float32` and `Float64`~~ (feat(stdlib): Add Float32/Float64 constants for infinity/nan #720)
- ~~`Array.subArray` has been added to the `Array` module, which allows for slicing an array~~ (feat(stdlib): Add Array.slice function #727)
- ~~`Array.zip` has been added to the `Array` module, which allows for easily combining elements of two arrays into a single array~~ (feat(stdlib): Add Array.zip function #719)

Closes #247.