
Commit 909b1c1

0.0.5 release (#3)
* Underscore instead of hyphen for publish in Makefile.
* Increment version.
* Prefer shorter codepoint values when escaping for RE2.
* More uniformity in tests for regex expressions.
* Parameters actually accept string chars when iterating a character range.
* Add regex making function.
* Formatting change to bash commands in readme to make it easier to copy.
* Improve readability in a few docstring examples when rendered to markdown.
* Considering test parameterization.
  - Added `to_nfc` test.
  - Avoiding lru_cache when evaluating enum (just build expression on script initialization to prevent slowness for now).
  - Additional/improved edge cases in some tests.
  - Added reserved regex expressions to constants.
* Including requirements-test.txt for convenience.
* Use full match in test instead of search.
* Short-term implementation adding ability to change default regex flavor at any point.
* Update readme.
* RE2 should only be required for testing.
* Move `default_flavor` from `regex_toolkit.utils` to `regex_toolkit.base` and add test for changing the default.
* Readme clean up.
1 parent d667880 commit 909b1c1

18 files changed: +881 -760 lines changed

.gitignore

+1
@@ -8,6 +8,7 @@
 !environment.yml
 !codecov.yml
 !requirements-doc.txt
+!requirements-test.txt
 
 !src/
 !src/*

Makefile

+1 -1
@@ -1,5 +1,5 @@
 PYTHON=python3
-APP_NAME=regex-toolkit
+APP_NAME=regex_toolkit
 
 install:
 	${PYTHON} -m pip install .

README.md

+143 -27
@@ -36,7 +36,7 @@ Most stable version from [**PyPi**](https://pypi.org/project/regex-toolkit/):
 [![PyPI - License](https://img.shields.io/pypi/l/regex-toolkit?style=flat-square)](https://pypi.org/project/regex-toolkit/)
 
 ```bash
-$ python3 -m pip install regex-toolkit
+python3 -m pip install regex-toolkit
 ```
 
 Development version from [**GitHub**](https://github.com/Phosmic/regex-toolkit):
@@ -48,26 +48,49 @@ Development version from [**GitHub**](https://github.com/Phosmic/regex-toolkit):
 
 
 ```bash
-$ git clone git+https://github.com/Phosmic/regex-toolkit.git
-$ cd regex-toolkit
-$ python3 -m pip install -e .
+git clone git+https://github.com/Phosmic/regex-toolkit.git
+cd regex-toolkit
+python3 -m pip install -e .
 ```
 
 ---
 
 ## Usage
 
-Import packages:
+To harness the toolkit's capabilities, you should import the necessary packages:
 
 ```python
 import re
 # and/or
 import re2
+import regex_toolkit as rtk
 ```
 
-```python
-import regex_toolkit
-```
+### Why Use `regex_toolkit`?
+
+Regex definitions vary across languages and versions.
+By using the toolkit, you can achieve a more consistent and comprehensive representation of unicode support.
+It is especially useful to supplement base unicode sets with the latest definitions from other languages and standards.
+
+### RE2 Overview
+
+RE2 focuses on safely processing regular expressions, particularly from untrusted inputs.
+It ensures both linear match time and efficient memory usage.
+Although it might not always surpass other engines in speed, it intentionally omits features that depend solely on backtracking, like backreferences and look-around assertions.
+
+A brief rundown of RE2 terminology:
+
+- **BitState**: An execution engine that uses backtracking search.
+- **bytecode**: The set of instructions that form an automaton.
+- **DFA**: The engine for Deterministic Finite Automaton searches.
+- **NFA**: Implements the Nondeterministic Finite Automaton search method.
+- **OnePass**: A one-pass search execution engine.
+- **pattern**: The textual form of a regex.
+- **Prog**: The compiled version of a regex.
+- **Regexp**: The parsed version of a regex.
+- **Rune**: A character in terms of encoding, essentially a code point.
+
+For an in-depth exploration, please refer to the [RE2 documentation](https://github.com/google/re2/wiki/Glossary).
 
 ---
 
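The "RE2 Overview" text added in this hunk says RE2 omits backreferences and look-around assertions. As a rough illustration of what those two features look like, the sketch below uses only the standard-library `re` module, which does support them; the pattern names are made up for this example, and nothing here calls RE2 itself.

```python
import re

# Backreference: \1 must repeat exactly what group 1 captured.
doubled_word = re.compile(r"\b(\w+) \1\b")

# Look-ahead assertion: match "foo" only when "bar" follows, without consuming it.
foo_before_bar = re.compile(r"foo(?=bar)")

print(doubled_word.search("this is is a test").group(0))  # is is
print(foo_before_bar.search("foobar").span())             # (0, 3)
print(foo_before_bar.search("foobaz"))                    # None
```

Patterns of this kind would need to be rewritten without `\1` or `(?=...)` before they could be handed to an RE2-based engine.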

@@ -77,6 +100,39 @@ import regex_toolkit
 
 # `regex_toolkit.utils`
 
+<a id="regex_toolkit.utils.resolve_flavor"></a>
+
+#### `resolve_flavor`
+
+```python
+def resolve_flavor(potential_flavor: int | RegexFlavor | None) -> RegexFlavor
+```
+
+Resolve a regex flavor.
+
+If the flavor is an integer, it is validated and returned.
+If the flavor is a RegexFlavor, it is returned.
+If the flavor is None, the default flavor is returned. To change the default flavor, set `default_flavor`.
+
+```python
+import regex_toolkit as rtk
+
+rtk.base.default_flavor = 2
+assert rtk.utils.resolve_flavor(None) == rtk.enums.RegexFlavor.RE2
+```
+
+**Arguments**:
+
+- `potential_flavor` _int | RegexFlavor | None_ - Potential regex flavor.
+
+**Returns**:
+
+- _RegexFlavor_ - Resolved regex flavor.
+
+**Raises**:
+
+- `ValueError` - Invalid regex flavor.
+
 <a id="regex_toolkit.utils.iter_sort_by_len"></a>
 
 #### `iter_sort_by_len`
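The three `resolve_flavor` rules documented in this hunk (integer validated, enum passed through, `None` mapped to the default) could be implemented along the following lines. This is a standalone sketch, not the package's code: the `RegexFlavor` enum here is a stand-in built from the "1 for RE, 2 for RE2" convention in the docs, and the initial default of `RE` is an assumption.

```python
from __future__ import annotations

from enum import IntEnum


class RegexFlavor(IntEnum):
    """Stand-in enum following the documented '1 for RE, 2 for RE2' values."""

    RE = 1
    RE2 = 2


# Hypothetical module-level default; the starting value is an assumption.
default_flavor: int | RegexFlavor = RegexFlavor.RE


def resolve_flavor(potential_flavor: int | RegexFlavor | None) -> RegexFlavor:
    """Return a validated RegexFlavor, falling back to the module default."""
    if potential_flavor is None:
        potential_flavor = default_flavor
    try:
        # The enum constructor accepts both members and their integer values.
        return RegexFlavor(potential_flavor)
    except ValueError as err:
        raise ValueError(f"Invalid regex flavor: {potential_flavor!r}") from err


print(resolve_flavor(2))     # RegexFlavor.RE2
print(resolve_flavor(None))  # falls back to the default
```

Reading `default_flavor` at call time rather than binding it as a default argument is what lets a later assignment to the module attribute take effect.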
@@ -134,8 +190,8 @@ The codepoint is always 8 characters long (zero-padded).
 **Example**:
 
 ```python
-# Output: '00000061'
 ord_to_cpoint(97)
+# Output: '00000061'
 ```
 
 **Arguments**:
@@ -177,8 +233,8 @@ Character to character codepoint.
 **Example**:
 
 ```python
-# Output: '00000061'
 char_to_cpoint("a")
+# Output: '00000061'
 ```
 
 **Arguments**:
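The two hunks above document 8-character, zero-padded hexadecimal codepoints (`97` becomes `'00000061'`). A minimal standalone sketch of that formatting follows; the function bodies are written for this example, and the uppercase hex digits are an assumption because the documented outputs contain no letters.

```python
def ord_to_cpoint(ordinal: int) -> str:
    """Format an ordinal as 8 zero-padded hex digits (uppercase is assumed)."""
    return format(ordinal, "08X")


def char_to_cpoint(char: str) -> str:
    """Format a single character's code point the same way."""
    return ord_to_cpoint(ord(char))


print(ord_to_cpoint(97))         # 00000061
print(char_to_cpoint("a"))       # 00000061
print(char_to_cpoint("\u00e9"))  # 000000E9
```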
@@ -201,6 +257,13 @@ Normalize a Unicode string to NFC form C.
 
 Form C favors the use of a fully combined character.
 
+**Example**:
+
+```python
+to_nfc("e\\u0301") == "é"
+# Output: True
+```
+
 **Arguments**:
 
 - `text` _str_ - String to normalize.
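For reference, NFC composition can be reproduced with the standard library alone. The sketch below assumes nothing about how `to_nfc` is implemented; it only shows `unicodedata.normalize` combining `e` and U+0301 into the single precomposed character. The escape is written with a single backslash so that Python interprets `\u0301` as a combining accent rather than as six literal characters.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True
```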
@@ -214,39 +277,59 @@ Form C favors the use of a fully combined character.
 #### `iter_char_range`
 
 ```python
-def iter_char_range(first_cpoint: int,
-                    last_cpoint: int) -> Generator[str, None, None]
+def iter_char_range(first_char: str,
+                    last_char: str) -> Generator[str, None, None]
 ```
 
-Iterate all characters within a range of codepoints (inclusive).
+Iterate all characters within a range of characters (inclusive).
+
+**Example**:
+
+```python
+char_range("a", "c")
+# Output: ('a', 'b', 'c')
+
+char_range("c", "a")
+# Output: ('c', 'b', 'a')
+```
 
 **Arguments**:
 
-- `first_cpoint` _int_ - Starting (first) codepoint.
-- `last_cpoint` _int_ - Ending (last) codepoint.
+- `first_char` _str_ - Starting (first) character.
+- `last_char` _str_ - Ending (last) character.
 
 **Yields**:
 
-- _str_ - Characters within a range of codepoints.
+- _str_ - Characters within a range of characters.
 
 <a id="regex_toolkit.utils.char_range"></a>
 
 #### `char_range`
 
 ```python
-def char_range(first_cpoint: int, last_cpoint: int) -> tuple[str, ...]
+def char_range(first_char: str, last_char: str) -> tuple[str, ...]
 ```
 
-Tuple of all characters within a range of codepoints (inclusive).
+Tuple of all characters within a range of characters (inclusive).
+
+**Example**:
+
+```python
+char_range("a", "d")
+# Output: ('a', 'b', 'c', 'd')
+
+char_range("d", "a")
+# Output: ('d', 'c', 'b', 'a')
+```
 
 **Arguments**:
 
-- `first_cpoint` _int_ - Starting (first) codepoint.
-- `last_cpoint` _int_ - Ending (last) codepoint.
+- `first_char` _str_ - Starting (first) character.
+- `last_char` _str_ - Ending (last) character.
 
 **Returns**:
 
-- _tuple[str, ...]_ - Characters within a range of codepoints.
+- _tuple[str, ...]_ - Characters within a range of characters.
 
 <a id="regex_toolkit.utils.mask_span"></a>
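The documented examples show the range being inclusive at both ends and running backwards when the first character sorts after the last. One way to get that behaviour is sketched below; it is a standalone illustration under those two assumptions, not the code this commit changed.

```python
from collections.abc import Iterator


def iter_char_range(first_char: str, last_char: str) -> Iterator[str]:
    """Yield every character from first_char to last_char, inclusive."""
    first, last = ord(first_char), ord(last_char)
    # Step backwards when the range is given in descending order.
    step = 1 if first <= last else -1
    for cpoint in range(first, last + step, step):
        yield chr(cpoint)


def char_range(first_char: str, last_char: str) -> tuple[str, ...]:
    """Materialise the generator above as a tuple."""
    return tuple(iter_char_range(first_char, last_char))


print(char_range("a", "d"))  # ('a', 'b', 'c', 'd')
print(char_range("d", "a"))  # ('d', 'c', 'b', 'a')
```

Adding `step` to the stop value is what makes the upper (or lower) bound inclusive in both directions.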

@@ -303,15 +386,15 @@ Todo: Add support for overlapping (and unordered?) spans.
 #### `escape`
 
 ```python
-def escape(char: str, flavor: int = 1) -> str
+def escape(char: str, flavor: int | None = None) -> str
 ```
 
 Create a regex expression that exactly matches a character.
 
 **Arguments**:
 
 - `char` _str_ - Character to match.
-- `flavor` _int, optional_ - Regex flavor (1 for RE, 2 for RE2). Defaults to 1.
+- `flavor` _int | None, optional_ - Regex flavor (1 for RE, 2 for RE2). Defaults to None.
 
 **Returns**:

@@ -326,15 +409,15 @@ Create a regex expression that exactly matches a character.
 #### `string_as_exp`
 
 ```python
-def string_as_exp(text: str, flavor: int = 1) -> str
+def string_as_exp(text: str, flavor: int | None = None) -> str
 ```
 
 Create a regex expression that exactly matches a string.
 
 **Arguments**:
 
 - `text` _str_ - String to match.
-- `flavor` _int, optional_ - Regex flavor (1 for RE, 2 for RE2). Defaults to 1.
+- `flavor` _int | None, optional_ - Regex flavor (1 for RE, 2 for RE2). Defaults to None.
 
 **Returns**:

@@ -349,15 +432,15 @@ Create a regex expression that exactly matches a string.
 #### `strings_as_exp`
 
 ```python
-def strings_as_exp(texts: Iterable[str], flavor: int = 1) -> str
+def strings_as_exp(texts: Iterable[str], flavor: int | None = None) -> str
 ```
 
 Create a regex expression that exactly matches any one string.
 
 **Arguments**:
 
 - `texts` _Iterable[str]_ - Strings to match.
-- `flavor` _int, optional_ - Regex flavor (1 for RE, 2 for RE2). Defaults to 1.
+- `flavor` _int | None, optional_ - Regex flavor (1 for RE, 2 for RE2). Defaults to None.
 
 **Returns**:
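To make "exactly matches any one string" concrete, here is a common way to build such an alternation with the standard library. It is a sketch only: `alternation_exp` is a hypothetical name, `re.escape` stands in for whatever escaping the toolkit applies per flavor, and the longest-first ordering is an assumption about how prefix collisions are avoided.

```python
import re
from collections.abc import Iterable


def alternation_exp(texts: Iterable[str]) -> str:
    """Join escaped strings with '|', longest first (hypothetical helper)."""
    # Longest first, so "foobar" is tried before its prefix "foo".
    ordered = sorted(set(texts), key=lambda s: (-len(s), s))
    return "|".join(re.escape(s) for s in ordered)


exp = alternation_exp(["foo", "foobar", "a.b"])
print(exp)                                # foobar|a\.b|foo
print(re.match(exp, "foobar").group(0))   # foobar
print(re.fullmatch(exp, "axb"))           # None
```

Without the length ordering, a leftmost-first engine such as `re` would stop at `foo` when searching `foobar`.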

@@ -367,6 +450,39 @@ Create a regex expression that exactly matches any one string.
 
 - `ValueError` - Invalid regex flavor.
 
+<a id="regex_toolkit.base.make_exp"></a>
+
+#### `make_exp`
+
+```python
+def make_exp(chars: Iterable[str], flavor: int | None = None) -> str
+```
+
+Create a regex expression that exactly matches a list of characters.
+
+The characters are sorted and grouped into ranges where possible.
+The expression is not anchored, so it can be used as part of a larger expression.
+
+**Example**:
+
+```python
+exp = "[" + make_exp(["a", "b", "c", "z", "y", "x"]) + "]"
+# Output: '[a-cx-z]'
+```
+
+**Arguments**:
+
+- `chars` _Iterable[str]_ - Characters to match.
+- `flavor` _int | None, optional_ - Regex flavor (1 for RE, 2 for RE2). Defaults to None.
+
+**Returns**:
+
+- _str_ - Expression that exactly matches the original characters.
+
+**Raises**:
+
+- `ValueError` - Invalid regex flavor.
+
 <a id="regex_toolkit.enums"></a>
 
 # `regex_toolkit.enums`
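The "sorted and grouped into ranges" step described for `make_exp` can be sketched as follows. This is not the package's implementation: `group_into_ranges` is a hypothetical name, `re.escape` replaces the flavor-specific escaping, and collapsing only runs of three or more characters is an assumption.

```python
import re
from collections.abc import Iterable


def group_into_ranges(chars: Iterable[str]) -> str:
    """Build the body of a character class, merging consecutive code points."""
    cpoints = sorted({ord(char) for char in chars})

    # Collect (start, end) runs of consecutive code points.
    runs: list[tuple[int, int]] = []
    for cpoint in cpoints:
        if runs and cpoint == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], cpoint)
        else:
            runs.append((cpoint, cpoint))

    parts: list[str] = []
    for start, end in runs:
        if end - start >= 2:
            # Three or more characters: write as "start-end".
            parts.append(f"{re.escape(chr(start))}-{re.escape(chr(end))}")
        else:
            # One or two characters: list them individually.
            parts.extend(re.escape(chr(cp)) for cp in range(start, end + 1))
    return "".join(parts)


print("[" + group_into_ranges(["a", "b", "c", "z", "y", "x"]) + "]")  # [a-cx-z]
```

Because the result is only the class body, the caller decides whether to wrap it in `[...]`, negate it, or combine it with other pieces.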

ci/deps/actions-310.yml

-3
@@ -8,8 +8,5 @@ dependencies:
   - pytest>=7.0.0
   - pytest-cov
   - pytest-xdist>=2.2.0
-  # - pytest-asyncio>=0.17
-
-  # Required dependencies
   - pip:
     - google-re2>=1.0

ci/deps/actions-311.yml

-3
@@ -8,8 +8,5 @@ dependencies:
   - pytest>=7.0.0
   - pytest-cov
   - pytest-xdist>=2.2.0
-  # - pytest-asyncio>=0.17
-
-  # Required dependencies
   - pip:
     - google-re2>=1.0

docs/templates/install.md.jinja

+4 -4
@@ -5,7 +5,7 @@ Most stable version from [**PyPi**](https://pypi.org/project/{{ pypi.name }}/):
 [![PyPI - License](https://img.shields.io/pypi/l/{{ pypi.name }}?style=flat-square)](https://pypi.org/project/{{ pypi.name }}/)
 
 ```bash
-$ python3 -m pip install {{ pypi.name }}
+python3 -m pip install {{ pypi.name }}
 ```
 
 Development version from [**GitHub**](https://github.com/{{ repo.owner }}/{{ repo.name }}):
@@ -21,7 +21,7 @@ Development version from [**GitHub**](https://github.com/{{ repo.owner }}/{{ rep
 {% endif %}
 
 ```bash
-$ git clone git+https://github.com/{{ repo.owner }}/{{ repo.name }}.git
-$ cd {{ repo.name }}
-$ python3 -m pip install -e .
+git clone git+https://github.com/{{ repo.owner }}/{{ repo.name }}.git
+cd {{ repo.name }}
+python3 -m pip install -e .
 ```

docs/templates/usage.md.jinja

+27 -4
@@ -1,11 +1,34 @@
-Import packages:
+To harness the toolkit's capabilities, you should import the necessary packages:
 
 ```python
 import re
 # and/or
 import re2
+import regex_toolkit as rtk
 ```
 
-```python
-import regex_toolkit
-```
+### Why Use `regex_toolkit`?
+
+Regex definitions vary across languages and versions.
+By using the toolkit, you can achieve a more consistent and comprehensive representation of unicode support.
+It is especially useful to supplement base unicode sets with the latest definitions from other languages and standards.
+
+### RE2 Overview
+
+RE2 focuses on safely processing regular expressions, particularly from untrusted inputs.
+It ensures both linear match time and efficient memory usage.
+Although it might not always surpass other engines in speed, it intentionally omits features that depend solely on backtracking, like backreferences and look-around assertions.
+
+A brief rundown of RE2 terminology:
+
+- **BitState**: An execution engine that uses backtracking search.
+- **bytecode**: The set of instructions that form an automaton.
+- **DFA**: The engine for Deterministic Finite Automaton searches.
+- **NFA**: Implements the Nondeterministic Finite Automaton search method.
+- **OnePass**: A one-pass search execution engine.
+- **pattern**: The textual form of a regex.
+- **Prog**: The compiled version of a regex.
+- **Regexp**: The parsed version of a regex.
+- **Rune**: A character in terms of encoding, essentially a code point.
+
+For an in-depth exploration, please refer to the [RE2 documentation](https://github.com/google/re2/wiki/Glossary).
