ToList as Memory<T> #17

Closed · wants to merge 3 commits
README.md — 22 changes: 19 additions & 3 deletions
@@ -97,6 +97,22 @@ We use the official Unicode [test suites](https://unicode.org/reports/tr41/tr41-

[![.NET](https://github.com/clipperhouse/uax29.net/actions/workflows/dotnet.yml/badge.svg)](https://github.com/clipperhouse/uax29.net/actions/workflows/dotnet.yml)

This is the same algorithm that is implemented in Lucene's [StandardTokenizer](https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html).

### Major version changes

If you are using v1.x of this package, note that the package has been renamed in v2:

`dotnet add package uax29.net` → `dotnet add package UAX29`

`using uax29` → `using UAX29`

We now use extension methods:

`Tokenizer.Create(input)` → `input.GetWords()`

`Tokenizer.Create(input, TokenType.Graphemes)` → `input.GetGraphemes()`
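
A minimal sketch of the v2 calls (a sketch based on the notes above, not a definitive reference; each token is enumerated lazily as a span over the input):

```csharp
using System;
using UAX29;

var example = "Hello, how are you?";

// Lazy enumeration: no per-token allocation
foreach (var word in example.GetWords())
{
    Console.WriteLine(word.ToString()); // ToString() copies the span into a string for display
}

// Graphemes work the same way
foreach (var grapheme in example.GetGraphemes())
{
    // ...
}
```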

### Performance

When tokenizing words, I get around 100 MB/s on my MacBook M2. For typical text, that's around 25 million tokens/s. [Benchmarks](https://github.com/clipperhouse/uax29.net/tree/main/Benchmarks)
@@ -105,7 +121,7 @@ The tokenizer is implemented as a `ref struct`, so you should see zero allocations

Calling `GetWords` et al returns a lazy enumerator, and will not allocate per-token. There are `ToList` and `ToArray` methods for convenience, which will allocate.

For `Stream` or `TextReader`/`StreamReader`, a buffer needs to be allocated behind the scenes. You can specify the size when calling `GetWords`. You can re-use the buffer by calling `SetStream` on an existing tokenizer, which will avoid re-allocation.
For `Stream` or `TextReader`/`StreamReader`, a buffer needs to be allocated behind the scenes. You can specify the size when calling `GetWords`. You can also optionally pass your own `byte[]` or `char[]` to do your own allocation, perhaps with [ArrayPool](https://learn.microsoft.com/en-us/dotnet/api/system.buffers.arraypool-1). Or, you can re-use the buffer by calling `SetStream` on an existing tokenizer, which will avoid re-allocation.
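
As an illustration, a sketch of renting a pooled buffer and reusing it across streams (the buffer-accepting overload and its parameter position are assumptions; `SetStream` is described above):

```csharp
using System.Buffers;
using System.IO;
using UAX29;

var pool = ArrayPool<byte>.Shared;
var buffer = pool.Rent(4096);
try
{
    using var first = File.OpenRead("first.txt");
    var words = first.GetWords(buffer); // assumed overload accepting a caller-owned buffer
    foreach (var word in words)
    {
        // word is valid only until the next iteration; copy it if you need to keep it
    }

    using var second = File.OpenRead("second.txt");
    words.SetStream(second); // reuse the tokenizer and its buffer, avoiding re-allocation
    foreach (var word in words)
    {
        // ...
    }
}
finally
{
    pool.Return(buffer);
}
```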

### Invalid inputs

@@ -123,10 +139,10 @@ The .NET Core standard library has a similar enumerator for graphemes.

### Other language implementations

[Java](https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html)

[JavaScript](https://github.com/tc39/proposal-intl-segmenter)

[Rust](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/trait.UnicodeSegmentation.html)

[Java](https://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/analysis/standard/StandardTokenizerImpl.html)

[Python](https://uniseg-python.readthedocs.io/en/latest/)
uax29/README.md — 22 changes: 19 additions & 3 deletions
@@ -97,6 +97,22 @@ We use the official Unicode [test suites](https://unicode.org/reports/tr41/tr41-

[![.NET](https://github.com/clipperhouse/uax29.net/actions/workflows/dotnet.yml/badge.svg)](https://github.com/clipperhouse/uax29.net/actions/workflows/dotnet.yml)

This is the same algorithm that is implemented in Lucene's [StandardTokenizer](https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html).

### Major version changes

If you are using v1.x of this package, note that the package has been renamed in v2:

`dotnet add package uax29.net` → `dotnet add package UAX29`

`using uax29` → `using UAX29`

We now use extension methods:

`Tokenizer.Create(input)` → `input.GetWords()`

`Tokenizer.Create(input, TokenType.Graphemes)` → `input.GetGraphemes()`

### Performance

When tokenizing words, I get around 100 MB/s on my MacBook M2. For typical text, that's around 25 million tokens/s. [Benchmarks](https://github.com/clipperhouse/uax29.net/tree/main/Benchmarks)
@@ -105,7 +121,7 @@ The tokenizer is implemented as a `ref struct`, so you should see zero allocations

Calling `GetWords` et al returns a lazy enumerator, and will not allocate per-token. There are `ToList` and `ToArray` methods for convenience, which will allocate.

For `Stream` or `TextReader`/`StreamReader`, a buffer needs to be allocated behind the scenes. You can specify the size when calling `GetWords`. You can re-use the buffer by calling `SetStream` on an existing tokenizer, which will avoid re-allocation.
For `Stream` or `TextReader`/`StreamReader`, a buffer needs to be allocated behind the scenes. You can specify the size when calling `GetWords`. You can also optionally pass your own `byte[]` or `char[]` to do your own allocation, perhaps with [ArrayPool](https://learn.microsoft.com/en-us/dotnet/api/system.buffers.arraypool-1). Or, you can re-use the buffer by calling `SetStream` on an existing tokenizer, which will avoid re-allocation.

### Invalid inputs

@@ -123,10 +139,10 @@ The .NET Core standard library has a similar enumerator for graphemes.

### Other language implementations

[Java](https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html)

[JavaScript](https://github.com/tc39/proposal-intl-segmenter)

[Rust](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/trait.UnicodeSegmentation.html)

[Java](https://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/analysis/standard/StandardTokenizerImpl.html)

[Python](https://uniseg-python.readthedocs.io/en/latest/)
uax29/Tokenizer.Test.cs — 4 changes: 2 additions & 2 deletions
@@ -317,7 +317,7 @@ public void ToList()
var i = 0;
foreach (var token in tokens)
{
Assert.That(token.SequenceEqual(list[i]));
Assert.That(token.SequenceEqual(list[i].Span));
i++;
}

@@ -350,7 +350,7 @@ public void ToArray()
var i = 0;
foreach (var token in tokens)
{
Assert.That(token.SequenceEqual(array[i]));
Assert.That(token.SequenceEqual(array[i].Span));
i++;
}

uax29/Tokenizer.cs — 28 changes: 15 additions & 13 deletions
@@ -98,35 +98,37 @@ public void SetText(ReadOnlySpan<T> input)
}

/// <summary>
/// Iterates over all tokens and collects them into a list, allocating a new array for each token.
/// Iterates over all tokens and collects them into a list. A new underlying array is allocated, and original input data is copied.
/// </summary>
/// <returns>List<byte[]> or List<char[]>, depending on the input</returns>
public List<T[]> ToList()
/// <returns>List<ReadOnlyMemory<byte>> or List<ReadOnlyMemory<char>>, depending on the input.</returns>
public readonly List<ReadOnlyMemory<T>> ToList()
{
if (begun)
{
throw new InvalidOperationException("ToList must not be called after iteration has begun. You may wish to call Reset() on the tokenizer.");
}

var result = new List<T[]>();
foreach (var token in this)
var copy = this.input.ToArray();
var tokenizer = new Tokenizer<T>(copy, this.split);

var list = new List<ReadOnlyMemory<T>>();
foreach (var token in tokenizer)
{
result.Add(token.ToArray());
ReadOnlyMemory<T> mem = token.ToArray();
list.Add(mem);
}

this.Reset();
return result;
return list;
}

/// <summary>
/// Iterates over all tokens and collects them into an array, allocating a new array for each token.
/// Iterates over all tokens and collects them into an array. A new underlying array is allocated, and original input data is copied.
/// </summary>
/// <returns>byte[][] or char[][], depending on the input</returns>
public T[][] ToArray()
/// <returns>ReadOnlyMemory<byte>[] or ReadOnlyMemory<char>[], depending on the input.</returns>
public readonly ReadOnlyMemory<T>[] ToArray()
{
if (begun)
{
throw new InvalidOperationException("ToArray must not be called after iteration has begun. You may wish to call Reset() on the tokenizer.");
throw new InvalidOperationException("ToList must not be called after iteration has begun. You may wish to call Reset() on the tokenizer.");
}

return this.ToList().ToArray();
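
For reference, a short usage sketch of the changed `ToList` (assuming a `string` input, so `T` is `char`; tokens are copied out, and contents are read through `.Span` as in the updated tests):

```csharp
using System;
using System.Collections.Generic;
using UAX29;

var text = "Hello, world. 你好，世界.";

// Eager collection: each element is a ReadOnlyMemory<char> holding a copy of a token,
// so the results remain valid after the ref struct tokenizer goes out of scope.
List<ReadOnlyMemory<char>> words = text.GetWords().ToList();

foreach (var word in words)
{
    Console.WriteLine(word.Span.ToString());
}
```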