Releases: benbrandt/text-splitter

v0.16.1

07 Sep 11:27
e53d5e2

What's New

Updates pulldown-cmark to v0.12.1 to address an issue with high CPU usage for certain Markdown elements.

Full Changelog: v0.16.0...v0.16.1

v0.16.0

02 Sep 21:32

Breaking Changes

  • Update to v0.23.0 of tree-sitter for CodeSplitter. There was a breaking change for language definitions, so this is also a breaking change for us, especially on the Python side, since we support passing the language in.
  • Minimum Python version for the Python bindings is now 3.9 since 3.8 will be EOL next month.

Python

Make sure to upgrade to the latest version of your tree-sitter language package.

Rust

Make sure to upgrade to the latest version of your tree-sitter language crate. These now expose a LANGUAGE constant rather than a language() function.

// Before
tree_sitter_rust::language()
// After
tree_sitter_rust::LANGUAGE

What's New

  • MarkdownSplitter can better parse the Commonmark HS extension for Definition Lists.

Full Changelog: v0.15.0...v0.16.0

v0.15.0

11 Aug 05:21

What's New

  • Support version 0.20.0 of the tokenizers crate.

Python

  • No longer cause a segmentation fault when using the wrong type for tree-sitter languages. Fixes #265

Full Changelog: v0.14.1...v0.15.0

v0.14.1

06 Jul 05:38

What's New

  • Small performance improvements: the chunk size check is now skipped when we already know the chunk is too small or the check isn't needed. #261
  • Loosen dependency ranges for Rust crates to allow for more flexibility in the versions you can use.

Full Changelog: v0.14.0...v0.14.1

v0.14.0

21 Jun 20:54

What's New

Performance fixes for large documents. The worst-case performance for certain documents was abysmal, and some documents seemed to never finish processing. This release makes sure that, in the worst case, the splitter no longer binary searches over the entire document, as it did before. Doing so is prohibitively expensive, especially for the tokenizer implementations, and the search space should now always have a safe upper bound.

For the "happy path", this new approach also led to big speed gains in the CodeSplitter (50%+ speed increase in some cases), marginal regressions in the MarkdownSplitter, and not much difference in the TextSplitter. But overall, the performance should be more consistent across documents, since it wasn't uncommon for a document with certain formatting to hit the worst-case scenario previously.
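One way to picture a bounded search space is a galloping (exponential) search: grow a small window until its upper end no longer fits, then binary search only inside that window. The sketch below is an illustration under stated assumptions, not the crate's actual implementation. The function name `largest_fitting`, the `offsets` slice of candidate split points, and the assumption that size grows monotonically with prefix length are all hypothetical.

```rust
/// Returns the index of the last entry in `offsets` whose prefix of `text`
/// fits within `capacity`, or `None` if even the first prefix is too large.
///
/// Assumptions (hypothetical, for illustration only):
/// - `offsets` holds ascending byte offsets that lie on char boundaries;
/// - `size` never shrinks as the prefix grows.
fn largest_fitting<F>(text: &str, offsets: &[usize], capacity: usize, size: F) -> Option<usize>
where
    F: Fn(&str) -> usize,
{
    let fits = |i: usize| size(&text[..offsets[i]]) <= capacity;
    if offsets.is_empty() || !fits(0) {
        return None;
    }

    // Gallop: double the step until the probe no longer fits or runs off the end.
    let mut lo = 0; // index known to fit
    let mut step = 1;
    let mut hi = loop {
        let probe = lo + step;
        if probe >= offsets.len() {
            break offsets.len(); // exclusive bound past the last candidate
        }
        if !fits(probe) {
            break probe; // index known not to fit
        }
        lo = probe;
        step *= 2;
    };

    // Binary search only inside the window (lo, hi) found above.
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if fits(mid) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    Some(lo)
}
```

With an expensive `size` function such as a tokenizer, the point of this shape is that the prefixes measured stay close to the capacity instead of spanning the whole document.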

Breaking Changes

  • Chunk output may be slightly different because of the changes to the search optimizations. The previous optimization occasionally caused the splitter to stop too soon. For most cases, you may see no difference. It was most pronounced in the MarkdownSplitter at very small sizes, and any splitter using RustTokenizers because of its offset behavior.

Rust

  • ChunkSize has been removed. This was a holdover from a previous internal optimization, which turned out to not be very accurate anyway.
  • This makes implementing a custom ChunkSizer much easier, as you now only need to return the size of the chunk as a usize. Tokenization implementations often had to do extra work to calculate the size as well, which is no longer necessary.

Before

pub trait ChunkSizer {
    // Required method
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize;
}

After

pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}
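As a rough illustration of how small a custom sizer can be with the new signature, here is a minimal sketch. The trait is redeclared locally as a stand-in so the snippet compiles without the crate; in real code you would implement `text_splitter::ChunkSizer` instead. The `WordCount` type is hypothetical and not part of the library.

```rust
// Local stand-in mirroring the "After" trait above, so the sketch is
// self-contained. Real code would import the trait from the crate.
pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}

/// Hypothetical sizer that measures a chunk in whitespace-separated words.
pub struct WordCount;

impl ChunkSizer for WordCount {
    fn size(&self, chunk: &str) -> usize {
        chunk.split_whitespace().count()
    }
}
```

The implementation returns a plain count and does not need to know anything about the configured capacity.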
  • Optimization for SemanticSplitRange searching by @benbrandt in #219
  • Performance Optimization: Expanding binary search window by @benbrandt in #231

Full Changelog: v0.13.3...v0.14.0

v0.13.3

02 Jun 21:10

What's Changed

  • Fixes broken PyPI publish because of a bad dev dependency specification

Full Changelog: v0.13.2...v0.13.3

v0.13.2 - CodeSplitter

02 Jun 20:36

What's Changed

New CodeSplitter for splitting code in any language for which a tree-sitter grammar is available. It should provide decent chunks, but please provide feedback if you notice any strange behavior.

Rust Usage

cargo add text-splitter --features code
cargo add tree-sitter-<language>

use text_splitter::CodeSplitter;
// Default implementation uses character count for chunk size.
// Can also use all of the same tokenizer implementations as `TextSplitter`.
let splitter = CodeSplitter::new(tree_sitter_rust::language(), 1000).expect("Invalid tree-sitter language");

let chunks = splitter.chunks("your code file");

Python Usage

from semantic_text_splitter import CodeSplitter
import tree_sitter_python

# Default implementation uses character count for chunk size.
# Can also use all of the same tokenizer implementations as `TextSplitter`.
splitter = CodeSplitter(tree_sitter_python.language(), capacity=1000)

chunks = splitter.chunks("your code file")

Full Changelog: v0.13.1...v0.13.2

v0.13.1

07 May 20:59

What's Changed

  • Fix a bug in the fallback logic to make sure we still respect the maximum number of bytes we should be searching in. Again, this only affects Markdown splitting at very small sizes. #174

Full Changelog: v0.13.0...v0.13.1

v0.13.0

05 May 23:02

What's New / Breaking Changes

Unicode Segmentation is now only used as a fallback. This prioritizes the semantic levels of each splitter, and only uses Unicode grapheme/word/sentence segmentation when none of the semantic levels can be split at the desired capacity.

In most cases, this won't change the behavior of the splitter, and will likely mean that speed will improve because it is able to skip several semantic levels at the start, acting as a bisect or binary search, and only go back to the lower levels if it can't fit.

However, for the MarkdownSplitter at very small sizes (i.e., less than 16 tokens), this may produce different output. Prior to this change, the splitter may have used Unicode sentence segmentation instead of the Markdown semantic levels, due to an optimization in the level selection. Now, the splitter will prioritize the parsed Markdown levels before it falls back to Unicode segmentation, which preserves better structure at small sizes.

So in most cases this is likely a non-breaking update. However, if you were using extremely small chunk sizes for Markdown, the behavior is different, and I wanted to indicate that with a major version bump.
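The "semantic levels first, segmentation only as a fallback" ordering can be sketched roughly as below. This is a simplified illustration, not the splitter's real level selection. The `Level` enum, the two separator-based levels, and the use of plain `char` counting as a stand-in for Unicode grapheme/word/sentence segmentation are all assumptions made for the example.

```rust
/// Hypothetical levels, ordered from coarsest to finest.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Level {
    Paragraph, // split on blank lines
    Line,      // split on single newlines
    Char,      // stand-in for the Unicode segmentation fallback
}

/// Picks the coarsest semantic level whose first piece fits in `capacity`
/// (measured in chars), falling back to `Level::Char` only if none does.
fn pick_level(text: &str, capacity: usize) -> Level {
    let semantic = [(Level::Paragraph, "\n\n"), (Level::Line, "\n")];
    for &(level, sep) in semantic.iter() {
        let first = text.split(sep).next().unwrap_or("");
        if !first.is_empty() && first.chars().count() <= capacity {
            return level;
        }
    }
    // No semantic level produced a piece that fits: use the fallback.
    Level::Char
}
```

For a text such as "Intro line one\nline two\n\nSecond paragraph", a capacity of 30 selects `Level::Paragraph`, 15 selects `Level::Line`, and 5 drops to the fallback.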

Full Changelog: v0.12.3...v0.13.0

v0.12.3

01 May 13:42

Bug Fix

Remove leftover dbg! statements in chunk overlap code #154 🤦🏻‍♂️

Apologies if I spammed your logs!

Full Changelog: v0.12.2...v0.12.3