Releases: benbrandt/text-splitter
v0.12.2 - Chunk Overlap
What's New
Support for chunk overlapping: Several of you have been waiting on this for a while now, and I am happy to say that chunk overlapping is now available in a way that still stays true to the spirit of finding good semantic break points.
When chunk overlapping is enabled and a new chunk is emitted, the splitter will look back at the semantic sections of the current level and pull in as many as possible that fit within the overlap window. This does mean that sometimes none can be taken, which is often the case when close to a higher semantic level boundary.
Overlap will almost always be produced when the current semantic level couldn't fit into a single chunk. In that case, the overlapping sections are especially valuable, since we may not have found a good break point in the middle of the section, which seems to be the main motivation for using chunk overlapping in the first place.
Rust Usage
use text_splitter::{ChunkConfig, TextSplitter};

let chunk_config = ChunkConfig::new(256)
    // .with_sizer(sizer) // Optional tokenizer or other chunk sizer impl
    .with_overlap(64)
    .expect("Overlap must be less than desired chunk capacity");
let splitter = TextSplitter::new(chunk_config); // Or MarkdownSplitter
Python Usage
from semantic_text_splitter import TextSplitter

splitter = TextSplitter(256, overlap=64)  # or any of the class methods to use a tokenizer
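To make the overlap behavior concrete, here is a minimal, hypothetical sketch in Rust. The exact chunk boundaries depend on the text and the splitting algorithm, so treat this as illustrative only: when the capacity forces a split mid-section, consecutive chunks should share some trailing/leading content.

use text_splitter::{ChunkConfig, TextSplitter};

// Small capacity forces splits mid-sentence, so consecutive chunks
// should share overlapping words (illustrative values only).
let config = ChunkConfig::new(40)
    .with_overlap(15)
    .expect("Overlap must be less than desired chunk capacity");
let splitter = TextSplitter::new(config);

for chunk in splitter.chunks("Chunk overlap pulls earlier sections of the current semantic level into the next chunk.") {
    println!("{chunk:?}");
}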
Full Changelog: v0.12.1...v0.12.2
v0.12.1 - rust_tokenizers support
What's Changed
`rust_tokenizers` support has been added to the Rust crate in #156
Full Changelog: v0.12.0...v0.12.1
v0.12.0 - Centralized Chunk Configuration
What's New
This release is a big API change to pull all chunk configuration options into the same place, at initialization of the splitters. This was motivated by two things:
- These settings are all important to deciding how to split the text for a given use case, and in practice I saw them often being set together anyway.
- To prep the library for new features like chunk overlap, where error handling has to be introduced to make sure that invariants are kept between all of the settings. These errors should be handled as soon as possible, before chunking the text.
Overall, I think this aligns the library with the usage I have seen in the wild, and pulls all of the settings for the "domain" of chunking into a single unit.
Breaking Changes
Rust
- Trimming is now enabled by default. This brings the Rust crate into alignment with the Python package. For every use case I saw, this was already being set to `true`, and it logically makes sense as the default behavior.
- `TextSplitter` and `MarkdownSplitter` now take a `ChunkConfig` in their `::new` method
  - This brings the `ChunkSizer`, `ChunkCapacity` and `trim` settings into a single struct that can be instantiated with a builder-lite pattern.
  - The `with_trim_chunks` method has been removed from `TextSplitter` and `MarkdownSplitter`. You can now set `trim` in the `ChunkConfig` struct.
- `ChunkCapacity` is now a struct instead of a trait. If you were using a custom `ChunkCapacity`, you can change your `impl` to a `From<TYPE> for ChunkCapacity` instead, and you should still be able to pass it in to all of the same methods.
  - This also means `ChunkSizer`s take a concrete `ChunkCapacity` type in their method instead of an `impl ChunkCapacity`.
Migration Examples
Default settings:
/// Before
let splitter = TextSplitter::default().with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);
/// After
let splitter = TextSplitter::new(500);
let chunks = splitter.chunks("your document text");
Hugging Face Tokenizers:
/// Before
let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
let splitter = TextSplitter::new(tokenizer).with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);
/// After
let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
let splitter = TextSplitter::new(ChunkConfig::new(500).with_sizer(tokenizer));
let chunks = splitter.chunks("your document text");
Tiktoken:
/// Before
let tokenizer = cl100k_base().unwrap();
let splitter = TextSplitter::new(tokenizer).with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);
/// After
let tokenizer = cl100k_base().unwrap();
let splitter = TextSplitter::new(ChunkConfig::new(500).with_sizer(tokenizer));
let chunks = splitter.chunks("your document text");
Ranges:
/// Before
let splitter = TextSplitter::default().with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500..2000);
/// After
let splitter = TextSplitter::new(500..2000);
let chunks = splitter.chunks("your document text");
Markdown:
/// Before
let splitter = MarkdownSplitter::default().with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);
/// After
let splitter = MarkdownSplitter::new(500);
let chunks = splitter.chunks("your document text");
ChunkSizer impls
pub trait ChunkSizer {
    /// Before
    fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize;
    /// After
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize;
}
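For illustration, a hedged sketch of what a custom sizer might look like against the new signature. The `ChunkSize::from_size` constructor is an assumption here; check the crate docs for the exact `ChunkSize` API:

use text_splitter::{ChunkCapacity, ChunkSize, ChunkSizer};

// Hypothetical custom sizer that measures chunks by character count,
// showing the new concrete `&ChunkCapacity` parameter.
struct CharSizer;

impl ChunkSizer for CharSizer {
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize {
        // Assumed constructor; consult the crate docs for the exact API.
        ChunkSize::from_size(chunk.chars().count(), capacity)
    }
}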
ChunkCapacity impls
/// Before
impl ChunkCapacity for Range<usize> {
    fn start(&self) -> Option<usize> {
        Some(self.start)
    }

    fn end(&self) -> usize {
        self.end.saturating_sub(1).max(self.start)
    }
}

/// After
impl From<Range<usize>> for ChunkCapacity {
    fn from(range: Range<usize>) -> Self {
        ChunkCapacity::new(range.start)
            .with_max(range.end.saturating_sub(1).max(range.start))
            .expect("invalid range")
    }
}
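With a `From` impl like this, your custom type converts into a `ChunkCapacity` wherever one is accepted. A minimal sketch, assuming `ChunkConfig::new` accepts anything implementing `Into<ChunkCapacity>` (consistent with the range examples above):

// Ranges convert into ChunkCapacity via a From impl like the one above.
let config = ChunkConfig::new(500..2000);
let splitter = TextSplitter::new(config);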
Python
- Chunk `capacity` is now a required argument in the `__init__` and classmethods of `TextSplitter` and `MarkdownSplitter`
- The `trim_chunks` parameter is now just `trim` in the `__init__` and classmethods of `TextSplitter` and `MarkdownSplitter`
Migration Examples
Default settings:
# Before
splitter = TextSplitter()
chunks = splitter.chunks("your document text", 500)
# After
splitter = TextSplitter(500)
chunks = splitter.chunks("your document text")
Ranges:
# Before
splitter = TextSplitter()
chunks = splitter.chunks("your document text", (200,1000))
# After
splitter = TextSplitter((200,1000))
chunks = splitter.chunks("your document text")
Hugging Face Tokenizers:
# Before
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer)
chunks = splitter.chunks("your document text", 500)
# After
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, 500)
chunks = splitter.chunks("your document text")
Tiktoken:
# Before
splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo")
chunks = splitter.chunks("your document text", 500)
# After
splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo", 500)
chunks = splitter.chunks("your document text")
Custom callback:
# Before
splitter = TextSplitter.from_callback(lambda text: len(text))
chunks = splitter.chunks("your document text", 500)
# After
splitter = TextSplitter.from_callback(lambda text: len(text), 500)
chunks = splitter.chunks("your document text")
Markdown:
# Before
splitter = MarkdownSplitter()
chunks = splitter.chunks("your document text", 500)
# After
splitter = MarkdownSplitter(500)
chunks = splitter.chunks("your document text")
Full Changelog: v0.11.0...v0.12.0
v0.11.0
Breaking Changes
- Bump tokenizers from 0.15.2 to 0.19.1 by @dependabot in #144 #146
Other updates
- Bump either from 1.10.0 to 1.11.0 by @dependabot in #141
- Bump pyo3 from 0.21.1 to 0.21.2 by @dependabot in #142
Full Changelog: v0.10.0...v0.11.0
v0.10.0
Breaking Changes
Improved (but different) Markdown split points #137. In hindsight, the levels used for determining split points in Markdown text were too granular, which led to some strange split points.
Several element types were consolidated into the same levels, which should still provide a good balance between splitting at the right points and not splitting too often.
Because the output of the `MarkdownSplitter` will be substantially different, especially for smaller chunk sizes, this is considered a breaking change.
Full Changelog: v0.9.1...v0.10.0
v0.9.1
What's Changed
Python `TextSplitter` and `MarkdownSplitter` now both provide a new `chunk_indices` method that returns a list of not only the chunks, but also their corresponding character offsets relative to the original text. This should allow for different string comparison and matching operations on the chunks.
def chunk_indices(
    self, text: str, chunk_capacity: Union[int, Tuple[int, int]]
) -> List[Tuple[int, str]]:
    ...
A similar method already existed on the Rust side. The key difference is that these offsets are character offsets, not byte offsets. For Rust strings, it is usually helpful to have the byte offset, but in Python, most string methods and operations deal with character indices.
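For reference, a minimal sketch of how the existing Rust-side method can be used at this version; it yields byte offsets (hedged: check the crate docs for the exact signature):

// Rust side: offsets are byte offsets into the original &str.
let splitter = TextSplitter::default();
for (offset, chunk) in splitter.chunk_indices("your document text", 500) {
    println!("{offset}: {chunk:?}");
}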
by @benbrandt in #135
Full Changelog: v0.9.0...v0.9.1
v0.9.0
What's New
More robust handling of Hugging Face tokenizers as chunk sizers.
- Tokenizers with padding enabled no longer count padding tokens when generating chunks. This caused some unexpected behavior, especially if the chunk capacity didn't perfectly line up with the padding size(s). Now, the tokenizer's padding token is ignored when counting the number of tokens generated in a chunk.
- In the process, it also became clear there were some false assumptions about how the byte offset ranges were calculated for each token. This has been fixed, and the byte offset ranges should now be more accurate when determining the boundaries of each token. This only affects some optimizations in chunk sizing, and should not affect the actual chunk output.
Breaking Changes
There should only be breaking chunk output for those of you using a Hugging Face tokenizer with padding enabled. Because padding tokens are no longer counted, the chunks will likely be larger than before, and closer to the desired behavior.
Note: This will mean the generated chunks may also be larger than the chunk capacity when tokenized, because padding tokens will be added when you tokenize the chunk. The chunk capacity for these tokenizers reflects the number of tokens used in the text, not necessarily the number of tokens that the tokenizer will generate in total.
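As a rough illustration of the counting rule described above (a sketch of the idea, not the crate's actual internals), ignoring the padding token when measuring a chunk might look like:

// Hypothetical helper: count tokens, skipping the tokenizer's padding id.
fn count_non_padding(token_ids: &[u32], pad_id: Option<u32>) -> usize {
    token_ids.iter().filter(|&&id| Some(id) != pad_id).count()
}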
Full Changelog: v0.8.1...v0.9.0
v0.8.1
v0.8.0 - Performance Improvements
What's New
Significantly fewer allocations are necessary when generating chunks. This should result in a performance improvement for most use cases. This was achieved both by reusing pre-allocated collections and by memoizing chunk size calculations (see the sketch after the benchmark list), since that is often the bottleneck, and tokenizer libraries tend to be very allocation heavy!
Benchmarks show:
- 20-40% fewer allocations caused by the core algorithm.
- Up to 20% fewer allocations when using tokenizers to calculate chunk sizes.
- In some cases, especially with Markdown, these improvements can also result in up to 20% faster chunk generation.
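On the memoization point, here is a generic sketch of the idea (hypothetical types, not the crate's actual internals): cache the computed size for each candidate byte range so the tokenizer only runs once per unique span.

use std::collections::HashMap;

// Hypothetical sketch: cache chunk sizes keyed by the byte range of a
// candidate chunk, so repeated size checks skip re-running the tokenizer.
struct MemoizedSizer<F: Fn(&str) -> usize> {
    sizer: F,
    cache: HashMap<(usize, usize), usize>,
}

impl<F: Fn(&str) -> usize> MemoizedSizer<F> {
    fn chunk_size(&mut self, text: &str, range: (usize, usize)) -> usize {
        let sizer = &self.sizer;
        *self.cache
            .entry(range)
            .or_insert_with(|| sizer(&text[range.0..range.1]))
    }
}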
Breaking Changes
- There was a bug in the `MarkdownSplitter` logic that caused some strange split points.
- The `Text` semantic level in `MarkdownSplitter` has been merged with inline elements to also find better split points inside content.
- Fixed a bug that could cause the algorithm to use a lower semantic level than necessary on occasion. This mostly impacted the `MarkdownSplitter`, but there were some cases of different behavior in the `TextSplitter` as well if chunks are not trimmed.
All of the above can cause different chunks to be output than before, depending on the text. So, even though these are bug fixes to restore intended behavior, they are being treated as a major version bump.
Full Changelog: v0.7.0...v0.8.0
v0.7.0 - Markdown Support
What's New
Markdown Support! Both the Rust crate and Python package have a new `MarkdownSplitter` you can use to split markdown text. It leverages the great work of the `pulldown-cmark` crate to parse markdown according to the CommonMark spec, and allows for very fine-grained control over how to split the text.
In terms of use, the API is identical to the `TextSplitter`, so you should be able to just drop it in when you have Markdown available instead of just plain text.
Rust
use text_splitter::MarkdownSplitter;

// Default implementation uses character count for chunk size.
// Can also use all of the same tokenizer implementations as `TextSplitter`.
let splitter = MarkdownSplitter::default()
    // Optionally can also have the splitter trim whitespace for you. It
    // will preserve indentation if multiple lines are covered in a chunk.
    .with_trim_chunks(true);
let chunks = splitter.chunks("# Header\n\nyour document text", 1000);
Python
from semantic_text_splitter import MarkdownSplitter
# Default implementation uses character count for chunk size.
# Can also use all of the same tokenizer implementations as `TextSplitter`.
# By default it will also trim whitespace for you.
# It will preserve indentation if multiple lines are covered in a chunk.
splitter = MarkdownSplitter()
chunks = splitter.chunks("# Header\n\nyour document text", 1000)
Breaking Changes
Rust
MSRV is now 1.75.0, since the ability to use `impl Trait` in trait methods allowed for much simpler internal APIs to enable the `MarkdownSplitter`.
Python
The `CharacterTextSplitter`, `HuggingFaceTextSplitter`, `TiktokenTextSplitter`, and `CustomTextSplitter` classes have now all been consolidated into a single `TextSplitter` class. All of the previous use cases are still supported; you just need to instantiate the class with the appropriate class method.
Below are the changes you need to make to your code to upgrade to v0.7.0:
CharacterTextSplitter
# Before
from semantic_text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter()
# After
from semantic_text_splitter import TextSplitter
splitter = TextSplitter()
HuggingFaceTextSplitter
# Before
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = HuggingFaceTextSplitter(tokenizer)
# After
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer)
TiktokenTextSplitter
# Before
from semantic_text_splitter import TiktokenTextSplitter
splitter = TiktokenTextSplitter("gpt-3.5-turbo")
# After
from semantic_text_splitter import TextSplitter
splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo")
CustomTextSplitter
# Before
from semantic_text_splitter import CustomTextSplitter
splitter = CustomTextSplitter(lambda text: len(text))
# After
from semantic_text_splitter import TextSplitter
splitter = TextSplitter.from_callback(lambda text: len(text))
Full Changelog: v0.6.3...v0.7.0