Add metadata to chunk #14
Comments
I can absolutely see the value in a 'metadata', 'label', 'title' or some other 'custom properties' field, especially if the chunker is being used in a RAG flow. @estreeper @gk-per what do you think?
Hey, @gordoneliel, can you give us a better idea of the problem you're looking to solve here so we can understand how it fits into the feature set and interface of the library? Maybe some sample code of what you're hoping to achieve?
So you know where I'm coming from: in the interest of keeping the library's API and feature set as narrow as possible, I'm inclined toward patterns like this rather than trying to couple the library's output too closely to a specific app's needs:
document.text
|> TextChunker.split()
|> Enum.map(fn chunk ->
  %OurOwnChunkStruct{
    document_id: document.id,
    text: chunk.text,
    start_byte: chunk.start_byte,
    #...
  }
end)
But I'm very interested in how that may or may not fit your use case and totally open to expanding the functionality if it'll move the needle for you!
This would also be useful for storing, say, the page number of a chunked PDF. Here's a pretty interesting project where I suggested chunking by content instead of by page (but where the page number would still need to be tracked): toranb/rag-n-drop#1
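For the page-number case, one way to recover the page after chunking by content is to look it up from the chunk's start_byte. This is only a rough sketch: PageLookup, page_offsets/2 and page_for/2 are hypothetical names (not part of TextChunker), it assumes the pages were joined with a known separator before splitting, and the chunk list is a hand-written stand-in so the snippet runs without the library.

```elixir
defmodule PageLookup do
  # Hypothetical helper, not part of TextChunker.

  # Returns {page_number, starting_byte_offset} for each page, assuming the
  # pages are joined with `separator` before being split into chunks.
  def page_offsets(pages, separator \\ "\n\n") do
    sep_size = byte_size(separator)

    {offsets, _next_offset} =
      pages
      |> Enum.with_index(1)
      |> Enum.map_reduce(0, fn {page, number}, offset ->
        {{number, offset}, offset + byte_size(page) + sep_size}
      end)

    offsets
  end

  # Finds the last page whose starting offset is <= start_byte.
  def page_for(offsets, start_byte) when start_byte >= 0 do
    offsets
    |> Enum.take_while(fn {_number, offset} -> offset <= start_byte end)
    |> List.last()
    |> elem(0)
  end
end

pages = ["Page one text.", "Page two text."]
offsets = PageLookup.page_offsets(pages)

# Stand-in for chunks produced from Enum.join(pages, "\n\n"); only the
# text and start_byte fields are assumed here.
chunks = [
  %{text: "Page one text.", start_byte: 0},
  %{text: "Page two text.", start_byte: 16}
]

with_pages =
  Enum.map(chunks, fn chunk ->
    %{text: chunk.text, page: PageLookup.page_for(offsets, chunk.start_byte)}
  end)
```

A chunk that spans a page break is attributed to the page it starts on; tracking a page range instead would need the chunk's end offset as well.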
When splitting up documents, it's helpful to attach extra information to each chunk, for example the title of the document it was extracted from.
The splitter would take some optional metadata that would be passed down to every chunk:
opts = [chunk_size: 10, chunk_overlap: 5, metadata: %{doc_title: "My doc"}]
chunks = TextChunker.split(text, opts)
Then each chunk would inherit the metadata.
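Until something like that exists, roughly the same effect can be had on the caller's side. A minimal sketch, assuming nothing about the library beyond it returning a list of chunks: ChunkMetadata and attach/2 are hypothetical names, and the chunk list is a stand-in so the snippet runs on its own.

```elixir
defmodule ChunkMetadata do
  # Hypothetical helper, not part of TextChunker.
  # Pairs every chunk with the same caller-supplied metadata map.
  def attach(chunks, metadata) when is_list(chunks) and is_map(metadata) do
    Enum.map(chunks, fn chunk -> %{chunk: chunk, metadata: metadata} end)
  end
end

# Stand-in for the list returned by the splitter.
chunks = [
  %{text: "Hello ", start_byte: 0},
  %{text: "world", start_byte: 6}
]

tagged = ChunkMetadata.attach(chunks, %{doc_title: "My doc"})
```

In an app, the stand-in list would be replaced by piping the result of TextChunker.split(text, opts) into ChunkMetadata.attach/2. Wrapping the chunk rather than merging keys into it keeps the library's own struct untouched.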
Another option would be to add a title/label prop instead of metadata.
What are your thoughts on this? How are you currently adding info to the chunks you're splitting?