Add metadata to chunk #14

Open
gordoneliel opened this issue Mar 24, 2024 · 4 comments
Labels
question (Further information is requested), wontfix (This will not be worked on)

Comments

@gordoneliel

When splitting up documents, it's helpful to attach extra information to each chunk, for example the title of the document that the chunk was extracted from.

The splitter would take optional metadata that gets passed down to every chunk:

opts = [chunk_size: 10, chunk_overlap: 5, metadata: %{doc_title: "My doc"}]
chunks = TextChunker.split(text, opts)

Then each chunk would inherit the metadata.

%TextChunk{
  ...existing_props,
  metadata: %{doc_title: "My doc"}
}

Another option would be to add a title/label prop instead of metadata.

What are your thoughts on this? How are you currently adding info to the chunks you're splitting?

@stuartjohnpage
Collaborator

After we've split the text, we reduce over the produced chunks. Inside the reduce, we figure out the start and end bytes, and put that information on the chunk in question.
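
For readers following along, here is a minimal sketch of that kind of reduce. It is not the library's actual code: `ByteOffsetSketch` and `with_offsets/1` are hypothetical names, and it assumes the pieces are contiguous, non-overlapping slices of the original text (so it ignores `chunk_overlap`).

```elixir
defmodule ByteOffsetSketch do
  # Hypothetical helper, not the library's implementation.
  # Assumes `pieces` are contiguous, non-overlapping slices of the
  # original text, given in order.
  def with_offsets(pieces) do
    {chunks, _next_start} =
      Enum.reduce(pieces, {[], 0}, fn piece, {acc, start_byte} ->
        # byte_size/1 counts bytes, not graphemes, so multi-byte
        # characters advance the offset by more than one.
        end_byte = start_byte + byte_size(piece)
        chunk = %{text: piece, start_byte: start_byte, end_byte: end_byte}
        {[chunk | acc], end_byte}
      end)

    # Chunks were prepended to the accumulator, so restore the original order.
    Enum.reverse(chunks)
  end
end

ByteOffsetSketch.with_offsets(["hello", " world"])
```

With overlapping chunks the running offset would have to step back by the overlap, which this sketch deliberately leaves out.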

I can absolutely see the value in a 'metadata', 'label', 'title' or some other 'custom properties' field! Especially if the chunker is being used in a RAG flow. @estreeper @gk-per what do you think?

@gk-per

gk-per commented Apr 4, 2024

  • Metadata - I like the name metadata for the custom fields. Only other one I can think of is "payload".
  • Map structure is good for storing custom metadata.
  • I think it should be just an empty map if there is no metadata set.
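
A rough sketch of what those three points could look like together, assuming a struct with a `metadata` field that defaults to an empty map. `MetadataSketch`, its nested `Chunk` struct, and `put_metadata/2` are made-up names for illustration, not the library's API.

```elixir
defmodule MetadataSketch do
  defmodule Chunk do
    # Hypothetical struct: `metadata` defaults to an empty map,
    # so chunks without custom fields still have a map to read from.
    defstruct text: "", metadata: %{}
  end

  # Merges the caller's metadata map into every chunk's metadata.
  def put_metadata(chunks, metadata) when is_map(metadata) do
    Enum.map(chunks, fn chunk ->
      %{chunk | metadata: Map.merge(chunk.metadata, metadata)}
    end)
  end
end

chunks = [
  struct!(MetadataSketch.Chunk, text: "first"),
  struct!(MetadataSketch.Chunk, text: "second")
]

tagged = MetadataSketch.put_metadata(chunks, %{doc_title: "My doc"})
```

Defaulting to `%{}` rather than `nil` means callers can always use `Map.get/3` or pattern match on the field without a nil check.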

@grossvogel

grossvogel commented Apr 4, 2024

Hey, @gordoneliel, can you give us a better idea of the problem you're looking to solve here so we can understand how it fits into the feature set and interface of the library? Maybe some sample code of what you're hoping to achieve?

So you know where I'm coming from: In the interest of keeping the library's API and feature set as narrow as possible, I'm inclined toward patterns like this rather than trying to couple the library's output too closely to a specific app's needs:

document.text
|> TextChunker.split()
|> Enum.map(fn chunk ->
  %OurOwnChunkStruct{
    document_id: document.id,
    text: chunk.text,
    start_byte: chunk.start_byte,
    #...
  }
end)

But I'm very interested in how that may or may not fit your use case and totally open to expanding the functionality if it'll move the needle for you!

@stuartjohnpage added the question and wontfix labels on Apr 18, 2024
@cpursley
Contributor

This would also be useful for storing, say, the page number of a chunked PDF.

Here's a pretty interesting project where I suggested chunking by content instead of by page (but where the page number would still need to be tracked): toranb/rag-n-drop#1
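
As a sketch of the page-number case, assuming the PDF text is already available as `{page_number, page_text}` tuples and each page is split on its own: `PageTagSketch`, `chunk_pages/2`, and `naive_split` are hypothetical names, and the stand-in splitter only exists to keep the example self-contained. Any function that turns text into a list of chunks could be passed in its place.

```elixir
defmodule PageTagSketch do
  # Hypothetical helper: `pages` is a list of {page_number, page_text}
  # tuples and `split_fun` is any function from text to a list of chunks.
  def chunk_pages(pages, split_fun) do
    Enum.flat_map(pages, fn {page_number, page_text} ->
      page_text
      |> split_fun.()
      |> Enum.map(fn chunk -> %{chunk: chunk, page: page_number} end)
    end)
  end
end

# Stand-in splitter for the example only: splits on ". " between sentences.
naive_split = fn text -> String.split(text, ". ", trim: true) end

pages = [{1, "First page. More text"}, {2, "Second page"}]
PageTagSketch.chunk_pages(pages, naive_split)
```

This only covers chunks that stay within one page. Chunking by content across page boundaries would need a different approach, for example looking up each chunk's byte offsets against known page boundaries.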
