Are there any plans, or is there any interest, in adding support for indexing directly from Parquet files?
In my current project, my corpus is provisioned as Parquet files, and I'm trying to build and update indices from them as the source. Is there any prior work, interest, or plan to support this directly in the SDK? If not, would this make for a useful contribution?
From a cursory look, this seems relatively straightforward to add, e.g.:
// index_document.go

// DocumentTransform converts one row of an Arrow record batch into a
// Meilisearch document.
type DocumentTransform func(record arrow.Record, index int) (map[string]interface{}, error)

// AddDocumentsParquetFromReaderInBatchesWithContext reads Parquet data from
// objReader, converts rows to documents via transform, and indexes them in
// batches of batchSize.
func (i *index) AddDocumentsParquetFromReaderInBatchesWithContext(ctx context.Context, objReader parquet.ReaderAtSeeker, batchSize int, transform DocumentTransform, primaryKey ...string) (resp []TaskInfo, err error) {
	...
}
There could also be a default transform that works on simple flat schemas and skips nested fields, etc.
I understand that this would add the pqarrow dependency even when it isn't used, but that could probably be solved by moving the feature into a subpackage (probably under contrib) and/or behind build tags.
Any advice / input is appreciated.
Unfortunately, adding another dependency would only solve your problem temporarily, and it would cause problems for others. We have tried to develop the SDK with as few dependencies as possible to minimize interference with other packages.