Fix training_data bug in test_ingest_with_training_source_uri_tdb and validate training data dimensions in ingest() #175

Merged
jparismorgan merged 2 commits into main from jparismorgan/ingest-training-data-bug
Dec 20, 2023

Conversation

jparismorgan
Contributor

What

When I created training_data in the new test_ingest_with_training_source_uri_tdb() test, I sliced the original data incorrectly: it had already been transposed, so a plain slice picked the wrong axis. As a result we computed a 2x2 centroid array instead of a 4x2 one, as you can see here:

(TileDB-Vector-Search) ~/repo/TileDB-Vector-Search/apis/python pytest test/test_ingestion.py -s            ✹main 
============================================== test session starts ===============================================
platform darwin -- Python 3.9.18, pytest-7.4.3, pluggy-1.3.0
rootdir: /Users/parismorgan/repo/TileDB-Vector-Search/apis/python
plugins: nbmake-1.4.6
collected 1 item                                                                                                 

test/test_ingestion.py [test_ingestion@test_ingest_with_training_source_uri_tdb] data (4, 5) 
 [[1.  2.  3.  4.  5. ]
 [1.1 2.1 3.1 4.1 5.1]
 [1.2 2.2 3.2 4.2 5.2]
 [1.3 2.3 3.3 4.3 5.3]]
[test_ingestion@test_ingest_with_training_source_uri_tdb] training_data (2, 4) 
 [[2.  2.1 2.2 2.3]
 [3.  3.1 3.2 3.3]]
[test_ingestion@test_ingest_with_training_source_uri_tdb] ingest() ======================================
[ingestion@ingest] copy_centroids_uri None training_sample_size -1 training_input_vectors None training_source_uri /private/var/folders/jb/5gq49wh97wn0j7hj6zfn9pzh0000gn/T/pytest-of-parismorgan/pytest-1217/test_ingest_with_training_sour0/dataset/training_data.tdb training_source_type None
[ingest@read_source_metadata] schema ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 3), tile=3, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
    Dim(name='cols', domain=(0, 4), tile=3, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='values', dtype='float32', var=False, nullable=False, enum_label=None),
  ],
  cell_order='col-major',
  tile_order='col-major',
  sparse=False,
)

[ivf_flat_index.py@create] dimensions 4 vector_type float32
[ingest@centralised_kmeans] training_sample_size 5 training_source_uri /private/var/folders/jb/5gq49wh97wn0j7hj6zfn9pzh0000gn/T/pytest-of-parismorgan/pytest-1217/test_ingest_with_training_sour0/dataset/training_data.tdb training_source_type None
[ingest@read_source_metadata] schema ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 1), tile=1, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
    Dim(name='cols', domain=(0, 3), tile=3, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='values', dtype='float32', var=False, nullable=False, enum_label=None),
  ],
  cell_order='col-major',
  tile_order='col-major',
  sparse=False,
)

[ingest@centralised_kmeans] reading from training_source_uri: training_source_uri /private/var/folders/jb/5gq49wh97wn0j7hj6zfn9pzh0000gn/T/pytest-of-parismorgan/pytest-1217/test_ingest_with_training_sour0/dataset/training_data.tdb training_source_type TILEDB_ARRAY training_in_size 4 training_dimensions 2 training_vector_type float32
[ingestion@centralized_kmeans] sample_vectors (4, 2) 
 [[2.  3. ]
 [2.1 3.1]
 [2.2 3.2]
 [2.3 3.3]] 
centroids (2, 2) 
 [[2.05 2.25]
 [3.05 3.25]]
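
To make the shape mismatch concrete, here is a minimal NumPy sketch that reproduces the two training_data arrays printed in the logs. It is reconstructed from the logged shapes and values only. The exact slicing expressions in the test are not shown in this PR description, so `data.T[1:3]` and `data[:, 0:2]` are assumptions about one way to arrive at each result.

```python
import numpy as np

# Reconstructed from the logged output: 5 vectors of dimension 4, stored
# transposed so that each column is one vector.
data = np.array(
    [
        [1.0, 2.0, 3.0, 4.0, 5.0],
        [1.1, 2.1, 3.1, 4.1, 5.1],
        [1.2, 2.2, 3.2, 4.2, 5.2],
        [1.3, 2.3, 3.3, 4.3, 5.3],
    ],
    dtype=np.float32,
)

# Assumed reproduction of the bug: taking vectors 1 and 2 as rows ignores the
# transposed layout and yields a (2, 4) array.
buggy_training_data = data.T[1:3]

# Slicing along the column axis keeps all 4 dimensions and takes the first
# 2 vectors, which matches the (4, 2) array in the fixed run.
fixed_training_data = data[:, 0:2]

print("buggy", buggy_training_data.shape)
print("fixed", fixed_training_data.shape)
```

With the (2, 4) array, the training source is read as 4 vectors of 2 dimensions, which is why k-means produced 2x2 centroids for an index whose vectors have 4 dimensions.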

This PR fixes the training_data slicing in the test and also adds a check to ingest() so that we raise an exception if the training data dimensions are wrong. Here is the same test now working correctly (look for centroids at the bottom):

(TileDB-Vector-Search) ~/repo/TileDB-Vector-Search/apis/python pytest test/test_ingestion.py -s       2 ↵  ✹main 
============================================== test session starts ===============================================
platform darwin -- Python 3.9.18, pytest-7.4.3, pluggy-1.3.0
rootdir: /Users/parismorgan/repo/TileDB-Vector-Search/apis/python
plugins: nbmake-1.4.6
collected 1 item                                                                                                 

test/test_ingestion.py [test_ingestion@test_ingest_with_training_source_uri_tdb] data (4, 5) 
 [[1.  2.  3.  4.  5. ]
 [1.1 2.1 3.1 4.1 5.1]
 [1.2 2.2 3.2 4.2 5.2]
 [1.3 2.3 3.3 4.3 5.3]]
[test_ingestion@test_ingest_with_training_source_uri_tdb] training_data (4, 2) 
 [[1.  2. ]
 [1.1 2.1]
 [1.2 2.2]
 [1.3 2.3]]
[test_ingestion@test_ingest_with_training_source_uri_tdb] ingest() ======================================
[ingestion@ingest] copy_centroids_uri None training_sample_size -1 training_input_vectors None training_source_uri /private/var/folders/jb/5gq49wh97wn0j7hj6zfn9pzh0000gn/T/pytest-of-parismorgan/pytest-1218/test_ingest_with_training_sour0/dataset/training_data.tdb training_source_type None
[ingest@read_source_metadata] schema ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 3), tile=3, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
    Dim(name='cols', domain=(0, 4), tile=3, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='values', dtype='float32', var=False, nullable=False, enum_label=None),
  ],
  cell_order='col-major',
  tile_order='col-major',
  sparse=False,
)

[ivf_flat_index.py@create] dimensions 4 vector_type float32
[ingest@centralised_kmeans] training_sample_size 5 training_source_uri /private/var/folders/jb/5gq49wh97wn0j7hj6zfn9pzh0000gn/T/pytest-of-parismorgan/pytest-1218/test_ingest_with_training_sour0/dataset/training_data.tdb training_source_type None
[ingest@read_source_metadata] schema ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 3), tile=3, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
    Dim(name='cols', domain=(0, 1), tile=1, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='values', dtype='float32', var=False, nullable=False, enum_label=None),
  ],
  cell_order='col-major',
  tile_order='col-major',
  sparse=False,
)

[ingest@centralised_kmeans] reading from training_source_uri: training_source_uri /private/var/folders/jb/5gq49wh97wn0j7hj6zfn9pzh0000gn/T/pytest-of-parismorgan/pytest-1218/test_ingest_with_training_sour0/dataset/training_data.tdb training_source_type TILEDB_ARRAY training_in_size 2 training_dimensions 4 training_vector_type float32
[ingestion@centralized_kmeans] sample_vectors (2, 4) 
 [[1.  1.1 1.2 1.3]
 [2.  2.1 2.2 2.3]] 
centroids (4, 2) 
 [[1.5       0.       ]
 [1.5999999 0.       ]
 [1.7       0.       ]
 [1.8       0.       ]]
.

Testing

  • Manual test as shown above
  • Adds checks to the unit tests that exercise this scenario
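
As a rough illustration of the dimension check and of how a test could exercise it, here is a minimal sketch. The helper name `check_training_dimensions`, its parameters, and the error message are hypothetical. The actual check added to ingest() is not shown in this description and may differ.

```python
def check_training_dimensions(dimensions: int, training_dimensions: int) -> None:
    """Hypothetical helper: reject training data whose vector dimension
    differs from the dimension of the vectors being ingested."""
    if training_dimensions != dimensions:
        raise ValueError(
            f"training data has {training_dimensions} dimensions, "
            f"but the source vectors have {dimensions}"
        )


# Values taken from the fixed run in the logs: both sides report 4 dimensions.
check_training_dimensions(dimensions=4, training_dimensions=4)

# Values taken from the buggy run: the index has 4 dimensions, the training
# data reports 2, so the check should raise.
try:
    check_training_dimensions(dimensions=4, training_dimensions=2)
    raised = False
except ValueError:
    raised = True

print("mismatch rejected:", raised)
```

Failing fast at this point avoids silently training centroids of the wrong shape, which is what the first log above shows.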

Note

I also added a few checks on the inputs to ingest(); they seem safe and potentially useful.

@jparismorgan jparismorgan marked this pull request as ready for review December 20, 2023 11:50
@jparismorgan jparismorgan merged commit f5734f9 into main Dec 20, 2023
4 checks passed
@jparismorgan jparismorgan deleted the jparismorgan/ingest-training-data-bug branch December 20, 2023 13:20