lums/tmp/teo/kmeans #147

Merged
merged 3 commits into from
Nov 14, 2023

Conversation

lums658
Contributor

@lums658 lums658 commented Oct 6, 2023

This PR manually reapplies the various commits in the diverged descendants of teo/kmeans that led to such a disaster in PR #130:

Updates since d085f66:

  • Added code in kmeans to reassign centroids that have small (including empty) partitions. The kmeans_index class now has a member variable reassign_ratio_; any partition smaller than reassign_ratio_ times the size of the largest partition is reassigned. The default value of reassign_ratio_ is 0.025, so the ratio of largest to smallest partition is bounded by a factor of 40 (1 / 0.025).
    • For each centroid c_i to be reassigned, we choose the vector in the training set with the i'th largest score, where a vector's score is its distance from its own centroid (so the farthest vectors are picked first).
    • This helps quite a bit in the case of random initialization. It does not help as much in the case of kmeans++ initialization because the centroids are already well-distributed.
    • Also tried reassignment to a random vector, but that did not work as well as reassigning to the most distant vector.
  • Added code in kmeans to check for convergence. At each iteration it measures how much the centroids have changed since the previous iteration and compares that to the total "potential" of all the centroids. If the delta is less than a tolerance times the total, the kmeans iteration breaks.
  • Created numerous unit tests for various components of the library used by kmeans_index
  • Updated fixed_min_heap to take a comparison function object (to be able to create fixed max heaps, e.g.)
  • Created qv_partition_with_scores to provide a scores vector to be used in reassignment.
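
To make the reassignment and convergence rules above concrete, here is a minimal sketch. It is not the code in this PR: the function names (small_partitions, farthest_vectors, has_converged) are hypothetical, scores are assumed to be each vector's distance to its own centroid, and the centroid "delta" is assumed to be the summed squared movement of all centroids. The "total potential" is left as a caller-supplied value because its exact definition is not spelled out here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: indices of partitions whose size is below
// reassign_ratio times the size of the largest partition. Empty partitions
// qualify whenever at least one partition is non-empty.
std::vector<std::size_t> small_partitions(
    const std::vector<std::size_t>& sizes, double reassign_ratio = 0.025) {
  assert(reassign_ratio >= 0.0 && reassign_ratio <= 1.0);
  std::vector<std::size_t> result;
  if (sizes.empty()) return result;
  const double largest =
      static_cast<double>(*std::max_element(sizes.begin(), sizes.end()));
  const double threshold = reassign_ratio * largest;
  for (std::size_t p = 0; p < sizes.size(); ++p) {
    if (static_cast<double>(sizes[p]) < threshold) result.push_back(p);
  }
  return result;
}

// Hypothetical helper: indices of the n training vectors with the largest
// scores, in descending score order. The i'th entry is the replacement for
// the i'th centroid being reassigned.
std::vector<std::size_t> farthest_vectors(
    const std::vector<float>& scores, std::size_t n) {
  n = std::min(n, scores.size());
  std::vector<std::size_t> idx(scores.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::partial_sort(
      idx.begin(), idx.begin() + n, idx.end(),
      [&](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });
  idx.resize(n);
  return idx;
}

// Assumed convergence test: delta is the summed squared movement of all
// centroids since the previous iteration; total is the reference
// "potential" supplied by the caller.
bool has_converged(const std::vector<std::vector<float>>& prev,
                   const std::vector<std::vector<float>>& curr,
                   double total, double tolerance) {
  assert(prev.size() == curr.size());
  double delta = 0.0;
  for (std::size_t c = 0; c < curr.size(); ++c) {
    assert(prev[c].size() == curr[c].size());
    for (std::size_t d = 0; d < curr[c].size(); ++d) {
      const double diff = static_cast<double>(curr[c][d]) - prev[c][d];
      delta += diff * diff;
    }
  }
  return delta < tolerance * total;
}
```

In a training loop one would call small_partitions on the partition sizes, take that many entries from farthest_vectors, overwrite each small centroid with its matching vector, and break out of the iteration once has_converged returns true.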

There are a few todos:

  • Parallelize the loop over all the training_set vectors in kmeans. (Partitions are, well, partitioned among threads. Every thread loops over the entire training_set but only processes vectors from its designated set of partitions.)
  • Implement the greedy kmeans++ algorithm, which samples multiple candidate centroids at each iteration and keeps the one that most decreases the potential. This approach is what scikit-learn uses and is reported in the literature to be better than plain kmeans++.
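
For the second todo, a rough sketch of what greedy kmeans++ seeding could look like. This is an illustration only, not a proposal for the final API: greedy_kmeanspp, potential, sq_dist, and the Point alias are hypothetical names, the potential is taken to be the sum of squared distances from each vector to its nearest chosen centroid, and the number of trials is a plain parameter (scikit-learn, as far as I know, defaults to 2 + floor(ln k)).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

using Point = std::vector<float>;

// Squared Euclidean distance between two vectors of equal dimension.
double sq_dist(const Point& a, const Point& b) {
  assert(a.size() == b.size());
  double s = 0.0;
  for (std::size_t d = 0; d < a.size(); ++d) {
    const double diff = static_cast<double>(a[d]) - b[d];
    s += diff * diff;
  }
  return s;
}

// Potential: sum over all vectors of the squared distance to the nearest
// of the given centres (centres are indices into data).
double potential(const std::vector<Point>& data,
                 const std::vector<std::size_t>& centres) {
  double total = 0.0;
  for (const Point& x : data) {
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t c : centres) best = std::min(best, sq_dist(x, data[c]));
    total += best;
  }
  return total;
}

// Hypothetical greedy kmeans++ seeding. Returns the indices of the k
// training vectors chosen as initial centroids.
std::vector<std::size_t> greedy_kmeanspp(const std::vector<Point>& data,
                                         std::size_t k, std::size_t n_trials,
                                         unsigned seed) {
  assert(k >= 1 && k <= data.size() && n_trials >= 1);
  std::mt19937 gen(seed);
  std::uniform_int_distribution<std::size_t> uniform(0, data.size() - 1);

  // The first centre is drawn uniformly at random.
  std::vector<std::size_t> chosen{uniform(gen)};
  std::vector<double> closest(data.size());
  for (std::size_t i = 0; i < data.size(); ++i)
    closest[i] = sq_dist(data[i], data[chosen[0]]);

  while (chosen.size() < k) {
    double total = 0.0;
    for (double c : closest) total += c;
    if (total <= 0.0) {  // every vector coincides with a chosen centre
      chosen.push_back(uniform(gen));
      continue;
    }
    // Sample candidates with probability proportional to squared distance
    // from the nearest chosen centre; already-chosen points have weight 0.
    std::discrete_distribution<std::size_t> by_dist(closest.begin(),
                                                    closest.end());
    double best_pot = std::numeric_limits<double>::infinity();
    std::size_t best = 0;
    for (std::size_t t = 0; t < n_trials; ++t) {
      const std::size_t cand = by_dist(gen);
      double pot = 0.0;
      for (std::size_t i = 0; i < data.size(); ++i)
        pot += std::min(closest[i], sq_dist(data[i], data[cand]));
      if (pot < best_pot) {
        best_pot = pot;
        best = cand;
      }
    }
    // Keep the candidate that leaves the smallest potential.
    chosen.push_back(best);
    for (std::size_t i = 0; i < data.size(); ++i)
      closest[i] = std::min(closest[i], sq_dist(data[i], data[best]));
  }
  return chosen;
}
```

With n_trials set to 1 this reduces to plain kmeans++. Each extra trial costs one more pass over the training set per centroid, which is the trade-off to weigh when choosing the default.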

@teo-tsirpanis
Member

Random initialization:
sklearn score: 6.619185447692871
tiledb score: 5.118894100189209
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 4.516618251800537

Random initialization:
sklearn score: 6.619184494018555
tiledb score: 5.32197380065918
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 4.792552471160889

Random initialization:
sklearn score: 6.619185447692871
tiledb score: 5.255888938903809
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 4.764191150665283

@ihnorton ihnorton marked this pull request as ready for review October 13, 2023 14:55
@ihnorton ihnorton closed this Nov 14, 2023
@ihnorton ihnorton reopened this Nov 14, 2023
teo-tsirpanis and others added 3 commits November 14, 2023 07:15
Squashed from #147
pick 24f5bb6 Default the shuffled ID and index types of `kmeans_index` to `size_t`.
pick 8fe7850 Enable the k-means initialization tests.
pick 23bf690 Support specifying the seed when creating a `kmeans_index`.
pick ce8745f Avoid randomly choosing the same centroid many times.
pick 0674917 Apply some fixes to the superbuild CMake file from Core.
pick 960e036 Add default values for tolerance and number of threads in `kmeans_index`.
pick c37f20c Start writing the Python kmeans APIs in a separate file.
pick 87983b4 Set internal linkage to some utility functions.
pick 8e33ac4 Fix more duplicate symbol errors.
pick 43ef100 Add a kmeans predict function.
pick 115b8f2 Train the kmeans index in the Python wrapper.
pick c773e2a Use kmeans_fit in the ingestion code instead of sklearn.
pick 455ca20 Fix compile errors and a warning.
pick fcf88f3 More refactorings and use `array_to_matrix`.
pick f35f100 Fix errors in the ingestion.
pick 239a753 Improve a test and diagnostic output.
pick 66de269 Always use floats to train kmeans.
pick fc5c0cf Add more parameters to `kmeans_fit`.
pick 2879be9 Add a test that compares the results of sklearn's and our own kmeans implementation.
pick 94643ce Use kmeans_predict instead of sklearn. This removes the sklearn dependency for good.
pick 45f2852 Use common options across sklearn's and our kmeans implementations.
pick b307de5 Rename `kmeans++` to `k-means++` to match sklearn.
pick 584d548 Assert that the score of the our kmeans implementation is smaller than 1.5 times the score of sklearn's.
pick a7da424 fix transposed args in kmeans.cc -- add unit test [skip ci]
pick 8527303 Test both kmeans++ and random initialization.
pick 6575791 Fix formatting and delete commented code.
pick 34ddcb5 Make the kmeans test more deterministic.
pick 8769d04 Add back the asserts.
pick ef38b0b Add an opt-in switch to use sklearn's kmeans implementation.
Squashed from: #147
pick 697c481 Parameterize min heap with comparison function [skip ci]
pick e5a797a Debug zero cluster fix [skip ci]
pick d085f66 Uncomment debug statements [skip ci]
pick 6b07f17 Initial partition-equalization
pick 9867c90 Updates for kmeans and kmeans++
pick ff71e87 Small update
pick 9176a54 clean up warnings, clang-format
pick e5e3690 Add documentation, update unit tests
pick 76b2fe8 Replace std::abs<size_t> with std::labs
pick 28651e6 Supress std::labs warnings
pick f156dfa Small bug fix in predict
pick d8b0871 clang-format [skip ci]
pick ae5fca4 Add documentation, verify build
@ihnorton ihnorton force-pushed the lums/tmp/teo/kmeans branch from f114d0f to 52fa45d Compare November 14, 2023 14:14
@ihnorton ihnorton force-pushed the lums/tmp/teo/kmeans branch from 27cfab1 to 52fa45d Compare November 14, 2023 16:06
@ihnorton ihnorton merged commit bcfdaa1 into main Nov 14, 2023
8 checks passed
@ihnorton ihnorton deleted the lums/tmp/teo/kmeans branch November 14, 2023 16:22
ihnorton pushed a commit that referenced this pull request Nov 14, 2023
ihnorton pushed a commit that referenced this pull request Nov 14, 2023