lums/tmp/teo/kmeans #147
Merged
Squashed from #147
pick 24f5bb6 Default the shuffled ID and index types of `kmeans_index` to `size_t`.
pick 8fe7850 Enable the k-means initialization tests.
pick 23bf690 Support specifying the seed when creating a `kmeans_index`.
pick ce8745f Avoid randomly choosing the same centroid many times.
pick 0674917 Apply some fixes to the superbuild CMake file from Core.
pick 960e036 Add default values for tolerance and number of threads in `kmeans_index`.
pick c37f20c Start writing the Python kmeans APIs in a separate file.
pick 87983b4 Set internal linkage to some utility functions.
pick 8e33ac4 Fix more duplicate symbol errors.
pick 43ef100 Add a kmeans predict function.
pick 115b8f2 Train the kmeans index in the Python wrapper.
pick c773e2a Use kmeans_fit in the ingestion code instead of sklearn.
pick 455ca20 Fix compile errors and a warning.
pick fcf88f3 More refactorings and use `array_to_matrix`.
pick f35f100 Fix errors in the ingestion.
pick 239a753 Improve a test and diagnostic output.
pick 66de269 Always use floats to train kmeans.
pick fc5c0cf Add more parameters to `kmeans_fit`.
pick 2879be9 Add a test that compares the results of sklearn's and our own kmeans implementation.
pick 94643ce Use kmeans_predict instead of sklearn. This removes the sklearn dependency for good.
pick 45f2852 Use common options across sklearn's and our kmeans implementations.
pick b307de5 Rename `kmeans++` to `k-means++` to match sklearn.
pick 584d548 Assert that the score of the our kmeans implementation is smaller than 1.5 times the score of sklearn's.
pick a7da424 fix transposed args in kmeans.cc -- add unit test [skip ci]
pick 8527303 Test both kmeans++ and random initialization.
pick 6575791 Fix formatting and delete commented code.
pick 34ddcb5 Make the kmeans test more deterministic.
pick 8769d04 Add back the asserts.
pick ef38b0b Add an opt-in switch to use sklearn's kmeans implementation.
Squashed from: #147
pick 697c481 Parameterize min heap with comparison function [skip ci]
pick e5a797a Debug zero cluster fix [skip ci]
pick d085f66 Uncomment debug statements [skip ci]
pick 6b07f17 Initial partition-equalization
pick 9867c90 Updates for kmeans and kmeans++
pick ff71e87 Small update
pick 9176a54 clean up warnings, clang-format
pick e5e3690 Add documentation, update unit tests
pick 76b2fe8 Replace std::abs<size_t> with std::labs
pick 28651e6 Supress std::labs warnings
pick f156dfa Small bug fix in predict
pick d8b0871 clang-format [skip ci]
pick ae5fca4 Add documentation, verify build
ihnorton force-pushed the lums/tmp/teo/kmeans branch from f114d0f to 52fa45d on November 14, 2023 14:14
ihnorton approved these changes on Nov 14, 2023
ihnorton force-pushed the lums/tmp/teo/kmeans branch from 27cfab1 to 52fa45d on November 14, 2023 16:06
ihnorton pushed a commit that referenced this pull request on Nov 14, 2023: "Squashed from #147" (24f5bb6 through ef38b0b, listed above)
ihnorton pushed a commit that referenced this pull request on Nov 14, 2023: "Squashed from: #147" (697c481 through ae5fca4, listed above)
This PR manually reapplies the various commits in the diverged descendants of teo/kmeans that led to such a disaster in PR #130.

Updates since d085f66:

- `kmeans` now reassigns centroids that have small (including empty) partitions. The `kmeans_index` class has a new member variable, `reassign_ratio_`: partitions whose size is smaller than `reassign_ratio_` times the size of the largest partition are reassigned. The default value of `reassign_ratio_` is 0.025, so the ratio of largest to smallest partition is bounded by a factor of 40.
- `kmeans` now checks for convergence. At each iteration it measures how much the centroids have changed since the previous iteration and compares that to the total "potential" of all the centroids. If the delta is less than a tolerance times the total, the kmeans iteration breaks.
- `kmeans_index`
- `fixed_min_heap` now takes a comparison function object (to be able to create fixed max heaps, e.g.).
- `qv_partition_with_scores` provides a scores vector to be used in reassignment.

There are a few todos:

- `training_set` vectors in `kmeans` (Partitions are, well, partitioned among threads. Every thread loops over the entire `training_set` but only processes vectors from its designated set of partitions.)
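
The reassignment threshold and the convergence check described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the helper names `partitions_to_reassign` and `has_converged` are hypothetical, and measuring the centroid change as a sum of squared differences and the "potential" as the sum of squared centroid norms are assumptions, since the description does not say how either quantity is computed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): returns the indices of partitions
// whose size is smaller than reassign_ratio times the largest partition size.
std::vector<std::size_t> partitions_to_reassign(
    const std::vector<std::size_t>& sizes, double reassign_ratio = 0.025) {
  assert(reassign_ratio >= 0.0 && reassign_ratio <= 1.0);
  std::vector<std::size_t> result;
  if (sizes.empty()) {
    return result;
  }
  const std::size_t largest = *std::max_element(sizes.begin(), sizes.end());
  const double threshold = reassign_ratio * static_cast<double>(largest);
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    if (static_cast<double>(sizes[i]) < threshold) {
      result.push_back(i);  // small or empty partition
    }
  }
  return result;
}

using Centroids = std::vector<std::vector<float>>;

// Assumption: the "delta" is the sum of squared differences between the
// previous and the current centroids.
double squared_change(const Centroids& prev, const Centroids& next) {
  assert(prev.size() == next.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < prev.size(); ++i) {
    assert(prev[i].size() == next[i].size());
    for (std::size_t j = 0; j < prev[i].size(); ++j) {
      const double d = static_cast<double>(next[i][j]) - prev[i][j];
      sum += d * d;
    }
  }
  return sum;
}

// Assumption: the total "potential" is the sum of squared centroid norms.
double total_potential(const Centroids& centroids) {
  double sum = 0.0;
  for (const auto& c : centroids) {
    for (const float x : c) {
      sum += static_cast<double>(x) * x;
    }
  }
  return sum;
}

// Hypothetical convergence test: true when the change is less than
// tolerance times the total, at which point the iteration would break.
bool has_converged(
    const Centroids& prev, const Centroids& next, double tolerance) {
  return squared_change(prev, next) < tolerance * total_potential(next);
}
```

With the default ratio of 0.025 and a largest partition of 1000 vectors, the threshold is 25, so partitions of size 10 and 0 would be flagged while a partition of size 30 would not. In a training loop, `has_converged` would be called once per iteration with the centroids from before and after the update step.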