C++ k-means implementation #130

Closed
wants to merge 44 commits into from

teo-tsirpanis
Member

@teo-tsirpanis teo-tsirpanis changed the title C++ K-means implementation C++ k-means implementation Aug 24, 2023
@teo-tsirpanis teo-tsirpanis marked this pull request as ready for review September 4, 2023 21:00
@teo-tsirpanis
Member Author

Random initialization:
sklearn score: 6.619185447692871
tiledb score: 6.346259593963623
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 5.087535381317139

Random initialization:
sklearn score: 6.619184494018555
tiledb score: 6.804962635040283
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 4.960258483886719

Random initialization:
sklearn score: 6.619184494018555
tiledb score: 6.795690536499023
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 4.670709133148193

@lums658
Contributor

lums658 commented Sep 15, 2023

Random initialization:
sklearn score: 6.619185447692871
tiledb score: 6.346259593963623
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 5.087535381317139

Random initialization:
sklearn score: 6.619184494018555
tiledb score: 6.804962635040283
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 4.960258483886719

Random initialization:
sklearn score: 6.619184494018555
tiledb score: 6.795690536499023
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 4.670709133148193

Interesting: the sklearn kmeans++ score looks deterministic (the same on every run), whereas the tiledb score varies quite a bit. I'll take a look.

@teo-tsirpanis
Member Author

Random initialization:
sklearn score: 6.619184494018555
tiledb score: 6.8692545890808105
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 5.078083515167236

Random initialization:
sklearn score: 6.619184494018555
tiledb score: 6.6323323249816895
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 4.911060810089111

Random initialization:
sklearn score: 6.619185447692871
tiledb score: 6.7284345626831055
K-means++:
sklearn score: 3.9469563961029053
tiledb score: 4.608294486999512

@lums658
Contributor

lums658 commented Oct 6, 2023

Updates:

  • Added code in kmeans to reassign centroids that have small (including empty) partitions. The kmeans_index class now has a member variable reassign_ratio_; the centroid of any partition whose size is smaller than reassign_ratio_ times the size of the largest partition gets reassigned. The default value of reassign_ratio_ is 0.025, so the ratio of the largest to the smallest partition is bounded by a factor of 40.
    • For each centroid c_i to be reassigned, we choose the vector in the training set with the i'th largest score (i.e., the vector farthest away from its own centroid).
    • This helps quite a bit in the case of random initialization. It does not help as much in the case of kmeans++ initialization because the centroids are already well-distributed.
    • Also tried reassignment with a random vector, but that did not work as well as reassigning to the most distant vector.
  • Added code in kmeans to check for convergence. At each iteration it measures how much the centroids have changed since the previous iteration and compares that to the total "potential" of all the centroids. If the delta is less than a tolerance times the total, the kmeans iteration breaks.
  • Created numerous unit tests for various components of the library used by kmeans_index
  • Updated fixed_min_heap to take a comparison function object (to be able to create fixed max heaps, e.g.)
  • Created qv_partition_with_scores to provide a scores vector to be used in reassignment.
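
As a rough illustration of the reassignment and convergence steps described above, here is a minimal sketch. It is not the code from this PR: the names `reassign_small_partitions` and `has_converged`, the `Vec` alias, and all signatures are hypothetical. It uses a full sort where the PR uses a fixed-size heap, and it assumes that the "potential" of the centroids means the sum of their squared norms, which the description does not spell out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Vec = std::vector<float>;  // hypothetical stand-in for a feature vector

// Hypothetical helper: replaces the centroid of every partition smaller than
// reassign_ratio * (size of the largest partition). scores[j] is the distance
// from training vector j to its own centroid. The i-th reassigned centroid
// takes the training vector with the i-th largest score.
// Returns the indices of the centroids that were replaced.
std::vector<std::size_t> reassign_small_partitions(
    std::vector<Vec>& centroids,
    const std::vector<Vec>& training_set,
    const std::vector<std::size_t>& partition_sizes,
    const std::vector<float>& scores,
    double reassign_ratio = 0.025) {
  assert(!partition_sizes.empty());
  assert(centroids.size() == partition_sizes.size());
  assert(training_set.size() == scores.size());

  const std::size_t largest =
      *std::max_element(partition_sizes.begin(), partition_sizes.end());
  const double threshold = reassign_ratio * static_cast<double>(largest);

  // Indices of the training vectors, highest score first.
  std::vector<std::size_t> by_score(scores.size());
  std::iota(by_score.begin(), by_score.end(), std::size_t{0});
  std::sort(by_score.begin(), by_score.end(),
            [&](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });

  std::vector<std::size_t> reassigned;
  std::size_t next = 0;  // position in by_score of the next replacement vector
  for (std::size_t c = 0; c < centroids.size(); ++c) {
    if (static_cast<double>(partition_sizes[c]) < threshold &&
        next < by_score.size()) {
      centroids[c] = training_set[by_score[next++]];
      reassigned.push_back(c);
    }
  }
  return reassigned;
}

// Hypothetical convergence test: true when the total squared movement of the
// centroids is below tol times the sum of squared norms of the new centroids
// (an assumed reading of "potential").
bool has_converged(const std::vector<Vec>& old_centroids,
                   const std::vector<Vec>& new_centroids,
                   double tol) {
  assert(old_centroids.size() == new_centroids.size());
  double delta = 0.0;
  double total = 0.0;
  for (std::size_t c = 0; c < new_centroids.size(); ++c) {
    assert(old_centroids[c].size() == new_centroids[c].size());
    for (std::size_t d = 0; d < new_centroids[c].size(); ++d) {
      const double now = new_centroids[c][d];
      const double diff = now - old_centroids[c][d];
      delta += diff * diff;
      total += now * now;
    }
  }
  return delta < tol * total;
}
```

With the default ratio of 0.025 and a largest partition of 100 vectors, the threshold is 2.5, so only partitions holding 0, 1, or 2 vectors have their centroids replaced.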

There are a few todos:

  • Parallelize the loop over all the training_set vectors in kmeans (Partitions are, well, partitioned among threads. Every thread loops over the entire training_set but only processes vectors from its designated set of partitions.)
  • Implement greedy kmeans++ algorithm, which uses multiple candidate centroids at each iteration and chooses the one that most decreases the potential. This approach is what scikit-learn uses and is reported in the literature to be better than plain kmeans++.
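
The greedy kmeans++ todo could look roughly like the sketch below. It is only an illustration under stated assumptions: `greedy_kmeanspp`, `sq_dist`, the `Vec` alias, and the `num_trials` parameter are hypothetical names, not part of this PR or of scikit-learn. The sketch takes the potential to be the sum of squared distances from each vector to its nearest chosen centroid, and it assumes the data holds at least k distinct vectors so the sampling weights never all reach zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <random>
#include <utility>
#include <vector>

using Vec = std::vector<float>;  // hypothetical stand-in for a feature vector

// Squared Euclidean distance between two vectors of equal dimension.
double sq_dist(const Vec& a, const Vec& b) {
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t d = 0; d < a.size(); ++d) {
    const double diff = static_cast<double>(a[d]) - static_cast<double>(b[d]);
    sum += diff * diff;
  }
  return sum;
}

// Hypothetical greedy kmeans++ seeding. Returns the indices into `data` of
// the k vectors chosen as initial centroids. At every step it draws
// `num_trials` candidates with probability proportional to the squared
// distance to the nearest centroid chosen so far, then keeps the candidate
// that yields the lowest potential.
std::vector<std::size_t> greedy_kmeanspp(const std::vector<Vec>& data,
                                         std::size_t k,
                                         std::size_t num_trials,
                                         std::mt19937& rng) {
  assert(k >= 1 && k <= data.size() && num_trials >= 1);
  const std::size_t n = data.size();

  // The first centroid is drawn uniformly at random.
  std::uniform_int_distribution<std::size_t> uniform(0, n - 1);
  std::vector<std::size_t> chosen{uniform(rng)};

  // closest[i]: squared distance from data[i] to its nearest chosen centroid.
  std::vector<double> closest(n);
  for (std::size_t i = 0; i < n; ++i) {
    closest[i] = sq_dist(data[i], data[chosen[0]]);
  }

  while (chosen.size() < k) {
    std::discrete_distribution<std::size_t> pick(closest.begin(), closest.end());
    double best_potential = std::numeric_limits<double>::infinity();
    std::size_t best = 0;
    std::vector<double> best_closest;

    for (std::size_t t = 0; t < num_trials; ++t) {
      const std::size_t candidate = pick(rng);
      // Potential if `candidate` were added as the next centroid.
      std::vector<double> trial(n);
      double potential = 0.0;
      for (std::size_t i = 0; i < n; ++i) {
        trial[i] = std::min(closest[i], sq_dist(data[i], data[candidate]));
        potential += trial[i];
      }
      if (potential < best_potential) {
        best_potential = potential;
        best = candidate;
        best_closest = std::move(trial);
      }
    }
    chosen.push_back(best);
    closest = std::move(best_closest);
  }
  return chosen;
}
```

Setting `num_trials` to 1 reduces this to plain kmeans++ seeding, which makes the two variants easy to compare on the same training set and seed.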

@lums658 lums658 mentioned this pull request Oct 6, 2023
@teo-tsirpanis
Member Author

Closing, #147 was merged.

@teo-tsirpanis teo-tsirpanis deleted the teo/kmeans branch November 14, 2023 17:10