Have consolidate_updates()
reuse existing centroids by default with an option to re-compute them
#177
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What
When a user provides information about how to create the centroids in
ingest()
and then later callsconsolidate_updates()
, we will not remember any of the information they provided and will instead revert back to the defaults on how to create centroids iningest()
.Here we add a
reuse_centroids
parameter toconsolidate_updates()
which defaults to true. If true, we will re-use the already computed centroids even after new updates to data. If false, we will re-compute the centroids from scratch, either using the defaults or the user-provided information (i.e. we make all the options about how to compute centroids iningest()
also arguments toconsolidate_updates()
).We also make
partitions
required along withcopy_centroids_uri
. This is because if we've updated the array and so thesize
is larger, we'll end up havingpartitions
be larger and thus try to read from invalid parts ofcopy_centroids_uri
, leading us to end up withnan
's. Example:Isaiah also suggested instead of having to pass
partitions
along withcopy_centroids_uri
, we could instead, under the hood, inspect the non-empty domain region ofcopy_centroids_uri
and compute partitions from that. This PR goes with the explicit approach instead as it doesn't seem too cumbersome and is easier. But happy to switch over if we'd like / we hear it's cumbersome to use like this.Example
Here is the following unit test:
Before you can see
centroids (4, 2)
is different each time we callconsolidate_updates()
:After these changes, you can see
centroids (4, 2)
is the same each time we callconsolidate_updates()
, except when we callconsolidate_updates(reuse_centroids=False)
: