Some robustifying fixes on Atol fit and tests for vectorizers #1096
Changes from 4 commits
```diff
@@ -807,20 +807,28 @@ def fit(self, X, y=None, sample_weight=None):
             self
         """
         if not hasattr(self.quantiser, 'fit'):
-            raise TypeError("quantiser %s has no `fit` attribute." % (self.quantiser))
-
-        # In fitting we remove infinite death time points so that every center is finite
-        X = [dgm[~np.isinf(dgm).any(axis=1), :] for dgm in X]
+            raise TypeError(f"quantiser {self.quantiser} has no `fit` attribute.")
 
         if sample_weight is None:
             sample_weight = [self.get_weighting_method()(measure) for measure in X]
 
         measures_concat = np.concatenate(X)
         weights_concat = np.concatenate(sample_weight)
 
-        self.quantiser.fit(X=measures_concat, sample_weight=weights_concat)
+        # In fitting we remove infinite birth/death time points so that every center is finite
+        filtered_measures_concat = measures_concat[~np.isinf(measures_concat).any(axis=1), :] if len(measures_concat) else measures_concat
+        filtered_weights_concat = weights_concat[~np.isinf(measures_concat).any(axis=1)] if len(measures_concat) else weights_concat
+
+        n_clusters = self.quantiser.n_clusters
+        n_points = len(filtered_measures_concat)
+        if n_points < n_clusters:
+            # If not enough points to fit (including 0), let's arbitrarily put centers in [0, 1)^2
+            print(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding random points in [0, 1)^2.")
```

Review thread on the fallback above (resolved by martinroyer):

> Wait, is Atol restricted to 2D? I thought it wasn't restricted to diagrams and could work in any dimension. Though if we fit on no points, I don't know how to guess a dimension for the random centers.

> It is not, this is a hard one!

> There is still the possibility of giving up (i.e. raise an exception if there are no points at all).

> I have now favored and implemented this approach!
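For concreteness, the "give up" option discussed in this thread could look like the sketch below; the helper name and the error message are hypothetical, and the exact code adopted in the later commits may differ.

```python
import numpy as np

def check_fittable(points):
    # Hypothetical helper (not the PR's actual code) sketching the "raise if
    # no points" option: with zero finite points there is no way to guess the
    # ambient dimension for filler centers, so fitting refuses explicitly.
    if len(points) == 0:
        raise ValueError("[Atol] cannot fit quantiser: no finite points in input.")

try:
    check_fittable(np.empty(shape=(0, 2)))  # an entirely empty diagram
except ValueError as e:
    print(e)
```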
The hunk continues with the filler points and the new quantiser call:

```diff
+            filtered_weights_concat = np.concatenate((filtered_weights_concat, np.ones(shape=(n_clusters - n_points))))
+            filtered_measures_concat = np.concatenate((filtered_measures_concat, np.random.random((n_clusters - n_points, 2))))
+
+        self.quantiser.fit(X=filtered_measures_concat, sample_weight=filtered_weights_concat)
```

Review thread on the quantiser call:

> When the number of points equals the number of requested clusters, are clustering algorithms usually good at finding all of the points as centers? Another possibility, in the "too few points" case, would be to write to self.centers directly and skip the use of the quantiser.

> First of all thanks for all the great remarks. Will work on that!

> Yeah, I don't think there is a function that begins sorting or hash-tabling and stops once it has found k distinct elements... I think the common case will be that the first […]

> Actually, do we really care? Having a duplicated feature is not much worse than having a feature that is always 0, and this is only for a degenerate case that really shouldn't happen anyway, so I don't think we should make too much effort beyond "warn and don't crash". I only mentioned the possibility of duplicate points in passing, but it does not bother me. As you like. Unless someone else chimes in, I don't really care what points you use as filler.

> I agree that we do not really care, so I have dropped this!
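The "write to self.centers directly" alternative floated here could be sketched as follows; the function name and the random filler are hypothetical, just to make the idea concrete.

```python
import numpy as np

def centers_for_few_points(points, n_clusters):
    # Hypothetical sketch of the reviewer's alternative: with at most
    # n_clusters points, the points themselves are the obvious centers and
    # the clustering step can be skipped; missing centers get arbitrary
    # filler in [0, 1)^d, matching the dimension of the given points.
    filler = np.random.random((n_clusters - len(points), points.shape[1]))
    return np.concatenate((points, filler))

print(centers_for_few_points(np.array([[0., 1.]]), n_clusters=2))
```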
The remainder of the hunk is unchanged context:

```diff
 
         self.centers = self.quantiser.cluster_centers_
 
         # Hack, but some people are unhappy if the order depends on the version of sklearn
         self.centers = self.centers[np.lexsort(self.centers.T)]
         if self.quantiser.n_clusters == 1:
```
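A minimal run exercising the new padding branch, assuming a gudhi build that includes this diff at its 4-commit state (before the later switch to -np.inf filler discussed at the end of the review) and scikit-learn >= 1.2 for `n_init="auto"`:

```python
import numpy as np
from sklearn.cluster import KMeans
from gudhi.representations import Atol

# Two clusters are requested but only one finite point survives the infinity
# filter, so fit() takes the padding branch, prints the "[Atol] had 1 points
# to fit 2 clusters" message, and still produces two finite centers.
atol = Atol(quantiser=KMeans(n_clusters=2, n_init="auto"))
atol.fit(X=[np.array([[0., 1.], [0., np.inf]])])
print(atol.centers)  # two finite centers, one of them essentially random
```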
The second file in the diff is a new test module for the vectorizers (`@@ -0,0 +1,90 @@`):

```python
from copy import deepcopy

import numpy as np

from sklearn.cluster import KMeans

from gudhi.representations import (Atol, Landscape, Silhouette, BettiCurve, ComplexPolynomial,
                                   TopologicalVector, PersistenceImage, Entropy)

vectorizers = {
    "atol": Atol(quantiser=KMeans(n_clusters=2, random_state=202312, n_init="auto")),
    # "betti": BettiCurve(),
}

diag1 = [np.array([[0., np.inf],
                   [0., 8.94427191],
                   [0., 7.28010989],
                   [0., 6.08276253],
                   [0., 5.83095189],
                   [0., 5.38516481],
                   [0., 5.]]),
         np.array([[11., np.inf],
                   [6.32455532, 6.70820393]]),
         np.empty(shape=[0, 2])]

diag2 = [np.array([[0., np.inf],
                   [0., 8.94427191],
                   [0., 7.28010989],
                   [0., 6.08276253],
                   [0., 5.83095189],
                   [0., 5.38516481],
                   [0., 5.]]),
         np.array([[11., np.inf],
                   [6.32455532, 6.70820393]]),
         np.array([[0., np.inf],
                   [0., 1]])]

diag3 = [np.empty(shape=[0, 2])]


def test_fit():
    print(" > Testing `fit`.")
    for name, vectorizer in vectorizers.items():
        print(f" >> Testing {name}")
        deepcopy(vectorizer).fit(X=[diag1[0], diag2[0]])


def test_fit_empty():
    print(" > Testing `fit_empty`.")
    for name, vectorizer in vectorizers.items():
        print(f" >> Testing {name}")
        deepcopy(vectorizer).fit(X=[diag3[0], diag3[0]])


def test_transform():
    print(" > Testing `transform`.")
    for name, vectorizer in vectorizers.items():
        print(f" >> Testing {name}")
        deepcopy(vectorizer).fit_transform(X=[diag1[0], diag2[0], diag3[0]])


def test_transform_empty():
    print(" > Testing `transform_empty`.")
    for name, vectorizer in vectorizers.items():
        print(f" >> Testing {name}")
        copy_vec = deepcopy(vectorizer).fit(X=[diag1[0], diag2[0]])
        copy_vec.transform(X=[diag3[0], diag3[0]])


def test_set_output():
    print(" > Testing `set_output`.")
    try:
        import pandas
        for name, vectorizer in vectorizers.items():
            print(f" >> Testing {name}")
            deepcopy(vectorizer).set_output(transform="pandas")
    except ImportError:
        print("Missing pandas, skipping set_output test")


def test_compose():
    print(" > Testing composition with `sklearn.compose.ColumnTransformer`.")
    from sklearn.compose import ColumnTransformer
    for name, vectorizer in vectorizers.items():
        print(f" >> Testing {name}")
        ct = ColumnTransformer([
            (f"{name}-0", deepcopy(vectorizer), 0),
            (f"{name}-1", deepcopy(vectorizer), 1),
            (f"{name}-2", deepcopy(vectorizer), 2)])
        ct.fit_transform(X=[diag1, diag2])
```
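Since the module is plain pytest-style functions, it can also be made runnable as a script while iterating. A convenience sketch, not part of the PR, appended to the file above:

```python
# Convenience sketch (not in the PR): run all the checks as a plain script
# without a pytest installation, relying on the functions defined above.
if __name__ == "__main__":
    test_fit()
    test_fit_empty()
    test_transform()
    test_transform_empty()
    test_set_output()
    test_compose()
```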
Review thread on the fallback behaviour:

> In #1062, for Landscapes, the fallback is a "useless" grid [-inf, -inf, -inf, ...] for which transform will produce only zeros. Here, instead of picking points far enough that they don't matter, you try to put them in a place where at least for some datasets they might be useful. Both approaches look ok to me (they both also print a message), I just wanted to point out the inconsistency (that won't block the PR).

> So the latest commits from me are now consistent with this behaviour: if some centers cannot be produced by "fitting some data", they will be assigned to [-np.inf, -np.inf, ...] and they transform anything into 0.
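Why the [-np.inf, ...] convention transforms everything into 0: Atol-style features are a decreasing kernel of the distance to each center (a Laplacian-like kernel is used below for illustration; the exact contrast function in gudhi may differ), and the distance from any finite point to an all -inf center is infinite.

```python
import numpy as np

# The distance from a finite point to a [-inf, -inf] center is infinite,
# so a decreasing kernel of that distance evaluates to exactly 0.
center = np.array([-np.inf, -np.inf])
point = np.array([0.3, 0.7])
distance = np.linalg.norm(point - center)  # inf
print(np.exp(-distance))                   # 0.0
```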