Some robustifying fixes on Atol fit and tests for vectorizers #1096

Merged
14 commits merged on Jul 29, 2024
22 changes: 15 additions & 7 deletions src/python/gudhi/representations/vector_methods.py
@@ -807,20 +807,28 @@ def fit(self, X, y=None, sample_weight=None):
            self
        """
        if not hasattr(self.quantiser, 'fit'):
            raise TypeError("quantiser %s has no `fit` attribute." % (self.quantiser))

        # In fitting we remove infinite death time points so that every center is finite
        X = [dgm[~np.isinf(dgm).any(axis=1), :] for dgm in X]
            raise TypeError(f"quantiser {self.quantiser} has no `fit` attribute.")

        if sample_weight is None:
            sample_weight = [self.get_weighting_method()(measure) for measure in X]

        measures_concat = np.concatenate(X)
        weights_concat = np.concatenate(sample_weight)

        self.quantiser.fit(X=measures_concat, sample_weight=weights_concat)

        # In fitting we remove infinite birth/death time points so that every center is finite
        filtered_measures_concat = measures_concat[~np.isinf(measures_concat).any(axis=1), :] if len(measures_concat) else measures_concat
        filtered_weights_concat = weights_concat[~np.isinf(measures_concat).any(axis=1)] if len(measures_concat) else weights_concat

        n_clusters = self.quantiser.n_clusters
        n_points = len(filtered_measures_concat)
        if n_points < n_clusters:
            # If not enough points to fit (including 0), let's arbitrarily put centers in [0, 1)^2
Member:

In #1062, for Landscapes, the fallback is a "useless" grid [-inf, -inf, -inf, ...] for which transform will produce only zeros. Here, instead of picking points far enough that they don't matter, you try to put them in a place where at least for some datasets they might be useful. Both approaches look ok to me (they both also print a message); I just wanted to point out the inconsistency (that won't block the PR).

Collaborator Author:

So the latest commits from me are now consistent with this behaviour: if some centers cannot be produced by "fitting some data", they will be assigned to [-np.inf, -np.inf, ...] and they transform anything into 0.
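A minimal sketch (not the GUDHI implementation) of why a center at [-inf, -inf] contributes 0: with a Laplacian-style contrast exp(-||x - c|| / inertia), every finite point is infinitely far from such a center, so its feature is exp(-inf) = 0. The function name and inertia value below are illustrative.

import numpy as np

def laplacian_contrast(points, center, inertia=1.0):
    # Distance from each point to the center; infinite when the center is at -inf.
    dists = np.linalg.norm(points - center, axis=1)
    return np.exp(-dists / inertia)

points = np.array([[0., 5.], [1., 2.]])
print(laplacian_contrast(points, np.array([0., 4.])))            # positive values
print(laplacian_contrast(points, np.array([-np.inf, -np.inf])))  # [0. 0.]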

print(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding random points in [0, 1)^2.")
Member:

Wait, is Atol restricted to 2D? I thought it wasn't restricted to diagrams and could work in any dimension. Though if we fit on no points, I don't know how to guess a dimension for the random centers.

Collaborator Author:

It is not, this is a hard one!

Member:

There is still the possibility of giving up (i.e. raise an exception if there are no points at all).

Collaborator Author:

I have now favored and implemented this approach!
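A minimal sketch of this "give up" behaviour, under the assumption that fitting is refused when no finite point is available; the helper name and error message are illustrative, not taken from the merged code.

import numpy as np

def require_finite_points(measures_concat):
    # Keep only rows with no infinite coordinate; refuse to fit on an empty result.
    finite = measures_concat[~np.isinf(measures_concat).any(axis=1)]
    if len(finite) == 0:
        raise ValueError("[Atol] cannot fit: no finite points were provided.")
    return finite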

            filtered_weights_concat = np.concatenate((filtered_weights_concat, np.ones(shape=(n_clusters - n_points))))
            filtered_measures_concat = np.concatenate((filtered_measures_concat, np.random.random((n_clusters - n_points, 2))))

        self.quantiser.fit(X=filtered_measures_concat, sample_weight=filtered_weights_concat)
Member:

When the number of points equals the number of requested clusters, are clustering algorithms usually good at finding all of the points as centers? Another possibility, in the "too few points" case, would be to write to self.centers directly and skip the use of the quantiser.
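An illustrative sketch of that alternative (the helper below is hypothetical, not the PR code): when there are no more points than requested clusters, skip the quantiser and take the points themselves as centers.

import numpy as np
from sklearn.cluster import KMeans

def centers_from_points(points, n_clusters):
    points = np.asarray(points, dtype=float)
    if len(points) <= n_clusters:
        # Degenerate case: every point becomes a center, no clustering needed.
        return points.copy()
    quantiser = KMeans(n_clusters=n_clusters, n_init="auto").fit(points)
    return quantiser.cluster_centers_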

Collaborator Author (@martinroyer, Jun 28, 2024):

First of all thanks for all the great remarks.
I think on this specific question they are OK at that, yes. And while I like your suggestion, if there are only, say, two points and we want five clusters, it is probably highly valuable to take those two points as cluster centers and do whatever for the other three.
So I am still thinking, but sort of converging toward doing the following:

  • filter more with np.unique (there is an overhead, because all we really need to know is how many unique points, up to n_clusters, there are in the target set)
  • learn what we can on those unique points and fill in the rest of the centers with [-np.inf, -np.inf] (that would be closer to Landscapes), write the inertias for these centers as 0, and hopefully that will not raise any errors at transform time.

Will work on that! (A rough sketch of this plan follows below.)
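A rough, self-contained sketch of the plan above (illustrative only, not the final PR code; the function name and the placeholder inertia values are made up for this example).

import numpy as np
from sklearn.cluster import KMeans

def fit_centers_with_padding(points, n_clusters):
    # Keep only finite points, then deduplicate them.
    finite = points[~np.isinf(points).any(axis=1)]
    unique_points = np.unique(finite, axis=0) if len(finite) else finite
    if len(unique_points) >= n_clusters:
        quantiser = KMeans(n_clusters=n_clusters, n_init="auto").fit(unique_points)
        return quantiser.cluster_centers_, np.ones(n_clusters)  # placeholder inertias
    # Too few distinct points: use them all as centers, pad the rest with
    # [-inf, -inf], and give the padded centers a zero inertia.
    n_missing = n_clusters - len(unique_points)
    centers = np.concatenate([unique_points, np.full((n_missing, 2), -np.inf)])
    inertias = np.concatenate([np.ones(len(unique_points)), np.zeros(n_missing)])
    return centers, inertias

# Example: two distinct finite points, five requested clusters.
pts = np.array([[0., 1.], [0., 1.], [2., 3.], [0., np.inf]])
centers, inertias = fit_centers_with_padding(pts, n_clusters=5)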

Member:

  • filter more with np.unique (there is an overhead, because all we really need to know is how many unique points, up to n_clusters, there are in the target set)

Yeah, I don't think there is a function that begins sorting or hash-tabling and stops once it has found k distinct elements... I think the common case will be that the first n_clusters points are distinct; maybe you could first check that with np.unique (does it actually work on points?) and only go for something slower in other cases?

Actually, do we really care? Having a duplicated feature is not much worse than having a feature that is always 0, and this is only for a degenerate case that really shouldn't happen anyway, so I don't think we should make too much effort beyond "warn and don't crash". I only mentioned the possibility of duplicate points in passing, but it does not bother me.

  • fill in the rest of the centers with [-np.inf, -np.inf] (that would be closer to Landscapes), write the inertias for these centers as 0, and hopefully that will not raise any errors at transform time.

As you like. Unless someone else chimes in, I don't really care what points you use as filler.
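As an aside on the question above (standard NumPy behaviour, not PR code): np.unique does deduplicate 2D arrays of points when given axis=0.

import numpy as np

pts = np.array([[0., 1.], [0., 1.], [2., 3.]])
print(np.unique(pts, axis=0))       # [[0. 1.] [2. 3.]]
print(len(np.unique(pts, axis=0)))  # 2 distinct points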

Collaborator Author:

I agree that we do not really care, so I have dropped this!

        self.centers = self.quantiser.cluster_centers_

        # Hack, but some people are unhappy if the order depends on the version of sklearn
        self.centers = self.centers[np.lexsort(self.centers.T)]
        if self.quantiser.n_clusters == 1:
90 changes: 90 additions & 0 deletions src/python/test/test_representations_interface.py
@@ -0,0 +1,90 @@
from copy import deepcopy
import numpy as np

from sklearn.cluster import KMeans

from gudhi.representations import (Atol, Landscape, Silhouette, BettiCurve, ComplexPolynomial, \
    TopologicalVector, PersistenceImage, Entropy)

vectorizers = {
    "atol": Atol(quantiser=KMeans(n_clusters=2, random_state=202312, n_init="auto")),
    # "betti": BettiCurve(),
}

diag1 = [np.array([[0., np.inf],
                   [0., 8.94427191],
                   [0., 7.28010989],
                   [0., 6.08276253],
                   [0., 5.83095189],
                   [0., 5.38516481],
                   [0., 5.]]),
         np.array([[11., np.inf],
                   [6.32455532, 6.70820393]]),
         np.empty(shape=[0, 2])]

diag2 = [np.array([[0., np.inf],
                   [0., 8.94427191],
                   [0., 7.28010989],
                   [0., 6.08276253],
                   [0., 5.83095189],
                   [0., 5.38516481],
                   [0., 5.]]),
         np.array([[11., np.inf],
                   [6.32455532, 6.70820393]]),
         np.array([[0., np.inf],
                   [0., 1]])]

diag3 = [np.empty(shape=[0, 2])]


def test_fit():
    print(f" > Testing `fit`.")
    for name, vectorizer in vectorizers.items():
        print(f" >> Testing {name}")
        deepcopy(vectorizer).fit(X=[diag1[0], diag2[0]])


def test_fit_empty():
    print(f" > Testing `fit_empty`.")
    for name, vectorizer in vectorizers.items():
        print(f" >> Testing {name}")
        deepcopy(vectorizer).fit(X=[diag3[0], diag3[0]])


def test_transform():
    print(f" > Testing `transform`.")
    for name, vectorizer in vectorizers.items():
        print(f" >> Testing {name}")
        deepcopy(vectorizer).fit_transform(X=[diag1[0], diag2[0], diag3[0]])


def test_transform_empty():
    print(f" > Testing `transform_empty`.")
    for name, vectorizer in vectorizers.items():
        print(f" >> Testing {name}")
        copy_vec = deepcopy(vectorizer).fit(X=[diag1[0], diag2[0]])
        copy_vec.transform(X=[diag3[0], diag3[0]])


def test_set_output():
    print(f" > Testing `set_output`.")
    try:
        import pandas
        for name, vectorizer in vectorizers.items():
            print(f" >> Testing {name}")
            deepcopy(vectorizer).set_output(transform="pandas")
    except ImportError:
        print("Missing pandas, skipping set_output test")


def test_compose():
    print(f" > Testing composition with `sklearn.compose.ColumnTransformer`.")
    from sklearn.compose import ColumnTransformer
    for name, vectorizer in vectorizers.items():
        print(f" >> Testing {name}")
        ct = ColumnTransformer([
            (f"{name}-0", deepcopy(vectorizer), 0),
            (f"{name}-1", deepcopy(vectorizer), 1),
            (f"{name}-2", deepcopy(vectorizer), 2)]
        )
        ct.fit_transform(X=[diag1, diag2])