[Feature] WithoutLiersCV model selection #595

Open · wants to merge 2 commits into base: main
102 changes: 102 additions & 0 deletions sklego/model_selection.py
@@ -731,3 +731,105 @@ def _regroup(self, groups):
# create a mapper to set every group to the right group_id
mapper = dict(zip(df["index"], df["group"]))
return np.vectorize(mapper.get)(groups)
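For context, the `np.vectorize(mapper.get)` line in the `_regroup` tail above remaps every entry of `groups` through a dict; a minimal sketch with made-up values:

```python
import numpy as np

# Toy illustration (hypothetical values) of remapping group labels
# through a dict via np.vectorize, as done in _regroup above.
groups = np.array([10, 20, 20, 30])
mapper = {10: 0, 20: 1, 30: 2}  # original group value -> new group_id
remapped = np.vectorize(mapper.get)(groups)
# remapped -> array([0, 1, 1, 2])
```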


class WithoutLiersCV:
"""A custom cross-validation splitter that excludes data points labeled as anomalous from the training sets
produced by a base splitter.

The anomaly label is specified by the `anomalous_label` parameter. Data points with this label are excluded
from the training sets during the splitting process, but are kept in the test sets.

Parameters
----------
cv : CV Splitter instance
The base cross-validation splitter used to split the data. It must have a `split` method that returns
a generator of (train_index, test_index) tuples.
anomalous_label : int, default=-1
The label used to identify anomalous data points in the target labels.
Data points with this label will be excluded during the splitting process from the training sets.

Example
-------
```py
import numpy as np
from sklearn.model_selection import KFold
from sklego.model_selection import WithoutLiersCV

np.random.seed(1)
X = np.random.randn(100, 3)
y = (np.random.randn(100) > 1.5).astype(int)  # ~7% of the data is labeled as anomalous

cv = WithoutLiersCV(
cv=KFold(n_splits=3),
anomalous_label=1
)
```

> **koaning** (Owner) commented on lines +763 to +766:
>
> I think I'd want @MBrouns to weigh in on the name 😅 just to make sure.
>
> But I'm also wondering if it's perhaps easier for the end user to not require an anomalous label... wouldn't it perhaps be better to pass in an outlier model? This outlier model could then internally train on X and determine which items are outliers. Or am I overthinking?

> **Author** (Collaborator) replied:
>
> From the conversation in the issue, my understanding is slightly different. The goal of the CV is to validate anomaly detectors that do not train with different labels, namely the novelty detection ones. Therefore, passing a novelty detection model would not be possible in the first place.
>
> Now I agree that the name would suit both implementations 😁

> **Author** (Collaborator) added:
>
> @koaning Potentially we could have two CV strategies:
>
> - `WithoutLiersCV`: takes any outlier detection model, trains it on X, and excludes outliers from the train indexes
> - `NoveltyDetectorCV`: what's in this PR, able to train a novelty detection algorithm on non-anomalous labels and evaluate on both anomalous and non-anomalous samples.

```py
# (example continued)
for train_index, test_index in cv.split(X, y):
print(f"Train samples: {len(train_index)}", f"Test samples: {len(test_index)}", sep="\n")

'''
Train samples: 62
Test samples: 34
Train samples: 60
Test samples: 33
Train samples: 64
Test samples: 33
'''
```

Note
----
The `WithoutLiersCV` class is designed to work in conjunction with standard cross-validation techniques to
evaluate anomaly detection estimators that do not cope well with outliers in the training dataset.
Such estimators are typically trained on inliers only; therefore, the training set should not
contain any outliers
(cf. [Novelty Detection](https://scikit-learn.org/stable/modules/outlier_detection.html#novelty-detection)).
"""

def __init__(self, cv, anomalous_label=-1):
self.cv = cv
self.anomalous_label = anomalous_label

def split(self, X, y, groups=None):
"""Generate indices to split data into training and test set, excluding data points with the specified
anomalous label from the training set.

Parameters
----------
X : array-like of shape (n_samples, n_features)
Training data to be split.
y : array-like of shape (n_samples,)
The target variable with anomalous labels to be excluded from the training set.
groups : array-like | None, default=None
Group labels for the samples, used for group-based cross-validation.

Yields
------
train_index : array
An array of indices representing the training set without anomalous data points.
test_index : array
An array of indices representing the test set, including anomalous data points.
"""
for train_index, test_index in self.cv.split(X, y, groups):
# positions of non-anomalous samples *within* this fold's train_index
inlier_index = np.where(y[train_index] != self.anomalous_label)[0]
yield train_index[inlier_index], test_index
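To make the double indexing in `split` concrete: `np.where` returns positions within `y[train_index]`, and indexing `train_index` with those positions maps back to absolute sample indices. A toy walk-through with hypothetical values:

```python
import numpy as np

y = np.array([0, -1, 0, 0, -1, 0])                # -1 marks anomalies
train_index = np.array([1, 2, 4, 5])              # raw train indices from the base CV
inlier_index = np.where(y[train_index] != -1)[0]  # positions *within* train_index: [1, 3]
filtered = train_index[inlier_index]              # absolute indices of inliers: [2, 5]
```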

def get_n_splits(self, X=None, y=None, groups=None):
"""Return the number of splitting iterations in the cross-validator.

Parameters
----------
X : array-like | None, default=None
The input data to be split. May be ignored by some cross-validation methods.
y : array-like | None, default=None
The target labels corresponding to the input data. May be ignored by some cross-validation methods.
groups : array-like | None, default=None
Group labels for the samples. May be ignored by some cross-validation methods.

Returns
-------
n_splits : int
The number of splitting iterations in the cross-validator.
"""
return self.cv.get_n_splits(X, y, groups)
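The review thread above floats an outlier-model-based variant that drops the need for an anomalous label. A hypothetical sketch of that idea (the name `OutlierModelCV` and its API are assumptions, not part of this PR):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import KFold

class OutlierModelCV:
    """Hypothetical variant: fits an outlier model on each fold's training
    data and drops the points it flags, instead of relying on labels."""

    def __init__(self, cv, outlier_model):
        self.cv = cv
        self.outlier_model = outlier_model

    def split(self, X, y=None, groups=None):
        for train_index, test_index in self.cv.split(X, y, groups):
            # fit_predict returns +1 for inliers and -1 for flagged outliers
            flags = self.outlier_model.fit_predict(X[train_index])
            yield train_index[flags != -1], test_index

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.cv.get_n_splits(X, y, groups)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
cv = OutlierModelCV(KFold(3), IsolationForest(random_state=0))
sizes = [len(train) for train, test in cv.split(X)]  # each <= 40 (outliers dropped)
```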
28 changes: 28 additions & 0 deletions tests/test_model_selection/test_withoutlierscv.py
@@ -0,0 +1,28 @@
import numpy as np
import pytest
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, KFold, StratifiedKFold

from sklego.model_selection import WithoutLiersCV


@pytest.mark.parametrize(
"cv_strategy", [KFold(2), KFold(3, shuffle=True), StratifiedKFold(5), GroupKFold(2), GroupShuffleSplit(3)]
)
@pytest.mark.parametrize("anomalous_label", [-1, 1])
def test_split_without_anomalies(cv_strategy, anomalous_label):
size = 1000

X = np.random.randn(size, 3)
y = (np.random.randn(size) > 1.5).astype(int)
groups = np.random.randint(0, 10, size)

y[y == 1] = anomalous_label

cv = WithoutLiersCV(cv_strategy, anomalous_label=anomalous_label)

for inlier_index, test_index in cv.split(X, y, groups):
y_train = y[inlier_index]
assert np.all(y_train != anomalous_label)

assert cv.get_n_splits(X, y, groups) == cv_strategy.get_n_splits(X, y, groups)
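Beyond the index-level checks above, an end-to-end sketch of the workflow the docstring Note describes: train a novelty detector (`OneClassSVM` here, chosen as an illustrative assumption) on inlier-only folds and score it on full test folds. The splitter is re-declared inline so the sketch runs standalone without this PR installed:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import OneClassSVM

# Inline re-declaration of the PR's splitter so the sketch is self-contained.
class WithoutLiersCV:
    def __init__(self, cv, anomalous_label=-1):
        self.cv = cv
        self.anomalous_label = anomalous_label

    def split(self, X, y, groups=None):
        for train_index, test_index in self.cv.split(X, y, groups):
            inlier_index = np.where(y[train_index] != self.anomalous_label)[0]
            yield train_index[inlier_index], test_index

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = np.where(rng.random(300) < 0.07, -1, 1)  # ~7% anomalies, labeled -1

fold_scores = []
for train_idx, test_idx in WithoutLiersCV(KFold(3), anomalous_label=-1).split(X, y):
    model = OneClassSVM(nu=0.1).fit(X[train_idx])  # trained on inliers only
    pred = model.predict(X[test_idx])              # +1 inlier / -1 outlier
    fold_scores.append(float(np.mean(pred == y[test_idx])))
```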