
Early stopping overrides user defined X_val and y_val #363

Open · KeerthiNingegowda opened this issue Jan 2, 2025 · 4 comments

Comments

@KeerthiNingegowda
Great package! Thank you for all the contributions.

In partial_fit, user-supplied validation datasets are overridden. This is not transparent, and it is not appropriate in typical backtesting cases, where the validation set should be a user-chosen (e.g., chronologically later) slice rather than a random split.
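A minimal sketch of the behavior I am seeing (assuming the current NGBRegressor API; the data and the split here are only illustrative):

    import numpy as np
    from ngboost import NGBRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X @ rng.normal(size=5) + rng.normal(size=500)

    # Chronological split, as in a typical backtest: validate on the tail.
    X_train, y_train = X[:400], y[:400]
    X_val, y_val = X[400:], y[400:]

    ngb = NGBRegressor(early_stopping_rounds=10, validation_fraction=0.1)
    # Despite X_val/Y_val being passed here, partial_fit re-splits the
    # training data with train_test_split and validates on that random
    # subset instead of on the chronological tail.
    ngb.fit(X_train, y_train, X_val=X_val, Y_val=y_val)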

@jerome-f

This seems to be because of the logic in partial_fit that handles sample_weight:

        # if early stopping is specified, split X,Y and sample weights (if given) into training and validation sets
        # This will overwrite any X_val and Y_val values passed by the user directly.
        if self.early_stopping_rounds is not None:
            early_stopping_rounds = self.early_stopping_rounds

            if sample_weight is None:
                X, X_val, Y, Y_val = train_test_split(
                    X,
                    Y,
                    test_size=self.validation_fraction,
                    random_state=self.random_state,
                )

            else:
                X, X_val, Y, Y_val, sample_weight, val_sample_weight = train_test_split(
                    X,
                    Y,
                    sample_weight,
                    test_size=self.validation_fraction,
                    random_state=self.random_state,
                )

So whenever early stopping is enabled, user-passed validation data gets ignored; the sample_weight branch only changes how the split is done. You could override partial_fit in your own code as a workaround, but I do not see why it was implemented this way. It should be:

        if self.early_stopping_rounds is not None and (X_val is None or Y_val is None):
            if sample_weight is None:
                X, X_val, Y, Y_val = train_test_split(
                    X,
                    Y,
                    test_size=self.validation_fraction,
                    random_state=self.random_state,
                )
            else:
                X, X_val, Y, Y_val, sample_weight, val_sample_weight = train_test_split(
                    X,
                    Y,
                    sample_weight,
                    test_size=self.validation_fraction,
                    random_state=self.random_state,
                )
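For what it's worth, here is the guard factored out as a standalone helper (resolve_validation_split is a hypothetical name, not part of ngboost) so the intended behavior is easy to test: the internal split happens only when early stopping is requested and the caller did not already supply validation data.

    from sklearn.model_selection import train_test_split

    def resolve_validation_split(X, Y, X_val=None, Y_val=None, sample_weight=None,
                                 early_stopping_rounds=None, validation_fraction=0.1,
                                 random_state=None):
        # Hypothetical helper mirroring the proposed fix: split off a
        # validation set only when early stopping is on AND the caller
        # did not provide one.
        val_sample_weight = None
        if early_stopping_rounds is not None and (X_val is None or Y_val is None):
            if sample_weight is None:
                X, X_val, Y, Y_val = train_test_split(
                    X, Y,
                    test_size=validation_fraction,
                    random_state=random_state,
                )
            else:
                X, X_val, Y, Y_val, sample_weight, val_sample_weight = train_test_split(
                    X, Y, sample_weight,
                    test_size=validation_fraction,
                    random_state=random_state,
                )
        return X, Y, X_val, Y_val, sample_weight, val_sample_weight

With this guard, a user-supplied chronological split survives untouched, and the random split remains the default for users who pass nothing.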

@alejandroschuler (Collaborator)

Make a PR!

@jerome-f

@alejandroschuler happy to raise a PR, but is there a reason why the initial implementation handled sample weights this way? One point to note: sample weights only contribute to the base-learner dynamics and the NLL calculation. They cancel out against the Fisher information, so sample weights have no impact on the natural gradients. The NGBoost base class accepts val_sample_weight, but I am not sure if this is a bug or a feature.
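For the record, a sketch of the cancellation described above, assuming the weight enters as a simple per-observation multiplier on the score: for observation i, the gradient of the weighted score is w_i times the unweighted gradient, and the Fisher information of the weighted score scales the same way, so the natural gradient is unchanged:

    (w_i \, \mathcal{I}(\theta))^{-1} \, w_i \, \nabla_\theta \ell(\theta, y_i) = \mathcal{I}(\theta)^{-1} \nabla_\theta \ell(\theta, y_i)

The weight therefore only matters where per-observation gradients are aggregated, i.e., in the base-learner fit and the NLL evaluation.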

@alejandroschuler (Collaborator)

Honestly I do not remember if there was a reason! At this point your judgement is as good as mine.
