Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: model_selection module docstrings #732

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mkalimeri
Copy link
Contributor

@mkalimeri mkalimeri commented Jan 22, 2025

Description

Related to #596

Examples added for

  • model_selection.TimeGapSplit
  • model_selection.GroupTimeSeriesSplit
  • model_selection.ClusterFoldValidation (I see that KlusterFoldValidation has been renamed as ClusterFoldValidation)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the style guidelines (ruff)
  • I have commented my code, particularly in hard-to-understand areas
  • [NA] I have made corresponding changes to the documentation (also to the readme.md)
  • [NA] I have added tests that prove my fix is effective or that my feature works
  • [NA] I have added tests to check whether the new feature adheres to the sklearn convention
  • New and existing unit tests pass locally with my changes

If you feel your PR is ready for a review, ping @FBruzzesi or @koaning.

@FBruzzesi FBruzzesi changed the title Update model_selection.py docs: model_selection module docstrings Jan 24, 2025
Copy link
Collaborator

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @mkalimeri!

The only concern I have is related to the part of code in TimeGapSplit summary. @koaning WDYT?
We can keep it as a follow up, but that's certainly not the right assumption to make in this context

### Start date 2024-01-01 00:00:00
### End date 2024-01-01 00:00:09
### Period 0 days 00:00:09
### Unique days 10
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this returning 10 days? 🤔

Edit: Looking into the code, this is fairly random assumption that the number of unique dates are in days.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should not be unique days but maybe timepoints? Shall we leave this as is for now and open another issue to generate additional/different information in the summary? Happy to pick that up!

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a nit that would be nice to tackle. I would also totally be open to doing that in this PR, less waiting for merges that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can do that. I will come back to you with a plan of what information I would include. Updating this would require changes in tests as well, so it will be more than one file. Is that ok?

Copy link
Contributor Author

@mkalimeri mkalimeri Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I got to take a look at this today. As @FBruzzesi mentioned, we can't assume that the number of days is relevant to all use cases. We could keep the field and correct the calculation, to give an idea of the frequency of data, but it might not be relevant if the granularity of the time series is lower. Another idea is to add the granularity of the time series as a field of the summary. What are your thoughts?

# Create dataset
np.random.seed(1)
num_rows = 30
df = pd.DataFrame(np.random.randn(num_rows, 4)).rename(columns={0: 'c1', 1: 'c2', 2: 'c3', 3: 'c4'})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can pass the column names directly:

Suggested change
df = pd.DataFrame(np.random.randn(num_rows, 4)).rename(columns={0: 'c1', 1: 'c2', 2: 'c3', 3: 'c4'})
df = pd.DataFrame(np.random.randn(num_rows, 4), columns=["c1", "c2", "c3", "c4"])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants