
Improve test coverage in the catboost integration tests #18126

Open · wants to merge 4 commits into base: branch-25.04
Conversation

@Matt711 (Contributor) commented Feb 27, 2025

Description

Follow-up to #17931. This PR tests catboost with categorical features and increases the size of the test data we fit our catboost models on.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels on Feb 27, 2025
@Matt711 Matt711 requested a review from a team as a code owner February 27, 2025 22:46
@github-actions bot added the Python (Affects Python cuDF API) and cudf.pandas (Issues specific to cudf.pandas) labels on Feb 27, 2025
@vyasr (Contributor) commented Feb 28, 2025

@jameslamb you had some thoughts on the last PR, any interest in taking another stab at reviewing this one? ;)

@jameslamb (Member) left a comment

Thanks for the @, and for doing this follow-up PR!

I left a few more suggestions for your consideration. Leaving a non-blocking review... they're all things that shouldn't block a merge and that you can choose whether or not to do based on how much time you want to invest in this.

Comment on lines +107 to +108
"fea_2": rng.choice([0, 1, 2, 3], size=1000),
"fea_3": rng.choice(["A", "B", "C"], size=1000),

Suggested change
"fea_2": rng.choice([0, 1, 2, 3], size=1000),
"fea_3": rng.choice(["A", "B", "C"], size=1000),
"cat_feat_int_dtype": rng.choice([0, 1, 2, 3], size=1000),
"cat_feat_categorical_dtype": rng.choice(["A", "B", "C"], size=1000),

Would you consider a change like this, or a code comment capturing this information if you'd prefer? Something really important is happening here...

  • fea_2 is testing "using an int column as a categorical feature works"
  • fea_3 is testing "using a categorical column as a categorical feature works"

(based on #17931 (comment))

But I don't think that important difference would be obvious to someone looking at this test for the first time who isn't familiar with these boosting libraries.

"categorical_feature": rng.choice(["A", "B", "C"], size=100),
"target": rng.integers(0, 2, size=100),
"fea_1": rng.standard_normal(1000),
"fea_2": rng.choice([0, 1, 2, 3], size=1000),

Choosing these values randomly like this is a risk. The categorical features may be noise to catboost, meaning they wouldn't be chosen for any splits. And if they aren't chosen for any splits, then this test wouldn't be able to catch bugs in encoding/decoding that lead to the wrong representation (because predict() would effectively ignore the values in these columns).

It's for concerns like this that the xgboost example code I shared tries to create categorical features that have some relationship to the target: https://github.com/dmlc/xgboost/blob/105aa4247abb3ce787be2cef2f9beb4c24b30049/demo/guide-python/categorical.py#L40-L45

For these tests, I think the easiest way to do that would be something like this:

  • use sklearn.datasets.make_classification() to generate the initial dataset (features there are not totally random... they have some relationship to the target)
  • to get an int categorical feature, replace one of those continuous columns with a discretized version (say, split it into deciles with np.quantile() and np.digitize() or something)... e.g. values in the lowest 10% coded as 0, the next 10-20% as 1, etc.
  • to get a category feature, repeat that procedure (but with a different feature's values, and using a different cardinality), but instead of ints use strings like "low", "medium", "high" (or whatever you want) and convert to category dtype

And then in the test, assert that all the features were used at least once. I don't know the exact API for this in catboost... in xgboost / lightgbm it is a property on the estimator called .feature_importances_, like in this example: https://stackoverflow.com/a/79435055/3986677

I know this seems like a lot, but I think it's important. When given a pandas-like input to .predict(), catboost will try to do the same types of transformation as it did on the training data. There's potential for bugs in this integration there that would only be caught at predict() time, and only if the model actually uses the categorical features.
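
A minimal sketch of what that could look like (illustrative only, not code from this PR; the column names, the bin counts, the use of CatBoostClassifier here, and the feature_importances_ attribute are assumptions to be checked against catboost's docs):

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

# Generate features that actually carry some signal about the target.
X, y = make_classification(
    n_samples=1_000, n_features=4, n_informative=4, n_redundant=0, random_state=708
)
df = pd.DataFrame(X, columns=["fea_0", "fea_1", "fea_2", "fea_3"])

# Int categorical feature: discretize one informative column into deciles (codes 0-9).
decile_edges = np.quantile(df["fea_2"], np.linspace(0.1, 0.9, 9))
df["cat_feat_int_dtype"] = np.digitize(df["fea_2"], decile_edges)

# category-dtype feature: discretize another column into three labeled bins.
tercile_edges = np.quantile(df["fea_3"], [1 / 3, 2 / 3])
labels = np.array(["low", "medium", "high"])
df["cat_feat_categorical_dtype"] = pd.Categorical(
    labels[np.digitize(df["fea_3"], tercile_edges)]
)

# Drop the continuous columns the categorical features were derived from.
features = df.drop(columns=["fea_2", "fea_3"])

model = CatBoostClassifier(iterations=50, random_state=708, verbose=0)
model.fit(
    features,
    y,
    cat_features=["cat_feat_int_dtype", "cat_feat_categorical_dtype"],
)

# Assert every feature was used for at least one split; the exact importance
# attribute on catboost estimators is an assumption here, not confirmed in the PR.
assert (np.asarray(model.feature_importances_) > 0).all()
```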

],
)
def test_catboost_train_test_split(X, y):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = CatBoostRegressor(iterations=10, verbose=0)
model = CatBoostRegressor(iterations=50, verbose=0)

Suggested change
model = CatBoostRegressor(iterations=50, verbose=0)
model = CatBoostRegressor(iterations=50, random_state=708, verbose=0)

For all of these tests, you should pass in a non-0 seed to control the randomness internal to catboost. I'm not certain if catboost training is deterministic given the parameters you've supplied, but this is still a good idea just to be safe and reduce the risk of flakiness in these tests.
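
For illustration, a minimal sketch of sharing one fixed non-zero seed between the synthetic data and the estimators (the SEED constant is hypothetical, not something in the PR):

```python
import numpy as np
from catboost import CatBoostRegressor

SEED = 708  # hypothetical module-level constant; any fixed non-zero value works

rng = np.random.default_rng(seed=SEED)  # seeds the synthetic test data
model = CatBoostRegressor(iterations=50, random_state=SEED, verbose=0)  # seeds catboost's internal randomness
```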
