Improve test coverage in the catboost integration tests #18126
base: branch-25.04
Conversation
@jameslamb you had some thoughts on the last PR, any interest in taking another stab at reviewing this one ;)
Thanks for the @-mention, and for doing this follow-up PR!
I left a few more suggestions for your consideration. Leaving a non-blocking review... they're all things that shouldn't block a merge and that you can choose whether or not to do based on how much time you want to invest in this.
"fea_2": rng.choice([0, 1, 2, 3], size=1000), | ||
"fea_3": rng.choice(["A", "B", "C"], size=1000), |
"fea_2": rng.choice([0, 1, 2, 3], size=1000), | |
"fea_3": rng.choice(["A", "B", "C"], size=1000), | |
"cat_feat_int_dtype": rng.choice([0, 1, 2, 3], size=1000), | |
"cat_feat_categorical_dtype": rng.choice(["A", "B", "C"], size=1000), |
Would you consider a change like this, or a code comment capturing this information if you'd prefer? Something really important is happening here...

- `fea_2` is testing "using an `int` column as a categorical feature works"
- `fea_3` is testing "using a `categorical` column as a categorical feature works"

(based on #17931 (comment))

But I don't think that important difference would be obvious to someone looking at this test for the first time who isn't familiar with these boosting libraries.
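For illustration, a minimal sketch (mine, not the PR's code) of the two column flavors side by side; the explicit dtype="category" conversion is my assumption about how the second column would be built:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(
    {
        # an int-dtype column used as a categorical feature
        "cat_feat_int_dtype": rng.choice([0, 1, 2, 3], size=1000),
        # a pandas 'category'-dtype column used as a categorical feature
        "cat_feat_categorical_dtype": pd.Series(
            rng.choice(["A", "B", "C"], size=1000), dtype="category"
        ),
    }
)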
"categorical_feature": rng.choice(["A", "B", "C"], size=100), | ||
"target": rng.integers(0, 2, size=100), | ||
"fea_1": rng.standard_normal(1000), | ||
"fea_2": rng.choice([0, 1, 2, 3], size=1000), |
Choosing these values randomly like this is a risk. The categorical features may be noise to `catboost`, meaning they wouldn't be chosen for any splits. And if they aren't chosen for any splits, then this test wouldn't be able to catch bugs in encoding/decoding that lead to the wrong representation (because `predict()` would effectively ignore the values in these columns).

It's for concerns like this that the `xgboost` example code I shared tries to create categorical features that have some relationship to the target: https://github.com/dmlc/xgboost/blob/105aa4247abb3ce787be2cef2f9beb4c24b30049/demo/guide-python/categorical.py#L40-L45
For these tests, I think the easiest way to do that would be something like this (a sketch follows the list):

- use `sklearn.datasets.make_classification()` to generate the initial dataset (features there are not totally random... they have some relationship to the target)
- to get an `int` categorical feature, replace one of those continuous columns with a discretized version (say, split it into deciles with `np.quantile()` and `np.digitize()` or something)... e.g. values in the lowest 10% coded as `0`, next 10-20% as `1`, etc.
- to get a `category` feature, repeat that procedure (but with a different feature's values, and using a different cardinality), but instead of ints use strings like `"low"`, `"medium"`, `"high"` (or whatever you want) and convert to `category` dtype
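Here's a rough sketch of that recipe; the column names, seed, and cardinalities are illustrative assumptions, not the PR's actual code:

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# features from make_classification() carry signal about the target
X, y = make_classification(n_samples=1000, n_features=4, random_state=708)
df = pd.DataFrame(X, columns=[f"fea_{i}" for i in range(4)])

# int categorical feature: discretize one column into deciles (values 0-9),
# preserving its relationship to the target
decile_edges = np.quantile(df["fea_2"], np.linspace(0.1, 0.9, 9))
df["fea_2"] = np.digitize(df["fea_2"], decile_edges)

# 'category' feature: same idea on a different column, lower cardinality,
# with string labels and an explicit conversion to 'category' dtype
tercile_edges = np.quantile(df["fea_3"], [1 / 3, 2 / 3])
labels = np.array(["low", "medium", "high"])
df["fea_3"] = pd.Series(
    labels[np.digitize(df["fea_3"], tercile_edges)], dtype="category"
)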
And then in the test, assert that all the features were used at least once. I don't know the exact API for this in `catboost`... in `xgboost` / `lightgbm` it is a property on the estimator called `.feature_importances_`, like in this example: https://stackoverflow.com/a/79435055/3986677
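For `catboost`, something along these lines might work; `get_feature_importance()` does exist on fitted catboost models, but whether an unused feature always shows up as exactly-zero importance is an assumption worth verifying:

# sketch: assert every feature was chosen for at least one split
importances = model.get_feature_importance()  # default importance type
assert (importances > 0).all(), f"unused feature(s) detected: {importances}"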
I know this seems like a lot, but I think it's important. When given a `pandas`-like input to `.predict()`, `catboost` will try to do the same types of transformation as it did on the training data. There's potential for bugs in this integration there, that'd only be caught at `predict()` time by ensuring the model actually uses the categorical features.
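To make that concrete, a hedged sketch of the end-to-end path (names like X_train and the cat_features list are assumptions standing in for the test's actual variables):

from catboost import CatBoostRegressor

# catboost re-applies its categorical-feature handling inside predict(),
# so predict() on pandas-like input is where encoding bugs would surface
model = CatBoostRegressor(iterations=50, random_state=708, verbose=0)
model.fit(X_train, y_train, cat_features=["fea_2", "fea_3"])
preds = model.predict(X_test)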
    ],
)
def test_catboost_train_test_split(X, y):
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
-   model = CatBoostRegressor(iterations=10, verbose=0)
+   model = CatBoostRegressor(iterations=50, verbose=0)
Suggested change:
- model = CatBoostRegressor(iterations=50, verbose=0)
+ model = CatBoostRegressor(iterations=50, random_state=708, verbose=0)
For all of these tests, you should pass in a non-0 seed to control the randomness internal to `catboost`. I'm not certain if `catboost` training is deterministic given the parameters you've supplied, but this is still a good idea just to be safe and reduce the risk of flakiness in these tests.
Description
Follow-up of #17931. This PR tests `catboost` with categorical features and increases the size of the test data we fit our catboost models on.
Checklist