Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] scikit-learn fit() methods: add eval_X, eval_y, deprecate eval_set #6857

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

lorentzenchr
Copy link
Contributor

@lorentzenchr lorentzenchr commented Mar 9, 2025

As discussed in scikit-learn/scikit-learn#28901 (comment), this PR adds eval_X and eval_y in order to make LGBM estimators compatible with scikit-learn's (as of version 1.6) Pipeline(..., transform_input=["eval_X"]).

See also scikit-learn/scikit-learn#27124.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It's looking like you're struggling to get this passing CI, so I'm going to put into "draft" for now. @ me any time here if you need help with development, and we can open this back up for review once CI is passing.

I saw you had multiple commits responding to linting errors... here's how to run those locally for faster feedback:

# (or conda, whatever you want)
pip install pre-commit
pre-commit run --all-files

And here's how to build locally and run the tests:

# step 1: compile lib_ligihgbtm
# (only need to do this once, because you're not making any C/C++ changes)
cmake -B build -S .
cmake --build build --target _lightgbm -j4

# step 2: install the Python package, re-using it
# (do this every time you change Python code in the library)
sh build-python.sh install --precompile

# step 3: run the scikit-learn tests
pytest tests/python_package_test/test_sklearn.py

@jameslamb jameslamb marked this pull request as draft March 9, 2025 23:15
@lorentzenchr
Copy link
Contributor Author

@jameslamb Thanks for your suggestions.
Could you already comment on the deprecation strategy, raising a warning?
Then, should I adapt all the (scikit-learn api) python tests and replace eval_set by the new eval_X, eval_y (thereby avoiding the warnings in the tests)?

@jameslamb
Copy link
Collaborator

Could you already comment on the deprecation strategy, raising a warning?

Making both options available for a time and raising a deprecation warning when eval_set if non-empty seems fine to me, if we decide to move forward with this. I'd also support a runtime error when both eval_set and eval_X are non-empty, to avoid taking on the complexity of merging those 2 inputs.

I'm sorry but I cannot invest much time in this right now (for example, looking into whether this would introduce inconsistencies with HistGradientBoostingClassifier, XGBoost, or CatBoost). If you want to see this change, getting CI working and then opening it up for others to review is probably the best path.

should I adapt all the (scikit-learn api) python tests and replace eval_set by the new eval_X, eval_y (thereby avoiding the warnings in the tests)?

No, please. As I said in scikit-learn/scikit-learn#28901 (comment), removing eval_set from LightGBM's scikit-learn estimators would be highly disruptive and requires a long deprecation cycle (more than a year, in my opinion). Throughout that time, we need to continue to test it at least as thoroughly as we have been.

@lorentzenchr
Copy link
Contributor Author

@jameslamb I'm sorry, I really need a maintainer's help. The tests in tests/python_package_test/test_dask.py fail even on the master branch, locally on my computer. I tried to play with different versions of dask, numpy, scipy, scikit-learn, no success.
TLDR: CI failure seems to be in master and not caused by this PR.

pytest -x tests/python_package_test/test_dask.py::test_ranker
...
>           dask_ranker = dask_ranker.fit(dX, dy, sample_weight=dw, group=dg)

tests/python_package_test/test_dask.py:714: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../python3_lgbm/lib/python3.11/site-packages/lightgbm/dask.py:1566: in fit
    self._lgb_dask_fit(
../python3_lgbm/lib/python3.11/site-packages/lightgbm/dask.py:1068: in _lgb_dask_fit
    model = _train(
../python3_lgbm/lib/python3.11/site-packages/lightgbm/dask.py:811: in _train
    results = client.gather(futures_classifiers)
../python3_lgbm/lib/python3.11/site-packages/distributed/client.py:2565: in gather
    return self.sync(
../python3_lgbm/lib/python3.11/site-packages/lightgbm/dask.py:205: in _train_part
    data = _concat([x["data"] for x in list_of_parts])

...

 def _concat(seq: List[_DaskPart]) -> _DaskPart:
        if isinstance(seq[0], np.ndarray):
            return np.concatenate(seq, axis=0)
        elif isinstance(seq[0], (pd_DataFrame, pd_Series)):
            return concat(seq, axis=0)
        elif isinstance(seq[0], ss.spmatrix):
            return ss.vstack(seq, format="csr")
        else:
>           raise TypeError(
                f"Data must be one of: numpy arrays, pandas dataframes, sparse matrices (from scipy). Got {type(seq[0]).__name__}."
            )
E           TypeError: Data must be one of: numpy arrays, pandas dataframes, sparse matrices (from scipy). Got tuple.

../python3_lgbm/lib/python3.11/site-packages/lightgbm/dask.py:159: TypeError

@jameslamb
Copy link
Collaborator

What versions of dask / distributed do you have installed?

Searching that error in the issue tracker here has a match: #6739

I suspect you need to pin to dask<2024.12 in your environment, as we do in CI here (#6742).

@lorentzenchr
Copy link
Contributor Author

@jameslamb Thank you so much. Pinning dask<2024.12 worked fine.

@lorentzenchr lorentzenchr marked this pull request as ready for review March 10, 2025 21:34
@lorentzenchr
Copy link
Contributor Author

The remaining CI failures seem unrelated.
TODO for myself: Improve test coverage a bit.

@jameslamb jameslamb changed the title ENH add eval_X, eval_y, deprecate eval_set [python-package] scikit-learn fit() methods: add eval_X, eval_y, deprecate eval_set Mar 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants