Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing sklearn set_output API for BaseEstimator s #1093

Merged
merged 2 commits into from
Jul 29, 2024

Conversation

martinroyer
Copy link
Collaborator

Simple example showing how to implement scikit-learn's set_output API and an example test.

This outputs diagrams nicely:

>>> RipsPersistence(homology_dimensions=[0, 2], n_jobs=-2).set_output(transform="pandas").fit_transform(point_clouds)
                                       H0  H2
0  [[0.0, 10.456032537513941], [0.0, inf]]  []
1  [[0.0, 12.925628779692705], [0.0, inf]]  []
2  [[0.0, 11.126843373147965], [0.0, inf]]  []
3  [[0.0, 12.647790894762348], [0.0, inf]]  []
4  [[0.0, 11.340625000427952], [0.0, inf]]  []

Let me add that although this PR is very simple, I am a bit sore that it took so long for me to write for the following reasons:

  • I am on windows (granted this is on me)
  • the test_sklearn_rips_persistence.py generates points using gudhi.datasets.generators.points
  • this gudhi.datasets.generators.points imports from gudhi.datasets.generators._points.cc which if I understood correctly requires CGAL in the gudhi computation step
  • CGAL is currently quite hard to install on some windows : for some reason if installed with conda install cgal-cpp, conda will not also install GMP (conda claims specs are incompatible) so gudhi will not be happy on compilation because it cannot find GMP, if installed through vcpkg the install will be stuck on Building x64-windows-dbg (see vcpkg getting stuck at installing package ("Building x64-windows-dbg") microsoft/vcpkg#31181 for detailed symptoms resembling mine), and installing GMP any other way seems nearly impossible on windows.
  • but all this turned to be actually unnecessary, as the only reason for gudhi.datasets.generators.points to import the _points.cc is to ... generate points on a sphere or a torus, which I reckon can easily be done in python too!

So we could probably get rid of this dependency if I understood that correctly...

Copy link
Member

@mglisse mglisse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, except for one detail: @VincentRouvreau do we need to disable the test if pandas is not installed? (we have things like if(SKLEARN_FOUND) before the add_tests in python/CMakeLists.txt)

@mglisse mglisse modified the milestone: uch Jun 27, 2024
@mglisse
Copy link
Member

mglisse commented Jun 27, 2024

  • CGAL is currently quite hard to install on some windows : for some reason if installed with conda install cgal-cpp, conda will not also install GMP (conda claims specs are incompatible) so gudhi will not be happy on compilation because it cannot find GMP

On windows, gudhi and cgal normally use mpir instead of gmp (mpir should install files gmp.h, libgmp.*, etc for compatibility). If that doesn't work, it is worth investigating.

  • but all this turned to be actually unnecessary, as the only reason for gudhi.datasets.generators.points to import the _points.cc is to ... generate points on a sphere or a torus, which I reckon can easily be done in python too!

The policy so far has been that CGAL is a dependency for a core part of Gudhi (alpha-complex). And once it is there, we might as well use it, users don't care much where it comes from, but they might (ok, maybe not here) care if it becomes 10x slower, or less reliable.

It could make sense to try and simplify things for people who work on pure Python parts of Gudhi, by reusing a pre-built package (so they wouldn't even need a C++ compiler), but I don't know how hard that would be.

@VincentRouvreau
Copy link
Contributor

Looks good to me, except for one detail: @VincentRouvreau do we need to disable the test if pandas is not installed? (we have things like if(SKLEARN_FOUND) before the add_tests in python/CMakeLists.txt)

Yes this test should not be run if pandas is not installed (as it is an optinal dependency).
There 2 ways to do so:

  • Modify your test like that:
def test_set_output():
    try:
        import pandas
        NB_PC = 5
        point_clouds = [points.sphere(n_samples=random.randint(100, 150), ambient_dim=2) for _ in range(NB_PC)]
    
        rips = RipsPersistence(homology_dimensions=[0, 2], n_jobs=-2)
        diags_pandas = rips.set_output(transform="pandas").fit_transform(point_clouds)
        assert 'H0' == diags_pandas.columns[0]
        assert 'H2' == diags_pandas.columns[1]
        assert len(diags_pandas.index) == NB_PC
    except ImportError:
        pass
  • Separate your test in a dedicated file and add some cmake magic to run it only when pandas is installed (I can help if you choose this option)

@martinroyer
Copy link
Collaborator Author

martinroyer commented Jun 28, 2024

Looks good to me, except for one detail: @VincentRouvreau do we need to disable the test if pandas is not installed? (we have things like if(SKLEARN_FOUND) before the add_tests in python/CMakeLists.txt)

Yes this test should not be run if pandas is not installed (as it is an optinal dependency). There 2 ways to do so:

  • Modify your test like that:
def test_set_output():
    try:
        import pandas
        NB_PC = 5
        point_clouds = [points.sphere(n_samples=random.randint(100, 150), ambient_dim=2) for _ in range(NB_PC)]
    
        rips = RipsPersistence(homology_dimensions=[0, 2], n_jobs=-2)
        diags_pandas = rips.set_output(transform="pandas").fit_transform(point_clouds)
        assert 'H0' == diags_pandas.columns[0]
        assert 'H2' == diags_pandas.columns[1]
        assert len(diags_pandas.index) == NB_PC
    except ImportError:
        pass
  • Separate your test in a dedicated file and add some cmake magic to run it only when pandas is installed (I can help if you choose this option)

First option looks simple and therefore very good to me!

@martinroyer
Copy link
Collaborator Author

  • CGAL is currently quite hard to install on some windows : for some reason if installed with conda install cgal-cpp, conda will not also install GMP (conda claims specs are incompatible) so gudhi will not be happy on compilation because it cannot find GMP

On windows, gudhi and cgal normally use mpir instead of gmp (mpir should install files gmp.h, libgmp.*, etc for compatibility). If that doesn't work, it is worth investigating.

So maybe I was wrong in thinking I needed to install GMP then. The thing is, with cgal-cpp installed via conda, on running the cmake command it fails with:

-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.22631.
-- The C compiler identification is MSVC 19.40.33811.0
-- The CXX compiler identification is MSVC 19.40.33811.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.40.33807/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.40.33807/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- GUDHI version : 3.10.0rc2
CMake Warning (dev) at src/cmake/modules/GUDHI_third_party_libraries.cmake:3 (find_package):
  Policy CMP0167 is not set: The FindBoost module is removed.  Run "cmake
  --help-policy CMP0167" for policy details.  Use the cmake_policy command to
  set the policy and suppress this warning.

Call Stack (most recent call first):
  CMakeLists.txt:20 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Boost toolset is unknown (compiler MSVC 19.40.33811.0)
-- Boost toolset is unknown (compiler MSVC 19.40.33811.0)
-- Boost toolset is unknown (compiler MSVC 19.40.33811.0)
++ BOOST version 1.84.0. Includes found in C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/include, libraries found in C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/lib
-- Could NOT find GMP (missing: GMP_LIBRARIES GMP_INCLUDE_DIR)
-- Visual Leak Detector (VLD) is not found.
-- Using header-only CGAL
CMake Error at C:/Program Files/CMake/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
  Could NOT find GMP (missing: GMP_LIBRARIES GMP_INCLUDE_DIR)
Call Stack (most recent call first):
  C:/Program Files/CMake/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:603 (_FPHSA_FAILURE_MESSAGE)
  src/cmake/modules/FindGMP.cmake:54 (find_package_handle_standard_args)
  C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/lib/cmake/CGAL/CGAL_SetupGMP.cmake:24 (find_package)
  C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/lib/cmake/CGAL/CGAL_SetupCGALDependencies.cmake:37 (include)
  C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/lib/cmake/CGAL/CGALConfig.cmake:168 (include)
  src/cmake/modules/GUDHI_third_party_libraries.cmake:30 (find_package)
  CMakeLists.txt:20 (include)


-- Configuring incomplete, errors occurred!

Not sure what to do then...

@mglisse
Copy link
Member

mglisse commented Jun 28, 2024

    except ImportError:
        pass

Maybe print("Missing pandas, skipping set_output test") instead of pass?

@mglisse
Copy link
Member

mglisse commented Jun 28, 2024

Not sure what to do then...

Hmm, we'll need to investigate that 😞
Can you confirm that the package mpir is installed?
In C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/include, do you have files mpir.h and gmp.h? What libraries with similar names do you have in lib/ or bin/?

@mglisse
Copy link
Member

mglisse commented Jun 28, 2024

@VincentRouvreau IIRC you adapted FindGMP.cmake from CGAL? The one in CGAL seems to have had a few patches since then, maybe we need them as well...

There could also be bad interactions where CGAL_SetupGMP.cmake calls find_package(GMP) expecting its own FindGMP.cmake to be used, but ours ends up earlier in the path, so it has to be compatible.

@martinroyer
Copy link
Collaborator Author

Not sure what to do then...

Hmm, we'll need to investigate that 😞 Can you confirm that the package mpir is installed? In C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/include, do you have files mpir.h and gmp.h? What libraries with similar names do you have in lib/ or bin/?

Yep those two are there, mpir.h and gmp.h, and bin/ has gmp.dll, mpir.dll, and lib/ has gmp.lib and mpir.lib.

@martinroyer
Copy link
Collaborator Author

and mpir is installed, mpir 3.0.0 he025d50_1002 conda-forge

@mglisse
Copy link
Member

mglisse commented Jun 28, 2024

Strange, on windows, I created a fresh miniforge environment, installed cgal-cpp, cmake and git in this env, cloned the gudhi-devel repository, went to a build subdir, ran cmake .., and it did find GMP just fine...
Maybe you could run cmake with --debug-find-pkg=GMP so it tells us more?
The cmake you are using is the one from conda?

@martinroyer
Copy link
Collaborator Author

martinroyer commented Jun 28, 2024

Strange, on windows, I created a fresh miniforge environment, installed cgal-cpp, cmake and git in this env, cloned the gudhi-devel repository, went to a build subdir, ran cmake .., and it did find GMP just fine... Maybe you could run cmake with --debug-find-pkg=GMP so it tells us more? The cmake you are using is the one from conda?

Oh good find indeed! I get that find_path considered the following locations:

C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/include/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/mingw-w64/bin/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/mingw-w64/bin/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/usr/bin/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/usr/bin/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/bin/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/bin/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Scripts/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Scripts/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/bin/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/bin/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/condabin/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/condabin/gmp.h

... but none of these locations is C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/include where the file really is! Any idea how to solve this?

@martinroyer
Copy link
Collaborator Author

I used a cmake I downloaded and installed, let's try with one from conda.

@martinroyer
Copy link
Collaborator Author

I used a cmake I downloaded and installed, let's try with one from conda.

... and that did it for me! -- Found GMP: C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/lib/gmp.lib. I could install and run the .cc files that I couldn't before.
Thanks Marc.

@VincentRouvreau VincentRouvreau added the 3.11.0 GUDHI version 3.11.0 label Jul 29, 2024
@VincentRouvreau VincentRouvreau merged commit b4977a2 into GUDHI:master Jul 29, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11.0 GUDHI version 3.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants