Switch from conda-build to rattler-build #374

Open
wants to merge 42 commits into base: branch-0.43

Conversation

@gforsyth (Contributor) commented Feb 13, 2025

Porting the conda-build recipes over to rattler-build.

Contributes to rapidsai/build-planning#47

@gforsyth added the improvement (Improves an existing functionality) label Feb 13, 2025
@gforsyth requested a review from a team as a code owner February 13, 2025 19:39
@gforsyth added the non-breaking (Introduces a non-breaking change) label Feb 13, 2025
version: ${{ version }}
build:
  string: cuda${{ cuda_major }}_${{ date_string }}_${{ head_rev }}
  script: install_libucxx.sh
Contributor Author (gforsyth):

There are comments in this build-script that I'm not sure how to preserve in the script directive here, so I've opted to leave that standalone script in place.

Contributor:

Maybe my proposal to use a string-block above would work here.

Contributor:

I see cmake --install cpp/build --component benchmarks in that script.

It might be worth checking that we intend to ship benchmarks with the main library here, as opposed to in a separate package.

Contributor:

Also the comments in that script about clobbering (#20) should be resolved by including librmm in the host env. The librmm package should ship all the pieces that were being clobbered, and the build system should recognize not to package those because they came from a dependency.

Let's try to resolve that problem in this PR. I think it shouldn't be too difficult.
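
For illustration, adding librmm to the host requirements could look roughly like this (a sketch only; the exact librmm pin should follow whatever the repo already uses elsewhere):

  host:
    - cuda-version =${{ cuda_version }}
    # librmm in host supplies the files that were previously being clobbered, so the
    # build should recognize them as coming from a dependency and not repackage them
    - librmm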

host:
  - cuda-version =${{ cuda_version }}
  - if: cuda_major == "11"
    then: cuda-cudart-dev =${{ cuda_version }}
Contributor Author (@gforsyth, Feb 13, 2025):

This fixed CUDA 11 builds for me locally. See below for the failure without this pin, where the ucxx import tests are looking for libcudart.so.12. I'm not sure whether cuda-cudart-dev is constrained (or unconstrained) in some odd way version-wise, or whether rattler-build is doing something unexpected with version solving.

(from https://github.com/rapidsai/ucxx/actions/runs/13318054370/job/37196857183)

 │ │ Installing test environment
 │ │  Successfully updated the test environment
 │ │ + python $SRC_DIR/conda_build_script.py
 │ │ Traceback (most recent call last):
 │ │   File "$SRC_DIR/conda_build_script.py", line 1, in <module>
 │ │     import ucxx
 │ │   File "$PREFIX/lib/python3.10/site-packages/ucxx/__init__.py", line 30, in <module>
 │ │     from . import exceptions, types, testing  # noqa
 │ │   File "$PREFIX/lib/python3.10/site-packages/ucxx/exceptions.py", line 4, in <module>
 │ │     from ._lib.libucxx import (  # noqa
 │ │   File "$PREFIX/lib/python3.10/site-packages/ucxx/_lib/__init__.py", line 4, in <module>
 │ │     from .libucxx import _create_exceptions
 │ │ ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
 │ │ × error Script failed with status 1
 │ │ × error Work directory: '/tmp/.tmpn812nb'
 │ │ × error To debug the build, run it manually in the work directory (execute the `./conda_build.sh` or `conda_build.bat` script)

Contributor:

tl;dr CUDA 11 packages aren't great.

I'm not sure exactly what's happening right now, but I think you're winding up with unconstrained nvidia channel packages. This shows up in CI even on the latest (passing) CUDA 11 builds like this one. If you look at the cuda-cudart-dev packages on conda-forge, there aren't any for CUDA < 12. Presumably we're being saved at runtime by something else bringing in cudatoolkit, most likely librmm. Note that in the old recipe the condition for cuda-cudart-dev was {% if cuda_major != "11" %}. For CUDA 11 we still rely on some CUDA libraries from the host system (not installed via conda).

Let's flip this constraint to != 11 and try again.

Separately, we should double-check if we even need to include the nvidia channel any more. I've lost track of the current state of things and would need to take 20 mins to refresh myself but I believe we may not need it any more.
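
For concreteness, flipping the condition as suggested would look roughly like this (sketch; whether the CUDA 11 branch then needs cudatoolkit instead is discussed further below):

  host:
    - cuda-version =${{ cuda_version }}
    # only pull cuda-cudart-dev from conda-forge on CUDA 12+; CUDA 11 still relies on
    # CUDA libraries from the host system (or cudatoolkit)
    - if: cuda_major != "11"
      then: cuda-cudart-dev =${{ cuda_version }}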

Contributor:

(^everything Vyas said, and...)

CUDA 12 builds should not need the nvidia channel for anything. The only CUDA 11 packages we need from the nvidia channel are math libraries, so that we can get the C++ headers that don't ship with conda-forge::cudatoolkit for CUDA 11. I think that only affects RAFT/cuML/cuGraph/cuVS.

We are currently only including the nvidia channel for consistency with CUDA 11, and will drop it from our build commands once CUDA 11 is no longer supported. Perhaps we should even consider dropping nvidia now, for the packages that don't need its math library headers.

Contributor Author (gforsyth):

What would we prefer?

  • drop -c nvidia from the rapids-rattler-channel-string and then add it piecemeal to raft, cuml, cugraph, and cuvs?
  • wait until we drop CUDA 11?

Contributor (@bdice, Feb 21, 2025):

I think rapids-rattler-channel-string should check the value of RAPIDS_CUDA_VERSION and only add --channel nvidia on CUDA 11. We can implement that as part of the rattler-build migration, no need to wait until we drop CUDA 11.

Or, we can just drop --channel nvidia from rapids-rattler-channel-string entirely and add it to the repos where it is needed. I am okay with either approach.
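
A rough sketch of the RAPIDS_CUDA_VERSION check, assuming rapids-rattler-channel-string is a shell helper that assembles the channel flags (the base channel list here is illustrative, not the helper's actual contents):

  # Base channels used for every build (illustrative).
  RATTLER_CHANNELS=("--channel" "rapidsai" "--channel" "conda-forge")

  # Only CUDA 11 builds still need packages from the nvidia channel.
  RAPIDS_CUDA_MAJOR="${RAPIDS_CUDA_VERSION%%.*}"
  if [[ "${RAPIDS_CUDA_MAJOR}" == "11" ]]; then
    RATTLER_CHANNELS+=("--channel" "nvidia")
  fi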

Contributor (@bdice, Feb 21, 2025):

The version of this package should be constrained by cuda-version. We should never pin CTK package versions in our recipes (for CUDA 12 only -- CUDA 11 packages from nvidia require exact pinnings like these). CUDA 12 packages should always float to whatever is allowed by cuda-version.

Suggested change
then: cuda-cudart-dev =${{ cuda_version }}
then: cuda-cudart-dev

Comment on lines 128 to 131
ignore_run_exports:
  by_name:
    - cuda-version
Contributor Author (gforsyth):

Without this explicit cuda-version ignore, the libucxx-examples output was getting a runtime dependency of cuda-version >=12.8,<13, inherited from the compiler('cuda'):

 │ │ Run dependencies:
 │ │ ╭──────────────────┬────────────────────────────────────────────────────────╮
 │ │ │ Name             ┆ Spec                                                   │
 │ │ ╞══════════════════╪════════════════════════════════════════════════════════╡
 │ │ │ Run dependencies ┆                                                        │
 │ │ │ libucxx          ┆ ==0.43.0a37 cuda12_250218_2a6940ca (PS)                │
 │ │ │ libgcc           ┆ >=13 (RE of [cache-build: gcc_linux-64])               │
 │ │ │ __glibc          ┆ >=2.28,<3.0.a0 (RE of [cache-build: sysroot_linux-64]) │
 │ │ │ cuda-version     ┆ >=12.8,<13 (RE of [cache-build: cuda-nvcc_linux-64])   │    <---- here
 │ │ │ libstdcxx        ┆ >=13 (RE of [cache-build: gxx_linux-64])               │
 │ │ │ libgcc           ┆ >=13 (RE of [cache-build: gxx_linux-64])               │
 │ │ │ ucx              ┆ >=1.15.0,<1.16.0a0 (RE of [cache-host: ucx])           │
 │ │ │ python_abi       ┆ 3.12.* *_cp312 (RE of [cache-host: python])            │
 │ │ │ librmm           ┆ >=25.4.0a26,<25.5.0a0 (RE of [cache-host: librmm])     │
 │ │ ╰──────────────────┴────────────────────────────────────────────────────────╯

This meant that in the test_cpp job, which runs on CUDA 12.0, the just-built examples package was invalid, leading to the failures.

This looks to me like outputs in rattler-build are inheriting runtime dependencies from the build cache, which is usually fine but in this case is not what we want.

Contributor:

We had a recent conversation about ARM / Tegra CUDA support in conda-forge. We may be adding new run-exports to the CUDA compiler packages on ARM.

Currently we use ignore_run_exports_from with the CUDA compiler package. Instead, we should use ignore_run_exports: on just cuda-version. So what you're saying here is right, and we should do this universally for all packages that support CUDA compatibility within the major version (that applies to all of RAPIDS).
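
In recipe terms, the change being described is roughly the following (a sketch that mirrors the ignore_run_exports block already in this diff):

  requirements:
    ignore_run_exports:
      # Ignore only the cuda-version run-export (coming from the CUDA compiler), rather
      # than ignoring every run-export of the compiler package via ignore_run_exports_from,
      # so packages can float within the CUDA major version.
      by_name:
        - cuda-version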


sccache --show-adv-stats

# remove build_cache directory
rm -rf "$RAPIDS_CONDA_BLD_OUTPUT_DIR"/build_cache
Contributor:

This seems like something that should be an ignore pattern in our upload script instead of something that every build script has to include.

Contributor Author (gforsyth):

Agreed. We opened an issue upstream, too, to add a CLI option to clear the build-cache post-build: prefix-dev/rattler-build#1424
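
Until such an option exists, an upload-side filter could look something like this (illustrative only; the real upload tooling and package glob may differ):

  # Collect built packages while skipping rattler-build's build_cache directory.
  find "${RAPIDS_CONDA_BLD_OUTPUT_DIR}" -name build_cache -prune -o -name '*.conda' -print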


name: distributed-ucxx
version: ${{ version }}
build:
  string: py${{ python }}_${{ date_string }}_${{ head_rev }}
Contributor:

In the general case (i.e. for the rest of RAPIDS) I'd consider rapidsai/build-planning#43 out of scope for the rattler-build work, since it requires reworking our shared workflows. For ucxx specifically, though, we should take at least a quick look at solving it here. The reason I suggest that is that distributed-ucxx is (AFAIK) a pure Python package built within the same recipe as a non-pure-Python package, whereas everywhere else those packages are separate conda builds. It should not be constrained by the CPython ABI below, for example. If the conclusion winds up being that we need to split this package out to properly support building a pure Python wheel, that is fine too, but I think it's worth thinking about it at least briefly right now.

Contributor Author (gforsyth):

Building it as noarch now, and that seems to be working.
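
For reference, the noarch form of this output looks roughly like this (sketch; the build string handling is omitted):

  name: distributed-ucxx
  version: ${{ version }}
  build:
    # pure Python output, not tied to a specific CPython ABI
    noarch: python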

@gforsyth removed the request for review from KyleFromNVIDIA February 20, 2025 18:40
cuda_version: ${{ (env.get("RAPIDS_CUDA_VERSION") | split("."))[:2] | join(".") }}
cuda_major: ${{ (env.get("RAPIDS_CUDA_VERSION") | split("."))[0] }}
date_string: ${{ env.get("RAPIDS_DATE_STRING") }}
py_version: ${{ env.get("RAPIDS_PY_VERSION") | version_to_buildstring }}
Contributor:

We probably don't want to use RAPIDS_PY_VERSION for this library. Unlike most of RAPIDS, we build ucxx for different Python versions using recipe variants. Previously we used {{ python | replace(".", "") }} to get the Python version string like 312.

I would remove this or replace it with something like {{ python | version_to_buildstring }}.

Suggested change
py_version: ${{ env.get("RAPIDS_PY_VERSION") | version_to_buildstring }}
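
If the variable is kept rather than removed, the replacement bdice describes would be roughly:

  context:
    # derive the build string from the python variant rather than RAPIDS_PY_VERSION
    py_version: ${{ python | version_to_buildstring }}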


build:
  script:
    content:
Contributor:

Interesting. I wouldn't have expected this to take a list of lines, just a string / text block. It appears the spec supports both. However, I prefer to drop the - at the beginning of each line if possible. Can we use something like this instead?

content: |
  export ucxx_ROOT="$(realpath ./cpp/build)"
  ./build.sh -n -v libucxx libucxx_python benchmarks examples tests --cmake-args=\"-DCMAKE_INSTALL_LIBDIR=lib\"


Comment on lines +177 to +179
- ${{ compiler("c") }}
- ${{ compiler("cxx") }}
- ${{ compiler("cuda") }}
Contributor:

Suggested change
- ${{ compiler("c") }}
- ${{ compiler("cxx") }}
- ${{ compiler("cuda") }}

- if: cuda_major == "11"
  then: cudatoolkit
  else: cuda-cudart-dev
- libucxx
Contributor:

Pin subpackage?

Suggested change
- libucxx
- ${{ pin_subpackage("libucxx", exact=True) }}

Comment on lines +213 to +214
missing_dso_allowlist:
  - "lib/libpython3.12.so*"
Contributor:

IIRC, we expect this to link to Python, so we shouldn't ignore this. Could this be indicating an issue with our use of variant builds for each Python version?

- cuda-version =${{ cuda_version }}
- pynvml >=12.0.0,<13.0.0a0
run:
  - python * *_cpython
Contributor:

I think this constraint (*_cpython) is unnecessary. Having python in host exports a python_abi constraint that I think covers this. I see we're ignoring that below but that might be a mistake.
