[BUG]: Strange interaction between cuDF spilling and rmm.statistics #1792

Open
TomAugspurger opened this issue Jan 24, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@TomAugspurger

TomAugspurger commented Jan 24, 2025

Describe the bug

There's a strange interaction between rmm.statistics and cuDF's spill=True option: the first time a cudf.DataFrame is created with this option set, the first rmm.statistics.push_statistics(); rmm.statistics.pop_statistics() pair returns None from pop_statistics().

Steps/Code to reproduce bug

Here's a reproducer that uses cudf:

import rmm
import cudf


def f():
    cudf.set_option("spill", True)
    rmm.statistics.enable_statistics()
    cudf.DataFrame()
    rmm.statistics.push_statistics()
    print(rmm.statistics.pop_statistics())


def main():
    f()
    f()


if __name__ == "__main__":
    main()

The output is

CUDA_VISIBLE_DEVICES=1 python bug4.py
None
Statistics(current_bytes=0, current_count=0, peak_bytes=0, peak_count=0, total_bytes=0, total_count=0)

Expected behavior
pop_statistics() should return a Statistics object both times.

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Method of RMM install: conda
    • If method of install is [Docker], provide docker pull & docker run commands used
  • Please run and attach the output of the rmm/print_env.sh script to gather relevant environment details
***git***

***OS Information***
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2023-03-27-13-31-04"
DGX_SWBUILD_VERSION="5.5.0"
DGX_COMMIT_ID="b2e06e0"
DGX_PLATFORM="DGX Server for DGX-1"
DGX_SERIAL_NUMBER="QTFCOU8220028"

DGX_OTA_VERSION="5.6.0"
DGX_OTA_DATE="Wed 22 May 2024 01:41:19 PM PDT"
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
Linux dgx12 5.4.0-182-generic #202-Ubuntu SMP Fri Apr 26 12:29:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

***GPU Information***
Fri Jan 24 06:53:50 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:06:00.0 Off |                    0 |
| N/A   34C    P0              56W / 300W |   2327MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:07:00.0 Off |                    0 |
| N/A   30C    P0              42W / 300W |      5MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:0A:00.0 Off |                    0 |
| N/A   29C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:0B:00.0 Off |                    0 |
| N/A   29C    P0              41W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           On  | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           On  | 00000000:86:00.0 Off |                    0 |
| N/A   32C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           On  | 00000000:89:00.0 Off |                    0 |
| N/A   34C    P0              43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           On  | 00000000:8A:00.0 Off |                    0 |
| N/A   29C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3631613      C   ...rger/envs/kvikio-env/bin/python3.12     2224MiB |
+---------------------------------------------------------------------------------------+

***CPU***
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             80
On-line CPU(s) list:                0-79
Thread(s) per core:                 2
Core(s) per socket:                 20
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              79
Model name:                         Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping:                           1
CPU MHz:                            3046.674
CPU max MHz:                        3600.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           4390.15
Virtualization:                     VT-x
L1d cache:                          1.3 MiB
L1i cache:                          1.3 MiB
L2 cache:                           10 MiB
L3 cache:                           100 MiB
NUMA node0 CPU(s):                  0-19,40-59
NUMA node1 CPU(s):                  20-39,60-79
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: Split huge pages
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT vulnerable
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

***CMake***
/usr/bin/cmake
cmake version 3.16.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).

***g++***
/usr/bin/g++
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


***nvcc***
/raid/toaugspurger/envs/tmp-cudf/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

***Python***
/raid/toaugspurger/envs/tmp-cudf/bin/python
Python 3.12.8

***Environment Variables***
PATH                            : /home/nfs/toaugspurger/.local/bin:/raid/toaugspurger/envs/tmp-cudf/bin:/usr/local/cuda/bin:/opt/bin:/home/nfs/toaugspurger/.local/bin:/home/nfs/toaugspurger/.local/bin:/raid/toaugspurger/envs/nemo-curator/bin:/home/nfs/toaugspurger/miniforge3/condabin:/usr/local/cuda/bin:/opt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/local/cuda/bin
LD_LIBRARY_PATH                 : 
NUMBAPRO_NVVM                   : 
NUMBAPRO_LIBDEVICE              : 
CONDA_PREFIX                    : /raid/toaugspurger/envs/tmp-cudf
PYTHON_PATH                     : 

***conda packages***
/home/nfs/toaugspurger/miniforge3/condabin/conda
# packages in environment at /raid/toaugspurger/envs/tmp-cudf:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
asttokens                 3.0.0              pyhd8ed1ab_1    conda-forge
attr                      2.5.1                h166bdaf_1    conda-forge
aws-c-auth                0.8.1                h205f482_0    conda-forge
aws-c-cal                 0.8.1                h1a47875_3    conda-forge
aws-c-common              0.10.6               hb9d3cd8_0    conda-forge
aws-c-compression         0.3.0                h4e1184b_5    conda-forge
aws-c-event-stream        0.5.0               h7959bf6_11    conda-forge
aws-c-http                0.9.2                hefd7a92_4    conda-forge
aws-c-io                  0.15.3               h173a860_6    conda-forge
aws-c-mqtt                0.11.0              h11f4f37_12    conda-forge
aws-c-s3                  0.7.9                he1b24dc_1    conda-forge
aws-c-sdkutils            0.2.2                h4e1184b_0    conda-forge
aws-checksums             0.2.2                h4e1184b_4    conda-forge
aws-crt-cpp               0.29.9               he0e7f3f_2    conda-forge
aws-sdk-cpp               1.11.489             h4d475cb_0    conda-forge
azure-core-cpp            1.14.0               h5cfcd09_0    conda-forge
azure-identity-cpp        1.10.0               h113e628_0    conda-forge
azure-storage-blobs-cpp   12.13.0              h3cf044e_1    conda-forge
azure-storage-common-cpp  12.8.0               h736e048_1    conda-forge
azure-storage-files-datalake-cpp 12.12.0              ha633028_1    conda-forge
bzip2                     1.0.8                h4bc722e_7    conda-forge
c-ares                    1.34.4               hb9d3cd8_0    conda-forge
ca-certificates           2024.12.14           hbcca054_0    conda-forge
cachetools                5.5.1              pyhd8ed1ab_0    conda-forge
cuda-cccl_linux-64        12.6.77              ha770c72_0    conda-forge
cuda-crt-dev_linux-64     12.6.85              ha770c72_0    conda-forge
cuda-crt-tools            12.6.85              ha770c72_0    conda-forge
cuda-cudart               12.6.77              h5888daf_0    conda-forge
cuda-cudart-dev           12.6.77              h5888daf_0    conda-forge
cuda-cudart-dev_linux-64  12.6.77              h3f2d84a_0    conda-forge
cuda-cudart-static        12.6.77              h5888daf_0    conda-forge
cuda-cudart-static_linux-64 12.6.77              h3f2d84a_0    conda-forge
cuda-cudart_linux-64      12.6.77              h3f2d84a_0    conda-forge
cuda-nvcc-dev_linux-64    12.6.85              he91c749_0    conda-forge
cuda-nvcc-impl            12.6.85              h85509e4_0    conda-forge
cuda-nvcc-tools           12.6.85              he02047a_0    conda-forge
cuda-nvrtc                12.6.85              hbd13f7d_0    conda-forge
cuda-nvvm-dev_linux-64    12.6.85              ha770c72_0    conda-forge
cuda-nvvm-impl            12.6.85              he02047a_0    conda-forge
cuda-nvvm-tools           12.6.85              he02047a_0    conda-forge
cuda-python               12.6.2          py312he9d8a76_2    conda-forge
cuda-version              12.6                 h7480c83_3    conda-forge
cudf                      25.02.00a296    cuda12_py312_250124_g0d2b29d3e6_296    rapidsai-nightly
cupy                      13.3.0          py312h7d319b9_2    conda-forge
cupy-core                 13.3.0          py312h1acd1a8_2    conda-forge
decorator                 5.1.1              pyhd8ed1ab_1    conda-forge
dlpack                    0.8                  h59595ed_3    conda-forge
exceptiongroup            1.2.2              pyhd8ed1ab_1    conda-forge
executing                 2.1.0              pyhd8ed1ab_1    conda-forge
fastrlock                 0.8.3           py312h6edf5ed_1    conda-forge
fmt                       11.0.2               h434a139_0    conda-forge
fsspec                    2024.12.0          pyhd8ed1ab_0    conda-forge
gflags                    2.2.2             h5888daf_1005    conda-forge
glog                      0.7.1                hbabe93e_0    conda-forge
ipython                   8.31.0             pyh707e725_0    conda-forge
jedi                      0.19.2             pyhd8ed1ab_1    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.21.3               h659f571_0    conda-forge
ld_impl_linux-64          2.43                 h712a8e2_2    conda-forge
libabseil                 20240722.0      cxx17_hbbce691_4    conda-forge
libarrow                  17.0.0          h461ed7b_45_cpu    conda-forge
libarrow-acero            17.0.0          hcb10f89_45_cpu    conda-forge
libarrow-dataset          17.0.0          hcb10f89_45_cpu    conda-forge
libarrow-substrait        17.0.0          h08228c5_45_cpu    conda-forge
libblas                   3.9.0           26_linux64_openblas    conda-forge
libbrotlicommon           1.1.0                hb9d3cd8_2    conda-forge
libbrotlidec              1.1.0                hb9d3cd8_2    conda-forge
libbrotlienc              1.1.0                hb9d3cd8_2    conda-forge
libcap                    2.71                 h39aace5_0    conda-forge
libcblas                  3.9.0           26_linux64_openblas    conda-forge
libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
libcublas                 12.6.4.1             hbd13f7d_0    conda-forge
libcudf                   25.02.00a296    cuda12_250124_g0d2b29d3e6_296    rapidsai-nightly
libcufft                  11.3.0.4             hbd13f7d_0    conda-forge
libcufile                 1.11.1.6             h12f29b5_4    conda-forge
libcufile-dev             1.11.1.6             h5888daf_4    conda-forge
libcurand                 10.3.7.77            hbd13f7d_0    conda-forge
libcurl                   8.11.1               h332b0f4_0    conda-forge
libcusolver               11.7.1.2             hbd13f7d_0    conda-forge
libcusparse               12.5.4.2             hbd13f7d_0    conda-forge
libedit                   3.1.20240808    pl5321h7949ede_0    conda-forge
libev                     4.33                 hd590300_2    conda-forge
libevent                  2.1.12               hf998b51_1    conda-forge
libexpat                  2.6.4                h5888daf_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc                    14.2.0               h77fa898_1    conda-forge
libgcc-ng                 14.2.0               h69a702a_1    conda-forge
libgcrypt-lib             1.11.0               hb9d3cd8_2    conda-forge
libgfortran               14.2.0               h69a702a_1    conda-forge
libgfortran5              14.2.0               hd5240d6_1    conda-forge
libgomp                   14.2.0               h77fa898_1    conda-forge
libgoogle-cloud           2.34.0               h2b5623c_0    conda-forge
libgoogle-cloud-storage   2.34.0               h0121fbd_0    conda-forge
libgpg-error              1.51                 hbd13f7d_1    conda-forge
libgrpc                   1.67.1               h25350d4_1    conda-forge
libiconv                  1.17                 hd590300_2    conda-forge
libkvikio                 25.02.00a       cuda12_250124_g716a99f_30    rapidsai-nightly
liblapack                 3.9.0           26_linux64_openblas    conda-forge
libllvm14                 14.0.6               hcd5def8_4    conda-forge
liblzma                   5.6.3                hb9d3cd8_1    conda-forge
libnghttp2                1.64.0               h161d5f1_0    conda-forge
libnl                     3.11.0               hb9d3cd8_0    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libnvjitlink              12.6.85              hbd13f7d_0    conda-forge
libopenblas               0.3.28          pthreads_h94d23a6_1    conda-forge
libparquet                17.0.0          h081d1f1_45_cpu    conda-forge
libprotobuf               5.28.3               h6128344_1    conda-forge
libre2-11                 2024.07.02           hbbce691_2    conda-forge
librmm                    25.02.00a40     cuda12_250124_g67fd94d0_40    rapidsai-nightly
libsqlite                 3.48.0               hee588c1_1    conda-forge
libssh2                   1.11.1               hf672d98_0    conda-forge
libstdcxx                 14.2.0               hc0a3c3a_1    conda-forge
libstdcxx-ng              14.2.0               h4852527_1    conda-forge
libsystemd0               257.2                h3dc2cb9_0    conda-forge
libthrift                 0.21.0               h0e7cc3e_0    conda-forge
libudev1                  257.2                h9a4d06a_0    conda-forge
libutf8proc               2.10.0               h4c51ac1_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libxml2                   2.13.5               h0d44e9d_1    conda-forge
libzlib                   1.3.1                hb9d3cd8_2    conda-forge
llvmlite                  0.43.0          py312h374181b_1    conda-forge
lz4-c                     1.10.0               h5888daf_1    conda-forge
markdown-it-py            3.0.0              pyhd8ed1ab_1    conda-forge
matplotlib-inline         0.1.7              pyhd8ed1ab_1    conda-forge
mdurl                     0.1.2              pyhd8ed1ab_1    conda-forge
ncurses                   6.5                  h2d0b736_2    conda-forge
numba                     0.60.0          py312h83e6fd3_0    conda-forge
numba-cuda                0.2.0              pyh267e887_0    conda-forge
numpy                     2.0.2           py312h58c1407_1    conda-forge
nvcomp                    4.1.0.6              h66a0f98_0    conda-forge
nvtx                      0.2.10          py312h66e93f0_2    conda-forge
openssl                   3.4.0                h7b32b05_1    conda-forge
orc                       2.0.3                h12ee42a_2    conda-forge
packaging                 24.2               pyhd8ed1ab_2    conda-forge
pandas                    2.2.3           py312hf9745cd_1    conda-forge
parso                     0.8.4              pyhd8ed1ab_1    conda-forge
pexpect                   4.9.0              pyhd8ed1ab_1    conda-forge
pickleshare               0.7.5           pyhd8ed1ab_1004    conda-forge
pip                       24.3.1             pyh8b19718_2    conda-forge
prompt-toolkit            3.0.50             pyha770c72_0    conda-forge
ptyprocess                0.7.0              pyhd8ed1ab_1    conda-forge
pure_eval                 0.2.3              pyhd8ed1ab_1    conda-forge
pyarrow                   17.0.0          py312h9cebb41_2    conda-forge
pyarrow-core              17.0.0          py312h01725c0_2_cpu    conda-forge
pygments                  2.19.1             pyhd8ed1ab_0    conda-forge
pylibcudf                 25.02.00a296    cuda12_py312_250124_g0d2b29d3e6_296    rapidsai-nightly
pynvjitlink               0.4.0           py312h9ee8e57_0    rapidsai-nightly
python                    3.12.8          h9e4cc4f_1_cpython    conda-forge
python-dateutil           2.9.0.post0        pyhff2d567_1    conda-forge
python-tzdata             2025.1             pyhd8ed1ab_0    conda-forge
python_abi                3.12                    5_cp312    conda-forge
pytz                      2024.1             pyhd8ed1ab_0    conda-forge
rdma-core                 55.0                 h5888daf_0    conda-forge
re2                       2024.07.02           h9925aae_2    conda-forge
readline                  8.2                  h8228510_1    conda-forge
rich                      13.9.4             pyhd8ed1ab_1    conda-forge
rmm                       25.02.00a40     cuda12_py312_250124_g67fd94d0_40    rapidsai-nightly
s2n                       1.5.11               h072c03f_0    conda-forge
setuptools                75.8.0             pyhff2d567_0    conda-forge
six                       1.17.0             pyhd8ed1ab_0    conda-forge
snappy                    1.2.1                h8bd8927_1    conda-forge
spdlog                    1.14.1               hed91bc2_1    conda-forge
stack_data                0.6.3              pyhd8ed1ab_1    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
traitlets                 5.14.3             pyhd8ed1ab_1    conda-forge
typing_extensions         4.12.2             pyha770c72_1    conda-forge
tzdata                    2025a                h78e105d_0    conda-forge
wcwidth                   0.2.13             pyhd8ed1ab_1    conda-forge
wheel                     0.45.1             pyhd8ed1ab_1    conda-forge
zstd                      1.5.6                ha6fb4c9_0    conda-forge

Additional context

The position of the cudf.DataFrame() call matters. The DataFrame needs to be created after enable_statistics() and before pop_statistics() to observe the reported behavior.

@TomAugspurger TomAugspurger added ? - Needs Triage Need team to review and classify bug Something isn't working labels Jan 24, 2025
@TomAugspurger TomAugspurger changed the title from "[BUG]" to "[BUG]: Strange interaction between cuDF spilling and rmm.statistics" Jan 24, 2025
@bdice
Contributor

bdice commented Jan 27, 2025

Thanks for the issue! Here are some notes on what is going on.

Enabling statistics wraps the current MR in an adaptor that tracks statistics (see implementation). Similarly, enabling spilling in cuDF will override the current MR with a spilling adaptor (see implementation). The spilling implementation uses a FailureCallbackResourceAdaptor. If a new allocation fails, it does some magic to spill existing allocations and then retries the allocation. As a result, we have two features (statistics and spilling) that compete to wrap a global resource, without explicitly stating that they are controlling that global resource.

This API design is somewhat flawed in my opinion. We should encourage a pattern that makes it clear that the global memory resource is changed by these features (note that the spilling option documents this behavior, but the downstream effects are not clear, as evidenced by this issue).

There may be a way to re-order these commands in the code snippet above to make it work better. RMM is happy to stack MR adaptors and can support this case in the abstract, but it will be difficult to make the existing APIs act in a more predictable way.

For the longer term fix, I would propose that we deprecate the existing statistics APIs (they pretend to act "globally" but are susceptible to modifications of the underlying memory resource). Instead, we should encourage users to explicitly control the memory resource like:

import rmm.mr

current_mr = rmm.mr.get_current_device_resource()
stats_mr = rmm.mr.StatisticsResourceAdaptor(current_mr)
rmm.mr.set_current_device_resource(stats_mr)

# do allocations

print(stats_mr.allocation_counts)

Likewise for spilling, we should encourage direct control of the memory resource. We can make a "nicer" name than FailureCallbackResourceAdaptor in cuDF so that this pattern could be implemented like:

import cudf
import rmm.mr

current_mr = rmm.mr.get_current_device_resource()
spilling_mr = cudf.SpillingResourceAdaptor(current_mr)  # SpillingResourceAdaptor does not exist yet
stats_mr = rmm.mr.StatisticsResourceAdaptor(spilling_mr)
rmm.mr.set_current_device_resource(stats_mr)

That way it tracks stats outside of the spilling framework (I think that is the desired behavior). If we implemented a cudf.SpillingResourceAdaptor, I would also advocate that we remove the spilling options and instead encourage direct control of the memory resource by wrapping it with the spilling adaptor.
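The stacking order above can be illustrated with a small pure-Python sketch. It is not RMM or cuDF code: ToyPool, ToyFailureCallbackAdaptor, ToyStatsAdaptor, and spill_one are hypothetical names, and the retry-on-callback behavior is a loose model of the failure-callback idea described earlier in this thread. The sketch only shows that, with the statistics adaptor outermost, one user-level request is counted once even when a spill and retry happen underneath.

```python
# Toy model only: none of these names are RMM or cuDF APIs.


class ToyPool:
    """Fixed-capacity base resource that raises MemoryError when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.attempts = 0  # every allocation attempt, including failed ones

    def allocate(self, nbytes):
        self.attempts += 1
        if self.used + nbytes > self.capacity:
            raise MemoryError(f"cannot allocate {nbytes} bytes")
        self.used += nbytes

    def deallocate(self, nbytes):
        self.used -= nbytes


class ToyFailureCallbackAdaptor:
    """On failure, call callback(nbytes) and retry while it returns True."""

    def __init__(self, upstream, callback):
        self.upstream = upstream
        self.callback = callback

    def allocate(self, nbytes):
        while True:
            try:
                return self.upstream.allocate(nbytes)
            except MemoryError:
                if not self.callback(nbytes):
                    raise

    def deallocate(self, nbytes):
        self.upstream.deallocate(nbytes)


class ToyStatsAdaptor:
    """Counts successful allocations and live bytes seen at this layer."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.total_count = 0
        self.current_bytes = 0

    def allocate(self, nbytes):
        self.upstream.allocate(nbytes)
        self.total_count += 1
        self.current_bytes += nbytes

    def deallocate(self, nbytes):
        self.upstream.deallocate(nbytes)
        self.current_bytes -= nbytes


spillable = []  # sizes of live allocations that may be spilled


def spill_one(nbytes):
    """Free one spillable allocation; return False when nothing is left."""
    if not spillable:
        return False
    stats_mr.deallocate(spillable.pop())
    return True


pool = ToyPool(capacity=100)
spilling_mr = ToyFailureCallbackAdaptor(pool, spill_one)
stats_mr = ToyStatsAdaptor(spilling_mr)  # statistics outermost

stats_mr.allocate(60)
spillable.append(60)
stats_mr.allocate(60)  # fails once underneath, spills, then succeeds

print(pool.attempts, stats_mr.total_count, stats_mr.current_bytes)
```

The pool sees three attempts (one of them failed), while the outer statistics layer records two allocations and 60 live bytes after the spill.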

@TomAugspurger @madsbk @vyasr I would be eager to hear your thoughts.

@bdice bdice removed the ? - Needs Triage Need team to review and classify label Jan 27, 2025
@madsbk
Member

madsbk commented Jan 29, 2025

I think it is a good idea to make enabling statistics in RMM more explicit, as you suggest, and maybe add an enable-statistics argument to reinitialize().

But I don't think it helps cudf. If we want cudf.set_option("spill", True) to just work, cudf would still have to modify the current_device_resource implicitly.

@vyasr
Contributor

vyasr commented Feb 7, 2025

I don't know how many people are using spilling right now. Most of the users I've heard about are internal to NVIDIA and doing more advanced things, and I suspect they would be OK with spilling not being quite so simple as an option. In other words, I would be supportive of moving to a model where we explicitly create the MR for spilling as well. The problem is that cudf's spilling functionality is not as straightforward as just changing the MR. It also involves changing some of the core data structures that cudf uses (the buffers) to be "spilling-aware". So while enabling statistics can largely be done by simply changing the MR, we would need to come up with a different model for spilling that is more explicit than the current one with respect to the memory resource but still allows for the more extensive implicit internal changes that need to happen.

Projects
Status: To-do
Development

No branches or pull requests

4 participants