
Segmentation fault when using RemoteFile.pread on multiple remote files #593

TomAugspurger opened this issue Jan 22, 2025 · 17 comments

@TomAugspurger (Contributor)

The following script, which uses RemoteFile.open_s3 followed by a pread on multiple S3 objects, segfaults.

import cupy as cp
import kvikio

BUCKET = "kvikiobench-56481"

def main():

    paths = [
        f"data/parquet-large/timeseries.parquet/part.{i}.parquet" for i in range(4)
    ]

    def pread_one(path):
        rf = kvikio.RemoteFile.open_s3(BUCKET, path)
        buf = cp.empty(rf.nbytes() // 8)  # nbytes // 8 float64 elements, i.e. the whole object
        return rf.pread(buf)

    not_done = {pread_one(path) for path in paths}

    # busy-wait until all reads report done
    while not_done:
        not_done = {
            x for x in not_done if not x.done()
        }

    print("Parallel done")


if __name__ == "__main__":
    main()

I'm not a gdb expert, but here's some output indicating the failure happens in kvikio::RemoteHandle::read:

$ gdb python
[...]
Reading symbols from python...
(gdb) run bug.py
Starting program: /home/ubuntu/miniforge3/envs/kvikio-env/bin/python bug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4400640 (LWP 8711)]
[New Thread 0x7ffff3a00640 (LWP 8712)]
[New Thread 0x7fffe3000640 (LWP 8713)]
[Detaching after vfork from child process 8714]
[New Thread 0x7fffd3c00640 (LWP 8715)]
[Thread 0x7fffd3c00640 (LWP 8715) exited]
[New Thread 0x7fffd3c00640 (LWP 8716)]
[New Thread 0x7fffc5e00640 (LWP 8720)]
[New Thread 0x7fffc1e00640 (LWP 8721)]
[New Thread 0x7fffc1400640 (LWP 8722)]

Thread 8 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc1e00640 (LWP 8721)]
0x00007fffd428bf97 in kvikio::RemoteHandle::read(void*, unsigned long, unsigned long) () from /home/ubuntu/miniforge3/envs/kvikio-env/lib/python3.12/site-packages/kvikio/_lib/../../../../libkvikio.so

As an aside, would this be the recommended way to read multiple objects in parallel, without having to use Python threads?

@wence- (Contributor) commented Jan 23, 2025

Ah, this is a nasty gotcha. pread doesn't hold a reference to buf, so you allocate buf in your pread_one function but it is dropped by the time we try to read into the actual pointer.

@TomAugspurger (Contributor, Author) commented Jan 23, 2025

Thanks. Would you expect holding a reference to buf in Python to prevent a segfault? Here's an example where we keep references but still get a segfault (this uses a public S3 bucket, so no credentials are needed).

import cupy as cp
import kvikio


def pread_one(path):
    print("read", path)
    rf = kvikio.RemoteFile.open_http(path)
    buf = cp.empty(rf.nbytes() // 8)
    return (rf.pread(buf), rf, buf)

def main():

    paths = [f"https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/CONUS/netcdf/FORCING/2023/20230101{i:0>2d}00.LDASIN_DOMAIN1" for i in range(12)]
    tuples = [pread_one(path) for path in paths]
    while tuples:
        tuples = [x for x in tuples if not x[0].done()]

    print("Parallel done")


if __name__ == "__main__":
    main()

@wence- (Contributor) commented Jan 23, 2025

Yeah that ought to keep things alive. Can you get a gdb backtrace with kvikio compiled in debug mode?

@TomAugspurger (Contributor, Author) commented Jan 23, 2025

Do you know whether build-all in the devcontainer compiles with debug mode? Or is it obvious from the backtrace what's going on?

edit: I see the build files are at /home/coder/kvikio/cpp/build/conda/cuda-12.5/release, so that's presumably not a debug build.

Thread 84 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fb9b3fff640 (LWP 257518)]
0x00007fb9caec0f9a in kvikio::RemoteHandle::read(void*, unsigned long, unsigned long) () from /home/coder/kvikio/cpp/build/conda/cuda-12.5/release/libkvikio.so
(gdb) bt
#0  0x00007fb9caec0f9a in kvikio::RemoteHandle::read(void*, unsigned long, unsigned long) () from /home/coder/kvikio/cpp/build/conda/cuda-12.5/release/libkvikio.so
#1  0x00007fb9caec1f60 in std::_Function_handler<void (), BS::thread_pool::submit_task<kvikio::detail::submit_task<kvikio::RemoteHandle::pread(void*, unsigned long, unsigned long, unsigned long)::{lambda(void*, unsigned long, unsigned long, unsigned long)#1}, void*>(kvikio::RemoteHandle::pread(void*, unsigned long, unsigned long, unsigned long)::{lambda(void*, unsigned long, unsigned long, unsigned long)#1}, void*, unsigned long, unsigned long, unsigned long)::{lambda()#1}, unsigned long>(kvikio::detail::submit_task<kvikio::RemoteHandle::pread(void*, unsigned long, unsigned long, unsigned long)::{lambda(void*, unsigned long, unsigned long, unsigned long)#1}, void*>(kvikio::RemoteHandle::pread(void*, unsigned long, unsigned long, unsigned long)::{lambda(void*, unsigned long, unsigned long, unsigned long)#1}, void*, unsigned long, unsigned long, unsigned long)::{lambda()#1}&&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/coder/kvikio/cpp/build/conda/cuda-12.5/release/libkvikio.so
#2  0x00007fb9caeb5bd7 in BS::thread_pool::worker(unsigned int, std::function<void ()> const&) () from /home/coder/kvikio/cpp/build/conda/cuda-12.5/release/libkvikio.so
#3  0x00007fbc79689b65 in std::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:104
#4  0x00007fbc7cc24ac3 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fbc7ccb5a04 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

@vyasr (Contributor) commented Jan 24, 2025

Ah, this is a nasty gotcha. pread doesn't hold a reference to buf, so you allocate buf in your pread_one function but it is dropped by the time we try to read into the actual pointer.

@wence- should we change this? This doesn't seem like very friendly design to me. I feel like we want the IOFuture class to maintain a handle on the state of its inputs that can be deleted the first time that done() returns True. It's possible that it would make more sense to preserve the state in the lower level CuFile class, I'm not sure, but it seems like it should be kept alive by something somewhere.

Do you know whether build-all in the devcontainer compiles with debug mode? Or is it obvious from the backtrace what's going on?

No, it does not. By default build-all will build the default CMake configuration of a project, which we typically set to Release (in kvikio that is done here). You can get a debug build by doing build-all -DCMAKE_BUILD_TYPE=Debug (although RelWithDebInfo may also be sufficient for many cases).

Yeah that ought to keep things alive. Can you get a gdb backtrace with kvikio compiled in debug mode?

I just got one, and it looks rather strange. It's failing on this line trying to call setopt on the endpoint. It's not even going into that function, which seems to indicate that maybe the endpoint itself has somehow gone out of scope.

It's getting late for me so I'm signing off for now, but I can take a closer look tomorrow if nothing jumps out at you immediately here. Here's the backtrace though:

#0  0x00007ffde5ab7fed in kvikio::RemoteHandle::read (this=0x555555c7d3c0, buf=0x7ffd94800000, size=4194304, file_offset=8388608) at /home/coder/kvikio/cpp/src/remote_handle.cpp:354
#1  0x00007ffde5ab9400 in operator() (devPtr_offset=<optimized out>, file_offset=<optimized out>, size=<optimized out>, devPtr_base=<optimized out>, __closure=0x5555578cd9e0)
    at /home/coder/kvikio/cpp/src/remote_handle.cpp:400
#2  operator() (__closure=0x5555578cd9e0) at /home/coder/kvikio/cpp/include/kvikio/parallel_operation.hpp:38
#3  operator() (__closure=0x5555578cd9e0) at /home/coder/kvikio/cpp/build/conda/cuda-12.5/debug/_deps/bs_thread_pool-src/include/BS_thread_pool.hpp:622
#4  std::__invoke_impl<void, BS::thread_pool::submit_task<kvikio::detail::submit_task<kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, std::si
ze_t, std::size_t, std::size_t)>, void*>(kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, std::size_t, std::size_t, std::size_t)>, void*, std:
:size_t, std::size_t, std::size_t)::<lambda()> >(kvikio::detail::submit_task<kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, std::size_t, std
::size_t, std::size_t)>, void*>(kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, std::size_t, std::size_t, std::size_t)>, void*, std::size_t,
std::size_t, std::size_t)::<lambda()>&&)::<lambda()>&> (__f=...) at /home/coder/.conda/envs/rapids/lib/gcc/x86_64-conda-linux-gnu/13.3.0/include/c++/bits/invoke.h:61
#5  std::__invoke_r<void, BS::thread_pool::submit_task<kvikio::detail::submit_task<kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, std::size_
t, std::size_t, std::size_t)>, void*>(kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, std::size_t, std::size_t, std::size_t)>, void*, std::si
ze_t, std::size_t, std::size_t)::<lambda()> >(kvikio::detail::submit_task<kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, std::size_t, std::s
ize_t, std::size_t)>, void*>(kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, std::size_t, std::size_t, std::size_t)>, void*, std::size_t, std
::size_t, std::size_t)::<lambda()>&&)::<lambda()>&> (__fn=...) at /home/coder/.conda/envs/rapids/lib/gcc/x86_64-conda-linux-gnu/13.3.0/include/c++/bits/invoke.h:111
#6  std::_Function_handler<void(), BS::thread_pool::submit_task<kvikio::detail::submit_task<kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, s
td::size_t, std::size_t, std::size_t)>, void*>(kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, std::size_t, std::size_t, std::size_t)>, void*
, std::size_t, std::size_t, std::size_t)::<lambda()> >(kvikio::detail::submit_task<kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, std::size_
t, std::size_t, std::size_t)>, void*>(kvikio::RemoteHandle::pread(void*, std::size_t, std::size_t, std::size_t)::<lambda(void*, std::size_t, std::size_t, std::size_t)>, void*, std::si
ze_t, std::size_t, std::size_t)::<lambda()>&&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /home/coder/.conda/envs/rapids/lib/gcc/x86_64-conda-linux-gnu/13.3.0/include/c++/bits/std_function.h:290
#7  0x00007ffde5aad45e in std::function<void()>::operator() (this=0x7ffdb53ffdc0)
    at /home/coder/.conda/envs/rapids/lib/gcc/x86_64-conda-linux-gnu/13.3.0/include/c++/bits/std_function.h:591
#8  BS::thread_pool::worker (this=0x7ffde5adea20 <kvikio::defaults::instance()::_instance>, idx=<optimized out>, init_task=...)
    at /home/coder/kvikio/cpp/build/conda/cuda-12.5/debug/_deps/bs_thread_pool-src/include/BS_thread_pool.hpp:937

@madsbk (Member) commented Jan 24, 2025

Ah, this is a nasty gotcha. pread doesn't hold a reference to buf, so you allocate buf in your pread_one function but it is dropped by the time we try to read into the actual pointer.

@wence- should we change this? This doesn't seem like very friendly design to me. I feel like we want the IOFuture class to maintain a handle on the state of its inputs that can be deleted the first time that done() returns True. It's possible that it would make more sense to preserve the state in the lower level CuFile class, I'm not sure, but it seems like it should be kept alive by something somewhere.

I guess we have three overall design options:

  1. No ownership in either Python or C++. This is more or less what we have now.
  2. Python has ownership, C++ does not. This is more or less how librmm/rmm and libcudf/cudf do it.
  3. Both Python and C++ implement full ownership management.

Generally, I would say that C++ developers expect minimal ownership and Python developers expect full ownership management.

I am leaning towards option 3 to keep the Python API as close as possible to the C++ API. And I don't think the performance overhead of ownership management will become an issue in this (fairly expensive) IO context.

Anyway, no matter which design option we choose, it should be implemented throughout KvikIO (not just for IOFutures).
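
To make option 3 concrete on the C++ side, here is a minimal sketch (OwningIOFuture and pread_owning are hypothetical names for illustration, not KvikIO's actual API): the returned future co-owns the buffer, so the allocation cannot be freed while the read is still in flight.

#include <cstddef>
#include <future>
#include <memory>
#include <vector>

// A future that co-owns the buffer it reads into.
class OwningIOFuture {
 public:
  OwningIOFuture(std::future<std::size_t> fut, std::shared_ptr<std::vector<char>> buf)
    : _fut{std::move(fut)}, _buf{std::move(buf)}
  {
  }

  std::size_t get() { return _fut.get(); }

 private:
  std::future<std::size_t> _fut;
  std::shared_ptr<std::vector<char>> _buf;  // keeps the target buffer alive
};

OwningIOFuture pread_owning(std::size_t nbytes)
{
  auto buf = std::make_shared<std::vector<char>>(nbytes);
  // The task captures its own shared_ptr copy, so the buffer survives even
  // if the caller drops the returned future before the read completes.
  auto fut = std::async(std::launch::async, [buf] {
    // ... perform the actual read into buf->data() here ...
    return buf->size();
  });
  return {std::move(fut), std::move(buf)};
}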

@vyasr (Contributor) commented Jan 24, 2025

I took another quick look at this and it appears that the issue is that the determination of when the operation is done has a bug. The underlying C++ RemoteHandle::pread call is not occurring until after the "Parallel done" print. @madsbk is the condition for determining when the future has completed correct? I'm wondering if we're correctly detecting that the future has completed the underlying task. What happens if we get future_status::deferred in is_future_done? Could the future be evaluating upon program exit, at which point the RemoteFile has already been garbage collected in Python?

@vyasr (Contributor) commented Jan 24, 2025

Clarifying the above: even if you store the rf and buf objects created in the pread_one call into global variables or something else that persists them, you still see a segfault, because the reference counts drop to zero and the objects are cleaned up before the C++ pread call occurs. That indicates to me that the wait loop thinks the work is all done before it actually is.

@TomAugspurger (Contributor, Author)

That indicates to me that the wait loop thinks the work is all done before it actually is.

That seems to be correct. Modifying the script slightly to call future.get() after all of them are "done":

import cupy as cp
import kvikio
import time


def pread_one(path):
    print("read", path)
    rf = kvikio.RemoteFile.open_http(path)
    buf = cp.empty(rf.nbytes() // 8)
    return (rf.pread(buf), rf, buf)

def main():

    paths = [f"https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/CONUS/netcdf/FORCING/2023/20230101{i:0>2d}00.LDASIN_DOMAIN1" for i in range(12)]
    tuples = [pread_one(path) for path in paths]
    while tuples:
        done = []
        not_done = []
        for x in tuples:
            if x[0].done():
                done.append(x)
            else:
                not_done.append(x)
        tuples = not_done

    print("Parallel done")

    for i, (fut, _, _) in enumerate(done):
        print("get", i, time.monotonic())
        fut.get()


if __name__ == "__main__":
    main()

That outputs

...
read https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/CONUS/netcdf/FORCING/2023/202301011000.LDASIN_DOMAIN1
read https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/CONUS/netcdf/FORCING/2023/202301011100.LDASIN_DOMAIN1
Parallel done
get 0 9299799.092376588
get 1 9299805.973565483
get 2 9299812.671844428
...

That shows

  1. No segfault, even when the interpreter exits.
  2. There's ~7 seconds between .get calls, which I would guess means we're still reading data over the network.

@vyasr (Contributor) commented Jan 24, 2025

I probably won't have time to look into this further for another week or so, and I don't know this code all that well so it'll probably take me a couple of hours to really root cause this. If Mads (or Lawrence) gets a chance sooner than that feel free to push, otherwise I'll take a look once some of the higher-priority items for this RAPIDS release are sorted out.

@TomAugspurger (Contributor, Author) commented Jan 25, 2025

Thanks, and no worries! I forgot to mention that this is pretty low priority for me right now as well. I might take a look later if I get a chance, but it's not a blocker.

@kingcrimsontianyu (Contributor)

The std::async call in KvikIO uses the deferred launch policy, so until wait() is called its status is always deferred, i.e. is_future_done would always return true even if the task has not been started yet. I guess for Python this would cause the issues @vyasr described.
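
As a standalone illustration of that behavior (plain std::async, not KvikIO code): a deferred future never reports future_status::ready from wait_for; it stays future_status::deferred until get() or wait() runs the task on the calling thread.

#include <chrono>
#include <future>
#include <iostream>

int main()
{
  auto fut = std::async(std::launch::deferred, [] { return 42; });

  // The task has not run, and never will until get()/wait() is called.
  auto status = fut.wait_for(std::chrono::seconds(0));
  std::cout << (status == std::future_status::deferred ? "deferred" : "other")
            << '\n';  // prints "deferred"

  // Any done() check that treats every status other than
  // future_status::timeout as "complete" wrongly reports true here.
  std::cout << fut.get() << '\n';  // runs the task now, on this thread
}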

I'm proposing PR #596, which would get rid of the segfault from the Python code.

@vyasr (Contributor) commented Jan 26, 2025

Having skimmed the implementation here, I think that we need to take a step back. What is the goal of the current design? In parallel_operation.hpp we are launching tasks on multiple threads using a manual thread pool (rather than doing so automatically using the async launch policy), but on each thread we are launching with deferred. In file_handle.cpp we talk about circumventing the thread pool, but we again launch with deferred. The changes in #596 make done a blocking operation, just as get is in Tom's newer snippet. If we are going to make this a blocking call in any case, what other benefits do we gain from the asynchrony? What is our intent in combining asynchronous operations with multithreaded operations this way? Shouldn't each thread actually be executing in the background (without blocking the main thread) while the main thread continues executing? That seems to me like the intended effect here, but please correct me if I am wrong.

@kingcrimsontianyu (Contributor) commented Jan 26, 2025

I had the same questions when I first started 😄. My understanding of our current design is that we want to eliminate the overhead of thread creation that comes with the std::launch::async policy. Using std::launch::deferred instead ensures that the tasks are executed on the calling thread alone. The tasks here are all "trivial" in the sense that they are (see the sketch after this list):

  • Simply querying the result of "bytes read/written" as in parallel_operation.hpp. The I/O requests submitted to the thread pool are evaluated as soon as idle threads become available, so the results of "bytes read/written" are immediately populated in the shared states of std::vector<std::future<std::size_t>> tasks as soon as the I/O requests finish (these futures are created by the promises from the thread pool).
  • Small I/O requests, as in file_handle.cpp, that are unlikely to become the bottleneck and can justifiably be evaluated lazily.
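
A stand-alone sketch of that current design (plain std::async threads stand in for the pool's promise-backed workers; this is not KvikIO's actual code): the per-chunk futures are fulfilled eagerly by worker threads, and the deferred gather future merely sums results that are typically already available by the time get() runs it.

#include <chrono>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

int main()
{
  // Stand-ins for the per-chunk I/O tasks: they run eagerly on their own
  // threads (as pool workers do), fulfilling their futures as they finish.
  std::vector<std::future<std::size_t>> tasks;
  for (int i = 0; i < 4; ++i) {
    tasks.push_back(std::async(std::launch::async, [i] {
      std::this_thread::sleep_for(std::chrono::milliseconds(10 * i));
      return std::size_t{1024};  // pretend "bytes read" for this chunk
    }));
  }

  // The gather future is deferred: it does no work until get() is called,
  // and by then the chunk results are usually already populated.
  auto gather = std::async(std::launch::deferred, [&tasks] {
    std::size_t total = 0;
    for (auto& t : tasks) { total += t.get(); }
    return total;
  });

  std::size_t total = gather.get();  // cheap if the chunks already finished
  (void)total;
}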

@vyasr (Contributor) commented Jan 27, 2025

OK, upon a second read of the parallel_operation.hpp code I see what we're doing there. The future returned from parallel_io just calls get on each contained future, but the actual I/O tasks have been submitted to the pool in submit_task and are running asynchronously. The downside of the current approach is that the returned future (because it is submitted with launch::deferred, which does not execute until wait or get is called on it) gives us no way to check whether the data is available without blocking the main thread. IMHO the change in #596 isn't quite what we want because it converts done into a blocking operation. I would instead suggest that what we want is to also submit the gather_tasks lambda to the pool and use one of the pool's threads to cycle and check for completion. That preserves the ability to use the returned future in an asynchronous manner by checking for completion in a non-blocking way. WDYT? Alternatively, we could return a custom type instead of a std::future and have that type support querying the completion of all constituent tasks on the main thread without forcing a wait.
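
A rough sketch of that suggestion, with a plain std::async(std::launch::async, ...) thread standing in for a BS::thread_pool submission (parallel_io_sketch and is_done are illustrative names, not KvikIO's API):

#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

// The gather runs on its own thread (in KvikIO this would be a thread-pool
// task), so the future handed back to the caller is a real async future.
std::future<std::size_t> parallel_io_sketch(std::vector<std::future<std::size_t>> tasks)
{
  return std::async(std::launch::async, [tasks = std::move(tasks)]() mutable {
    std::size_t total = 0;
    for (auto& t : tasks) {
      total += t.get();  // blocks this worker thread, not the main thread
    }
    return total;
  });
}

// Non-blocking completion check: an async future's wait_for(0s) returns
// ready or timeout, never deferred, so this does not force evaluation.
bool is_done(std::future<std::size_t> const& fut)
{
  return fut.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}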

In file_handle.cpp, I don't follow your logic for

and can be justifiably evaluated lazily.

It seems like if these are unlikely to become the bottleneck, we should just execute them synchronously. For API consistency I could see returning a future, but it seems like we would want to put the wait call directly here. There is no case where we call these APIs and don't eventually want the result, right? So what's the benefit of doing it lazily if all the deferred work still has to be done eventually and is not going to be parallelized either?

@madsbk (Member) commented Jan 27, 2025

I would instead suggest that what we want is to also submit the gather_tasks lambda to the pool and use one of the pool's threads to cycle and check for completion. That preserves the ability to use the returned future in an asynchronous manner by checking for completion in a non-blocking way. WDYT?

Agree, I think this is the best solution and I don't think that the overhead is an issue now that we have gds_threshold (we didn't have the GDS threshold when IOFuture was implemented). But we should confirm the performance impact before deciding.

@kingcrimsontianyu (Contributor)

Thanks for the well-thought-out design feedback, @vyasr @madsbk. #596 has been updated with an implementation of the non-blocking future query.
