Segmentation fault when using RemoteFile.pread on multiple remote files #593
Ah, this is a nasty gotcha.
Thanks. Would you expect that holding references to the file handle and buffer, as in this script, keeps things alive?

```python
import cupy as cp

import kvikio


def pread_one(path):
    print("read", path)
    rf = kvikio.RemoteFile.open_http(path)
    buf = cp.empty(rf.nbytes() // 8)
    return (rf.pread(buf), rf, buf)


def main():
    paths = [
        f"https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/CONUS/netcdf/FORCING/2023/20230101{i:0>2d}00.LDASIN_DOMAIN1"
        for i in range(12)
    ]
    tuples = [pread_one(path) for path in paths]
    while tuples:
        tuples = [x for x in tuples if not x[0].done()]
    print("Parallel done")


if __name__ == "__main__":
    main()
```
Yeah, that ought to keep things alive. Can you get a gdb backtrace with kvikio compiled in debug mode?
Do you know whether the build-all command produces a debug build? Edit: I see the build files are at …
@wence- should we change this? This doesn't seem like very friendly design to me. I feel like we want the IOFuture class to maintain a handle on the state of its inputs that can be released the first time the operation is observed to complete.
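One way that could look, as a hypothetical Python sketch rather than kvikio's actual API (`OwningFuture`, `_inner`, and `_owned` are invented names): a wrapper future that owns references to its inputs and drops them once the underlying operation reports completion.

```python
class OwningFuture:
    """Hypothetical wrapper: keeps the file handle and buffer alive
    until the wrapped future reports that the operation is done."""

    def __init__(self, inner, *owned):
        self._inner = inner  # the real I/O future
        self._owned = owned  # e.g. (remote_file, buffer)

    def done(self):
        finished = self._inner.done()
        if finished:
            # The read has completed, so it is now safe to release
            # the references that kept the inputs alive.
            self._owned = ()
        return finished

    def get(self):
        result = self._inner.get()
        self._owned = ()
        return result
```

With this shape, callers can drop their own references to the file and buffer immediately after issuing the read, and the wrapper guarantees they outlive the in-flight operation.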
No, it does not. By default, build-all builds the default CMake configuration of a project, which we typically set to Release (in kvikio that is done here). You can get a debug build by doing …
I just got one, and it looks rather strange. It's failing on this line, trying to call setopt on the endpoint. It's not even going into that function, which seems to indicate that maybe the endpoint itself has somehow gone out of scope. It's getting late for me, so I'm signing off for now, but I can take a closer look tomorrow if nothing jumps out at you immediately. Here's the backtrace, though:
I guess we have three overall design options:
Generally, I would say that C++ developers expect minimal ownership and Python developers expect full ownership management. I am leaning towards option 3, to keep the Python API as close as possible to the C++ API. And I don't think the performance overhead of ownership management will become an issue in this (fairly expensive) I/O context. In any case, whichever design option we choose should be implemented throughout KvikIO (not just for IOFutures).
I took another quick look at this, and it appears that the issue is a bug in the determination of when the operation is done. The underlying C++ …
Clarifying the above: even if you keep references to the `rf` and `buf` objects created in `pread_one`, the problem still occurs.
That seems to be correct. Modifying the script slightly to call `get()` on each completed future:

```python
import time

import cupy as cp

import kvikio


def pread_one(path):
    print("read", path)
    rf = kvikio.RemoteFile.open_http(path)
    buf = cp.empty(rf.nbytes() // 8)
    return (rf.pread(buf), rf, buf)


def main():
    paths = [
        f"https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/CONUS/netcdf/FORCING/2023/20230101{i:0>2d}00.LDASIN_DOMAIN1"
        for i in range(12)
    ]
    tuples = [pread_one(path) for path in paths]
    done = []
    while tuples:
        not_done = []
        for x in tuples:
            if x[0].done():
                done.append(x)
            else:
                not_done.append(x)
        tuples = not_done
    print("Parallel done")
    for i, (fut, _, _) in enumerate(done):
        print("get", i, time.monotonic())
        fut.get()


if __name__ == "__main__":
    main()
```

That outputs:
That shows
I probably won't have time to look into this further for another week or so, and I don't know this code all that well, so it'll probably take me a couple of hours to really root-cause this. If Mads (or Lawrence) gets a chance sooner than that, feel free to push; otherwise I'll take a look once some of the higher-priority items for this RAPIDS release are sorted out.
Thanks, and no worries! I forgot to mention that this is pretty low priority for me right now as well. I might take a look later if I get a chance, but it's not a blocker.
I'm proposing PR #596, which would get rid of the segfault from the Python code.
Having skimmed the implementation here, I think that we need to take a step back. What is the goal of the current design? In …
I had the same questions when I first started 😄. My understanding of our current design is that we want to eliminate the overhead of thread creation from the …
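For illustration (an assumption about the rationale, sketched with Python's standard `concurrent.futures` rather than kvikio's internals): a long-lived thread pool creates its worker threads once and reuses them, so each submitted read avoids the per-call cost of spawning a fresh thread.

```python
from concurrent.futures import ThreadPoolExecutor

# Worker threads are created once when the pool starts and then
# reused, so each submitted task pays no thread-creation cost.
pool = ThreadPoolExecutor(max_workers=4)


def read_block(n):
    # Stand-in for a blocking read of block n.
    return n * 2


futures = [pool.submit(read_block, i) for i in range(8)]
results = [f.result() for f in futures]
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
pool.shutdown()
```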
OK, upon second read of the code: in the … it seems like, if these operations are unlikely to become the bottleneck, we should just execute them synchronously. For API consistency I could see returning a future, but it seems like we want to put the …
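That suggestion can be sketched like this (hypothetical names, using Python's standard `Future` rather than kvikio's IOFuture): do the cheap work synchronously, then hand back an already-completed future so callers see the same API shape as the asynchronous paths.

```python
from concurrent.futures import Future


def pread_sync_as_future(read_fn, *args):
    """Run the (cheap) operation synchronously, but still return a
    Future so the call site looks the same as the asynchronous paths."""
    fut = Future()
    try:
        fut.set_result(read_fn(*args))
    except Exception as exc:
        # Surface failures the same way an asynchronous future would.
        fut.set_exception(exc)
    return fut


# The returned future is already complete when we get it back.
f = pread_sync_as_future(len, b"hello")
print(f.done(), f.result())  # True 5
```

Because no thread is involved, there is no completion-tracking state to get wrong, which sidesteps the `done()` bug discussed above.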
Agree, I think this is the best solution, and I don't think that the overhead is an issue now that we have …
The following script, which uses `RemoteFile.open_s3` followed by a `pread` on multiple S3 objects, segfaults.

I'm not a gdb expert, but here's some output indicating the failure happens in `kvikio::RemoteHandle::read`:

As an aside, would this be the recommended way to read multiple objects in parallel, without having to use Python threads?