PyTorch Support #88

Open
ShivanshVij opened this issue Jan 15, 2025 · 2 comments

@ShivanshVij

PyTorch statically compiles the CUDA Runtime API shared library (libcudart.so) which exposes the functions defined in the cuda_runtime.h header.
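
A quick runtime cross-check (a minimal sketch on my part, assuming a Linux system where /proc/self/maps is readable) is to look at which CUDA-related shared objects end up mapped into the Python process; if CUDA works but no libcudart.so.* is mapped, the Runtime API must be linked statically:

import torch

# Force CUDA initialisation so the driver library gets loaded.
torch.cuda.init()

# List every mapped shared object whose path mentions "cuda".
# In a build where the runtime is statically linked you should see
# libcuda.so.1 here, but no libcudart.so.*.
with open("/proc/self/maps") as maps:
    paths = {line.split()[-1] for line in maps if "cuda" in line.lower()}

for path in sorted(paths):
    print(path)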

You can also confirm this more thoroughly by intercepting dlopen with the following Rust shared library code:

use std::marker;
use lazy_static::lazy_static;
use std::os::raw::{c_char, c_int, c_void};

// Thin wrapper that stores a raw dlsym result and lets us treat it as a
// typed function pointer.
pub struct Symbol<T> {
    pointer: *mut c_void,
    pd: marker::PhantomData<T>,
}

unsafe impl<T: Send> Send for Symbol<T> {}
unsafe impl<T: Sync> Sync for Symbol<T> {}

impl<T> std::ops::Deref for Symbol<T> {
    type Target = T;
    fn deref(&self) -> &T {
        unsafe { &*(&self.pointer as *const *mut _ as *const T) }
    }
}

lazy_static! {
    // Resolve the real dlopen via RTLD_NEXT so the interceptor can forward
    // to it after logging.
    static ref DLOPEN: Symbol<unsafe extern "C" fn(*const c_char, c_int) -> *mut c_void> = unsafe {
        std::mem::transmute(libc::dlsym(
            libc::RTLD_NEXT,
            b"dlopen\0".as_ptr() as *const c_char,
        ))
    };
}

#[no_mangle]
pub extern "C" fn dlopen(filename: *const c_char, flag: c_int) -> *mut c_void {
    unsafe {
        if !filename.is_null() {
            match std::ffi::CStr::from_ptr(filename).to_str() {
                Ok(s) => println!("dlopen intercepted: {}", s),
                Err(_) => println!("dlopen filename could not be parsed"),
            }
        } else {
            // dlopen(NULL, ...) is a valid call that returns a handle to the
            // main program, so these log lines are not actual errors.
            println!("dlopen called with invalid filename");
        }
        // Forward to the real dlopen resolved via RTLD_NEXT.
        DLOPEN(filename, flag)
    }
}

The above shared library intercepts calls to dlopen, which PyTorch should go through whenever it dynamically loads a shared library.

When we run this with LD_PRELOAD=dlopen_interceptor.so python3 -c "import torch; print(torch.cuda.is_available())", we get the following logs:

dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_ctypes.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_struct.cpython-313-x86_64-linux-gnu.so
dlopen called with invalid filename
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_opcode.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/math.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_socket.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/zlib.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_bz2.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_lzma.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_bisect.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_random.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_datetime.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/select.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/array.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/fcntl.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_posixsubprocess.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/site-packages/torch/lib/libtorch_global_deps.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/site-packages/torch/_C.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: libcuda.so.1
dlopen intercepted: libcuda.so.1
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/site-packages/numpy/_core/_multiarray_umath.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_contextvars.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_pickle.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/site-packages/numpy/linalg/_umath_linalg.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_heapq.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/grp.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/mmap.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_json.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/binascii.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_queue.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/readline.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_hashlib.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_blake2.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/cmath.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_uuid.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_ssl.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_multiprocessing.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_asyncio.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: libcuda.so.1
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen intercepted: libcrypto.so.3
dlopen intercepted: libcuda.so.1
True

Grepping through this, it's clear that only libcuda.so.1 is ever loaded. This is the CUDA Driver API library, and I'm assuming it's loaded by the statically compiled Runtime API (I'm pretty sure that PyTorch doesn't use the Driver API directly).

It could be that such a simple program does not result in the Runtime API ever being invoked, so I wrote the following PyTorch script:

import torch
import math

dtype = torch.float
device = torch.device("cuda:0")

# Create random input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d


print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

Running this as above (LD_PRELOAD=dlopen_interceptor.so python3 test.py) shows us the following:

dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_ctypes.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_struct.cpython-313-x86_64-linux-gnu.so
dlopen called with invalid filename
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_opcode.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/math.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_socket.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/zlib.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_bz2.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_lzma.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_bisect.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_random.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_datetime.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/select.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/array.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/fcntl.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_posixsubprocess.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/site-packages/torch/lib/libtorch_global_deps.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/site-packages/torch/_C.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: libcuda.so.1
dlopen intercepted: libcuda.so.1
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/site-packages/numpy/_core/_multiarray_umath.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_contextvars.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_pickle.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/site-packages/numpy/linalg/_umath_linalg.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_heapq.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/grp.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/mmap.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_json.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/binascii.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_queue.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/readline.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_hashlib.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_blake2.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/cmath.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_uuid.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_ssl.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_multiprocessing.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: /root/miniconda3/envs/pytorch3-shared/lib/python3.13/lib-dynload/_asyncio.cpython-313-x86_64-linux-gnu.so
dlopen intercepted: libcuda.so.1
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen intercepted: libcrypto.so.3
dlopen intercepted: libcuda.so.1
dlopen intercepted: libcuda.so.1
dlopen intercepted: libnvidia-ml.so.1
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen called with invalid filename
dlopen intercepted: libnvidia-ml.so.1
dlopen intercepted: libnvidia-ml.so.1
dlopen intercepted: libnvidia-ml.so.1
dlopen intercepted: libnvidia-ml.so.1
99 567.9957275390625
199 403.2545471191406
299 287.1043701171875
399 205.19406127929688
499 147.41769409179688
599 106.6561279296875
699 77.8930435180664
799 57.59303665161133
899 43.26356506347656
999 33.14699172973633
1099 26.003646850585938
1199 20.9589786529541
1299 17.395986557006836
1399 14.879144668579102
1499 13.101093292236328
1599 11.844833374023438
1699 10.957143783569336
1799 10.329837799072266
1899 9.886493682861328
1999 9.573140144348145
Result: y = 0.028899269178509712 + 0.8535827994346619 x + -0.004985605366528034 x^2 + -0.09288118034601212 x^3

The script runs without issue, and in the logs we can see that libnvidia-ml.so.1 is opened, which I think is the NVML library (and not statically compiled into PyTorch).

But no CUDA Runtime API library is ever loaded.

As far as I understand it, without recompiling PyTorch to use a shared CUDA runtime library (instead of statically linking it), it shouldn't be possible to use SCUDA to do GPU-over-IP with PyTorch unless the CUDA Driver API is implemented in full, including the parts where certain returned memory pointers are read from and written to directly by the host.
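
For completeness, here is a minimal sketch (my own illustration, not SCUDA code) of what talking to the Driver API directly looks like; these are the kinds of entry points, resolved out of libcuda.so.1, that a GPU-over-IP layer would have to forward when the Runtime API is baked into PyTorch:

import ctypes

# libcuda.so.1 is the Driver API library shipped with the NVIDIA driver;
# this is what the statically linked Runtime API dlopen()s internally.
cuda = ctypes.CDLL("libcuda.so.1")

# CUresult cuInit(unsigned int Flags);
assert cuda.cuInit(0) == 0

# CUresult cuDeviceGetCount(int *count);
count = ctypes.c_int(0)
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == 0
print("devices:", count.value)

# Allocation entry points such as cuMemHostAlloc hand back pointers that the
# host process then reads and writes directly, which is the part that is hard
# to forward over the network.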

@angzam78

A possible 'solution' could be to custom-build PyTorch with dynamic linking. The dynamically linked CUDA runtime calls would then be intercepted as expected, possibly making PyTorch easier to use with SCUDA. I used to do this with RWTH-ACS/cricket more than a year ago.

This is also the approach used by nvwacloud/tensorlink (it's on GitHub, but it is closed source). They supply a custom-built PyTorch 2.1.2 with a dynamically linked CUDA runtime library for use with their framework.

Unexpectedly, to me at least, dynamically linked binaries of PyTorch seem to be virtually nonexistent online. Building it myself is a chore; it takes almost a day to compile on my computer. I am currently doing that with pytorch-v2.5.1, but it may be a few days before I get something workable.

So this is the build command I am using:

USE_CUDA=1 USE_STATIC_CUDA=0 USE_STATIC_CUDNN=0 USE_CUDA_STATIC_LINK=0 TORCH_CUDA_ARCH_LIST="6.0 6.1 7.5 8.0 8.6 8.9 9.0" EXTRA_NVCCFLAGS="--cudart=shared" python setup.py bdist_wheel

If you try building a dynamically linked PyTorch yourself, make sure nvcc is actually being called with the option "--cudart=shared".
In the past I have had the bad experience that, even when using EXTRA_NVCCFLAGS="--cudart=shared" as per the build documentation, examining the running build processes showed the option was not being passed to nvcc!

So I use the following ugly hack...

mv /usr/local/cuda/bin/nvcc /usr/local/cuda/bin/nvcc.bin
cat << EOF > /usr/local/cuda/bin/nvcc
#!/bin/bash
# Wrapper that forces --cudart=shared on every nvcc invocation.
NVCC_PATH=/usr/local/cuda/bin/nvcc.bin
"\$NVCC_PATH" --cudart=shared "\$@"
EOF
chmod +x /usr/local/cuda/bin/nvcc

This way I am sure the option gets passed.
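
Once the wheel builds and installs, a quick sanity check (a sketch; I'm assuming the CUDA kernels end up in libtorch_cuda.so under torch/lib, which can vary between versions) is to confirm that libcudart now shows up as a dynamic dependency:

import os
import subprocess

import torch

# Path to the library that contains PyTorch's CUDA code in my build layout.
lib = os.path.join(os.path.dirname(torch.__file__), "lib", "libtorch_cuda.so")

# A dynamically linked build should list libcudart.so.<major> among its
# dependencies; a statically linked build will not.
out = subprocess.run(["ldd", lib], capture_output=True, text=True).stdout
print([line.strip() for line in out.splitlines() if "libcudart" in line])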

@nooodles2023

Tensorlink is suspended now. We made a new project to support PyTorch on remote NVIDIA GPUs. For more information, please visit gpu.tf
