Coordinates ordering on CPU vs GPU #441

Open
Temigo opened this issue Feb 9, 2022 · 0 comments

Describe the bug

Running the exact same code, I observed that the coordinate ordering of some_tensor.C can change depending on the device (CPU vs GPU). Is this expected?


To Reproduce
This script is the smallest example I could come up with:

import numpy as np
import MinkowskiEngine as ME
import torch


def reproduce(N, device):
    # Create x
    feats = torch.rand(N, 1).to(device)
    coords = torch.cat([torch.zeros((N, 1)), torch.rand(N, 3) * 100], dim=1).to(device)
    x = ME.SparseTensor(features=feats, coordinates=coords)

    # Create mask
    mask = (torch.rand(N, 6) > 0.5).float().to(device)
    mask = ME.SparseTensor(
        coordinates=x.C,
        features=mask,
        coordinate_manager=x.coordinate_manager,
        tensor_stride=x.tensor_stride,
    )

    # Create x0
    x0 = ME.SparseTensor(
        coordinates=x.C,
        features=torch.zeros(x.F.shape[0], mask.F.shape[1]).to(device),
        coordinate_manager=x.coordinate_manager,
        tensor_stride=x.tensor_stride
    )

    # print(x.C, mask.C, x0.C ) # These are all identical
    print('Do x, mask and x0 have all the same coordinates ordering? ', (x0.C == x.C).all() and (x0.C == mask.C).all())
    # There is no a priori reason for it, but (mask + x0).C is ordered differently from x0.C on CPU, while on GPU the two orderings are identical
    # print((mask + x0).C)
    print('Do mask + x0 and x0 have the same coordinates ordering? ', ((mask + x0).C == x0.C).all())

if __name__ == '__main__':
    print('Testing on CPU')
    reproduce(10, 'cpu')
    print('Testing on GPU')
    reproduce(10, 'cuda:0')

The output I get is:

Testing on CPU
Do x, mask and x0 have all the same coordinates ordering?  tensor(True)
Do mask + x0 and x0 have the same coordinates ordering?  tensor(False)
Testing on GPU
Do x, mask and x0 have all the same coordinates ordering?  tensor(True, device='cuda:0')
Do mask + x0 and x0 have the same coordinates ordering?  tensor(True, device='cuda:0')

Expected behavior

  • If the coordinate ordering is going to change after a certain operation, I would expect the change to be consistent between CPU and GPU.
  • On GPU I have never seen the coordinate ordering change, which is what I initially expected. However, this comment suggests that this behavior is not guaranteed?
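
To tell apart "same coordinates, different row order" from "different coordinates", a check along these lines may help. This is only a sketch in plain Python: the helper name `compare_coordinates` is hypothetical (it is not part of MinkowskiEngine), and it assumes the coordinates have been converted to nested lists, for example with `some_tensor.C.tolist()`.

```python
def compare_coordinates(coords_a, coords_b):
    """Classify how two lists of coordinate rows relate.

    Returns 'identical' if the rows match in the same order,
    'permuted' if both hold the same rows in a different order,
    and 'different' otherwise.
    """
    rows_a = [tuple(row) for row in coords_a]
    rows_b = [tuple(row) for row in coords_b]
    if rows_a == rows_b:
        return "identical"
    # Sorting removes the effect of row order, so equal sorted lists
    # mean the two inputs are permutations of each other.
    if sorted(rows_a) == sorted(rows_b):
        return "permuted"
    return "different"


# Toy data standing in for x0.C.tolist() and (mask + x0).C.tolist()
a = [[0, 1, 2, 3], [0, 4, 5, 6], [0, 7, 8, 9]]
b = [[0, 7, 8, 9], [0, 1, 2, 3], [0, 4, 5, 6]]
print(compare_coordinates(a, a))
print(compare_coordinates(a, b))
```

In the reproduction above, I would expect the CPU case to fall into the "permuted" category, but I have not verified that with this helper.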

Desktop
==========System==========
Linux-3.10.0-1160.42.2.el7.x86_64-x86_64-with-glibc2.29
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
3.8.10 (default, Jun 2 2021, 10:49:15)
[GCC 9.4.0]
==========Pytorch==========
1.9.0+cu111
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 470.82.01
CUDA Version 11.4
VBIOS Version 90.02.30.40.85
Image Version G001.0000.02.04
GSP Firmware Version N/A
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
==========CC==========
/usr/bin/c++
c++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.4
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010


Additional context
We rely heavily on the coordinate ordering of some_tensor.C for various operations such as masking. This "bug" (feature?) currently prevents our code from working on CPU. This was not an issue in the past, but I have not pinpointed whether a specific version of ME introduced this behavior.
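
As a possible workaround on our side, the rows of one tensor could be realigned to another by matching coordinates instead of assuming a shared order. The sketch below is plain Python and makes a few assumptions: the function name `permutation_to_match` is hypothetical, both inputs contain the same unique coordinate rows, and the coordinates are given as nested lists (e.g. from `some_tensor.C.tolist()`).

```python
def permutation_to_match(reference_coords, other_coords):
    """Return `perm` such that other_coords[perm[i]] == reference_coords[i].

    Assumes both inputs hold the same unique coordinate rows,
    possibly in a different order.
    """
    # Map each coordinate row of `other_coords` to its row index.
    index_of = {tuple(row): i for i, row in enumerate(other_coords)}
    if len(index_of) != len(other_coords):
        raise ValueError("other_coords contains duplicate rows")
    try:
        return [index_of[tuple(row)] for row in reference_coords]
    except KeyError as err:
        raise ValueError(
            f"coordinate {err.args[0]} missing from other_coords"
        ) from None


# Toy data: `other` holds the same rows as `reference`, shuffled.
reference = [[0, 1, 2, 3], [0, 4, 5, 6], [0, 7, 8, 9]]
other = [[0, 7, 8, 9], [0, 1, 2, 3], [0, 4, 5, 6]]
other_feats = [[0.9], [0.1], [0.5]]

perm = permutation_to_match(reference, other)
aligned_feats = [other_feats[i] for i in perm]
print(perm)           # [1, 2, 0]
print(aligned_feats)  # [[0.1], [0.5], [0.9]]
```

With tensors, `perm` could presumably be converted with `torch.tensor(perm, device=...)` and used to index the feature rows. The dictionary lookup runs on the host, so this is meant as a debugging aid rather than something for a hot path.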

Referencing our code's original issue here.

Thank you!!
