Segmentation fault (core dumped) error for multiple GPUs #47

Open
theonegis opened this issue Oct 23, 2018 · 6 comments
Comments

@theonegis

Environment:

  • Python: 3.6
  • PyTorch: 0.4.0
  • OS: Ubuntu 18.04.1 LTS
  • CUDA: V9.1.85
  • GPU: Tesla K80

Problem:

I was running a model that does not need BatchNorm, so I modified the original DenseNet slightly. Here is the code snippet:

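import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as cp


def _cat_function_factory(conv, relu):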
    def cat_function(*inputs):
        concated_features = torch.cat(inputs, 1)
        bottleneck_output = relu(conv(concated_features))
        return bottleneck_output
    return cat_function


class _DenseLayer(nn.Module):
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate):
        super(_DenseLayer, self).__init__()
        self.add_module('conv1', nn.Conv2d(num_input_features, bn_size * growth_rate, 1))
        self.add_module('relu1', nn.ReLU(inplace=True))
        self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate, 3, padding=1))
        self.add_module('relu2', nn.ReLU(inplace=True))
        self.drop_rate = drop_rate

    def forward(self, *inputs):
        cat_function = _cat_function_factory(self.conv1, self.relu1)
        if any(feature.requires_grad for feature in inputs):
            output = cp.checkpoint(cat_function, *inputs)
        else:
            output = cat_function(*inputs)
        new_features = self.relu2(self.conv2(output))
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
        return new_features


class _DenseBlock(nn.Module):
    def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            layer = _DenseLayer(num_input_features + i * growth_rate,
                                growth_rate, bn_size, drop_rate)
            self.add_module(f'denselayer{i + 1}', layer)

    def forward(self, init_features):
        features = [init_features]
        for name, layer in self.named_children():
            new_features = layer(*features)
            features.append(new_features)
        return torch.cat(features, 1)

It runs on a single GPU, but it throws a Segmentation fault (core dumped) error when running on multiple GPUs. What could be causing this issue?

@theonegis
Author

I noticed a similar issue in the PyTorch repository: "Segfault in dataparallel + checkpoint" (#11732). It seems that it has not been fixed yet.

@Ushk

Ushk commented Oct 24, 2018

@theonegis - I raised the original issue. Just to check whether they are similar problems, can you copy the faulthandler output here, to see if it also points to cp.checkpoint being the issue?

import faulthandler
faulthandler.enable()

Putting these two lines at the beginning of your code should output a traceback when it segfaults.

(Apologies to the PyTorch devs if this is not helpful, I'm just curious)
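
As a small extension of the faulthandler suggestion above, here is a minimal sketch that writes the traceback to a file, so it is not lost in training output. The helper name `enable_crash_traceback` and the log file name are hypothetical; only `faulthandler.enable` and its `file` and `all_threads` parameters come from the standard library.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Enable faulthandler and write fatal-error tracebacks to log_path.

    The returned file object must stay open for as long as faulthandler
    is enabled, because faulthandler keeps using its file descriptor.
    """
    log_file = open(log_path, "w")
    # all_threads=True (the default) dumps the stack of every thread,
    # which matters here because the backward pass runs in several threads.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical usage at the very top of a training script:
#     crash_log = enable_crash_traceback("segfault_traceback.log")
#     ... build the model and run training ...
```

The same handler can also be switched on without touching the code, either with `python -X faulthandler run.py` or by setting the `PYTHONFAULTHANDLER` environment variable. In that case the traceback goes to stderr.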

@theonegis
Author

@Ushk

Fatal Python error: Segmentation fault

Thread 0x00007f9894e36700 (most recent call first):
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 144 in <listcomp>
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 144 in _flatten_dense_tensors
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 119 in <listcomp>
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 119 in reduce_add_coalesced
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 39 in forward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 28 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply

Thread 0x00007f9855e18700 (most recent call first):
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 252 in _take_tensors
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 118 in reduce_add_coalesced
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 39 in forward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 28 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply

Thread 0x00007f9897637700 (most recent call first):
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/nccl.py", line 14 in is_available
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 76 in reduce_add
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 120 in reduce_add_coalesced
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 39 in forward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 28 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply

Current thread 0x00007f989ca35700 (most recent call first):
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 45 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 76 in apply

Thread 0x00007f9897e38700 (most recent call first):

Thread 0x00007f990ceb0740 (most recent call first):
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89 in backward
  File "/home/theonegis/Applications/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/tensor.py", line 93 in backward
  File "/home/theonegis/Developer/DenseNet/experiment.py", line 34 in train_on_epoch
  File "/home/theonegis/Developer/DenseNet/experiment.py", line 100 in _train
  File "/home/theonegis/Developer/DenseNet/experiment.py", line 141 in train
  File "run.py", line 57 in <module>

@gpleiss
Owner

gpleiss commented Oct 25, 2018

@theonegis what happens if you upgrade to the latest stable version of PyTorch (0.4.1)?

@theonegis
Author

@gpleiss Still the same problem.

@Ushk

Ushk commented Oct 26, 2018

Yeah, just an FYI, I'm on 0.4.1 as well, and I can see that yours is also a checkpoint issue. What happens if you checkpoint -all- of your layers @theonegis?
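
One possible reading of "checkpoint all of your layers" is to run both convolution stages of the dense layer inside a single checkpoint, instead of only the concatenation and first convolution. The sketch below is an illustration only: the class name `FullyCheckpointedDenseLayer` is hypothetical, dropout is left out for brevity, and the `use_reentrant` argument assumes a recent PyTorch release (it does not exist in 0.4.x, where it would have to be dropped). Nothing in this thread establishes whether this avoids the segfault.

```python
import torch
import torch.nn as nn
import torch.utils.checkpoint as cp


class FullyCheckpointedDenseLayer(nn.Module):
    """Hypothetical variant of _DenseLayer with every stage checkpointed."""

    def __init__(self, num_input_features, growth_rate, bn_size):
        super().__init__()
        self.conv1 = nn.Conv2d(num_input_features, bn_size * growth_rate, 1)
        self.conv2 = nn.Conv2d(bn_size * growth_rate, growth_rate, 3, padding=1)
        self.relu = nn.ReLU()

    def _layer_function(self, *inputs):
        # Concatenate all earlier feature maps, then apply both conv stages.
        features = torch.cat(inputs, 1)
        bottleneck = self.relu(self.conv1(features))
        return self.relu(self.conv2(bottleneck))

    def forward(self, *inputs):
        if any(feature.requires_grad for feature in inputs):
            # Intermediate activations are dropped after the forward pass
            # and recomputed during backward.
            # Assumption: use_reentrant is accepted (recent PyTorch only).
            return cp.checkpoint(self._layer_function, *inputs, use_reentrant=False)
        return self._layer_function(*inputs)
```

A layer like this takes the same `*features` argument list that `_DenseBlock.forward` passes to `_DenseLayer` in the original post.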


3 participants