Segmentation fault with Levenshtein Transformer (build issue with libnat) #1346

Closed
xiaoshengjun opened this issue Nov 5, 2019 · 25 comments

@xiaoshengjun

Hi, I am training a new Levenshtein Transformer on my own dataset, which was preprocessed with preprocess.py using '--joined-dictionary'. Training on two GPUs fails with the error below. Could you help me find the cause? Thanks very much.

File "train.py", line 343, in <module>
cli_main()
File "train.py", line 335, in cli_main
nprocs=args.distributed_world_size,
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 1 terminated with signal SIGSEGV

When I train the model on 3 GPUs, the error is:

170848, 512
Traceback (most recent call last):
File "train.py", line 343, in <module>
cli_main()
File "train.py", line 335, in cli_main
nprocs=args.distributed_world_size,
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 2 terminated with signal SIGSEGV
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/root/anaconda3/envs/len_transform/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/root/anaconda3/envs/len_transform/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated

@xiaoshengjun
Author

Hi @sdll, the fixes from issues #1308 and #1305 do not solve the problem.

@myleott
Contributor

myleott commented Nov 13, 2019

This should be fixed now, can you please try again?

@xiaoshengjun
Author

This should be fixed now, can you please try again?

Hi, thanks for your reply. I downloaded the latest version of fairseq and preprocessed the data with '--joined-dictionary', but training a new model still fails with:

Traceback (most recent call last):
File "train.py", line 337, in <module>
cli_main()
File "train.py", line 329, in cli_main
nprocs=args.distributed_world_size,
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 3 terminated with signal SIGSEGV

could you help me?

@MultiPath
Contributor

MultiPath commented Nov 22, 2019

This seems to be a multi-GPU error. Can you try running on a single GPU for debugging?

@xiaoshengjun
Author

seems to be a multi-gpu error? Can you try running on single GPU for debugging?

Hi, I tried a single GPU. The program exits right after loading the training data without printing anything, and I can see that the GPU is not being used.

@MultiPath
Contributor

@xiaoshengjun just following up. Is the problem still there in the recently refactored codebase?

@kalyangvs
Contributor

@MultiPath this still occurs with the recent codebase.
It happens even without apex, with a smaller max-tokens setting, and with a newer gcc version and nvcc installed.
I installed fairseq via python setup.py build_ext --inplace.
Even on a single GPU, it ends in a segmentation fault.
Please recheck.

@myleott
Contributor

myleott commented Dec 17, 2019

I just ran the example command in the README and it works fine.

Can you confirm that libnat is built properly? Please try running the following:

$ python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))'

You should see something like: [[[0], [0], [0], [0], [5, 6], [0, 1, 0, 0]]]
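
Because a crash inside a compiled extension kills the interpreter outright, it can help to run the check in a child process and inspect how that child ended. The following is only a rough sketch, not something the maintainers posted: the helper name run_isolated is made up for illustration, it assumes a POSIX system, and it relies on the standard-library -X faulthandler option so the child prints its Python stack on a fatal signal.

```python
import signal
import subprocess
import sys

# The check from the comment above; it only works where fairseq is importable.
LIBNAT_CHECK = (
    "from fairseq import libnat; "
    "print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))"
)


def run_isolated(code):
    """Run `code` in a child interpreter and return (status, stdout, stderr).

    -X faulthandler makes the child dump its Python stack to stderr if it
    receives a fatal signal such as SIGSEGV.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if proc.returncode == 0:
        status = "ok"
    elif proc.returncode < 0:
        # On POSIX, a negative return code is the number of the signal
        # that killed the child.
        status = "killed by " + signal.Signals(-proc.returncode).name
    else:
        status = "exit code %d" % proc.returncode
    return status, proc.stdout.strip(), proc.stderr.strip()


if __name__ == "__main__":
    status, out, err = run_isolated(LIBNAT_CHECK)
    print(status)
    print(out if status == "ok" else err)
```

A status of "killed by SIGSEGV" points at the compiled libnat module rather than at the training script. Setting PYTHONFAULTHANDLER=1 in the environment enables the same stack dump for a full train.py run.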

@xiaoshengjun
Author

python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))'

Thanks for your reply. I ran the command you gave above and got:

from fairseq import libnat
print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))
Segmentation fault

so something is wrong in my environment?

@xiaoshengjun
Author

@xiaoshengjun just following up. Is the problem still there in the recently refactored codebase?

Yes, I still have the problem.

@myleott changed the title from "Train a new Levenshtein Transformer?" to "Trouble building libnat" on Dec 18, 2019
@myleott
Contributor

myleott commented Dec 18, 2019

so something is wrong in my environment?

Yes, libnat has not been built properly. Can you please run python setup.py build_ext --inplace and share the output here?

@myleott self-assigned this on Dec 18, 2019
@xiaoshengjun
Author

so something is wrong in my environment?

Yes, libnat has not been built properly. Can you please run python setup.py build_ext --inplace and share the output here?

hi, the output is:

python setup.py build_ext --inplace

which: no nvcc in (/root/anaconda3/envs/fq09py12/bin:/root/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/ibutils/bin:/usr/bin:/usr/local/bin:/usr/libexec/git-core:/root/bin)
running build_ext
/root/anaconda3/envs/fq09py12/lib/python3.6/site-packages/torch/utils/cpp_extension.py:196: UserWarning:

                           !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (g++ 4.8.5) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 4.9 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.

See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
for instructions on how to install GCC 4.9 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                          !! WARNING !!

warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
skipping 'fairseq/data/data_utils_fast.cpp' Cython extension (up-to-date)
skipping 'fairseq/data/token_block_utils_fast.cpp' Cython extension (up-to-date)
copying build/lib.linux-x86_64-3.6/fairseq/libbleu.cpython-36m-x86_64-linux-gnu.so -> fairseq
copying build/lib.linux-x86_64-3.6/fairseq/data/data_utils_fast.cpython-36m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.6/fairseq/data/token_block_utils_fast.cpython-36m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.6/fairseq/libnat.cpython-36m-x86_64-linux-gnu.so -> fairseq

@myleott
Contributor

myleott commented Dec 18, 2019

Please follow the instructions in the warning message :)
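
The warning in the build output names g++ 4.8.5 and asks for a compiler that is ABI-compatible with GCC 4.9 or above. As a small sketch of how to check that before rebuilding, the script below compares the version reported by g++ -dumpversion with that minimum. It assumes the build picks up the g++ found on PATH; the function names are illustrative only, and a compiler installed under a different name (for example the one in $CXX) can be passed as the compiler argument.

```python
import shutil
import subprocess

MIN_GCC = (4, 9)  # minimum version named in the PyTorch warning above


def parse_version(text):
    """Turn '4.8.5' (or a bare major such as '9') into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split(".") if part.isdigit())


def meets_minimum(version, minimum=MIN_GCC):
    # Tuples compare element by element, so (4, 8, 5) < (4, 9) < (7, 3, 0).
    return version >= minimum


def compiler_version(compiler="g++"):
    """Return the version tuple of `compiler`, or None if it is not on PATH."""
    if shutil.which(compiler) is None:
        return None
    result = subprocess.run(
        [compiler, "-dumpversion"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_version(result.stdout)


if __name__ == "__main__":
    version = compiler_version()
    if version is None:
        print("g++ not found on PATH")
    elif meets_minimum(version):
        print(version, "meets the GCC 4.9 minimum")
    else:
        print(version, "is older than GCC 4.9")
```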

@kalyangvs
Contributor

kalyangvs commented Dec 19, 2019

so something is wrong in my environment?

Yes, libnat has not been built properly. Can you please run python setup.py build_ext --inplace and share the output here?

My output:
running build_ext
skipping 'fairseq/data/data_utils_fast.cpp' Cython extension (up-to-date)
skipping 'fairseq/data/token_block_utils_fast.cpp' Cython extension (up-to-date)
copying build/lib.linux-x86_64-3.7/fairseq/libbleu.cpython-37m-x86_64-linux-gnu.so -> fairseq
copying build/lib.linux-x86_64-3.7/fairseq/data/data_utils_fast.cpython-37m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.7/fairseq/data/token_block_utils_fast.cpython-37m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.7/fairseq/libnat.cpython-37m-x86_64-linux-gnu.so -> fairseq

And the snippet python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))' gives Segmentation fault.

I re-cloned the repo and tried installing, which gives the following error:
Complete output (12 lines):
running develop
running egg_info
writing fairseq.egg-info/PKG-INFO
writing dependency_links to fairseq.egg-info/dependency_links.txt
writing entry points to fairseq.egg-info/entry_points.txt
writing requirements to fairseq.egg-info/requires.txt
writing top-level names to fairseq.egg-info/top_level.txt
reading manifest file 'fairseq.egg-info/SOURCES.txt'
writing manifest file 'fairseq.egg-info/SOURCES.txt'
running build_ext
cythoning fairseq/data/data_utils_fast.pyx to fairseq/data/data_utils_fast.cpp
error: /home/workspace/nat/fairseq/fairseq/data/data_utils_fast.pyx

@kalyangvs
Contributor

Steps:

  1. conda create -n nat python=3.7 && conda activate nat
  2. git clone https://github.com/pytorch/fairseq.git && cd fairseq
  3. conda install -c psi4 gcc-5
  4. conda install libgcc -y
  5. pip install torch && pip install cython
  6. python setup.py build_ext
  7. python setup.py install --user

@myleott
Contributor

myleott commented Dec 19, 2019

Can you install gcc from the main channel instead? Something like:

conda install gcc_linux-64 gxx_linux-64

There should be versions for other platforms/architectures too.

Here's what I just ran and it works as expected:

  1. conda create -n nat python=3.7 && conda activate nat
  2. git clone https://github.com/pytorch/fairseq.git && cd fairseq
  3. conda install gcc_linux-64 gxx_linux-64
  4. pip install torch && pip install cython
  5. python setup.py build_ext --inplace
  6. python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))'
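
After step 5 it can be worth confirming that an in-place build actually left a libnat module next to the Python sources, and that it was built for the interpreter now in use (the logs in this thread show both cpython-36m and cpython-37m builds). The sketch below is an assumption-laden helper rather than part of fairseq: the function names are hypothetical, and it assumes a POSIX build where the file is named like the libnat.cpython-37m-x86_64-linux-gnu.so seen in the build output.

```python
import glob
import os
import sysconfig


def find_built_extensions(repo_root, name="libnat"):
    """List compiled `name` modules that an in-place build left in repo_root/fairseq."""
    pattern = os.path.join(repo_root, "fairseq", name + ".*.so")
    return sorted(glob.glob(pattern))


def matches_interpreter(path):
    """True if the file carries the extension suffix of the running interpreter."""
    return path.endswith(sysconfig.get_config_var("EXT_SUFFIX"))


if __name__ == "__main__":
    # Run from the root of the fairseq checkout.
    found = find_built_extensions(os.getcwd())
    if not found:
        print("no libnat extension found; run python setup.py build_ext --inplace")
    for path in found:
        note = "matches this Python" if matches_interpreter(path) else "built for another Python"
        print(path, "(" + note + ")")
```

A file that exists but was built for another Python version, or with the old compiler, is a reason to delete the build directory and rebuild.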

@xiaoshengjun
Author

Please follow the instructions in the warning message :)

The problem was solved after I updated gcc. Thanks.

@kalyangvs
Contributor

Solved, thanks.

@myleott closed this as completed on Dec 20, 2019
@kalyangvs
Contributor

@myleott when I run the scaling-nmt translation example (https://github.com/pytorch/fairseq/blame/master/examples/scaling_nmt/README.md#L40) in the above conda environment with --encoder-layerdrop 0.2, GPU utilisation goes to 100% but training never starts and appears stuck. With --ddp-backend no_c10d it works fine.
Without --encoder-layerdrop/--decoder-layerdrop it works with or without --ddp-backend no_c10d.
Is this intended behaviour?

@myleott
Contributor

myleott commented Dec 25, 2019

Yes, no_c10d is required when some of the model parameters are not used in the forward pass, as is the case with LayerDrop.
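
To make "parameters not used in the forward pass" concrete, here is a toy simulation in plain Python. It is not fairseq's implementation: the function names and the choice of 6 layers are made up for illustration, and it only models the idea that each layer is skipped with probability layerdrop during a training step.

```python
import random


def layers_run(num_layers, layerdrop, rng):
    """Indices of layers executed in one forward pass.

    Each layer is skipped independently with probability `layerdrop`.
    """
    return [i for i in range(num_layers) if rng.random() >= layerdrop]


def layers_without_gradient(num_layers, layerdrop, rng):
    """Layers skipped in this pass.

    Their parameters take no part in the forward computation, so the
    backward pass produces no gradient for them.
    """
    used = set(layers_run(num_layers, layerdrop, rng))
    return [i for i in range(num_layers) if i not in used]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in range(3):
        skipped = layers_without_gradient(6, 0.2, rng)
        print("step", step, "layers with no gradient:", skipped)
```

With a non-zero layerdrop, some steps leave whole layers without gradients. A backend that waits for a gradient from every parameter can then stall, which is consistent with the hang described above and with the advice to use no_c10d.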

@luofuli

luofuli commented May 13, 2020

great! It works!!!!!!!!

facebook-github-bot pushed a commit that referenced this issue Oct 14, 2020
Summary: Pull Request resolved: fairinternal/fairseq-py#1346

Reviewed By: xianxl

Differential Revision: D24306363

Pulled By: myleott

fbshipit-source-id: 90c4b59031f04b925ad12a13a96d9225ab0a09b4
mawright added a commit to mawright/fairseq that referenced this issue Oct 21, 2020
mawright added a commit to mawright/fairseq that referenced this issue Oct 27, 2020
sshleifer pushed a commit that referenced this issue Apr 7, 2021
@KinWaiCheuk

KinWaiCheuk commented Jul 29, 2021

After finishing steps 1-5 of @myleott's instructions, I get the following error at step 6:

Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: /workspace/public_data/raven/fairseq/fairseq/libnat.cpython-37m-x86_64-linux-gnu.so: failed to map segment from shared object

@KinWaiCheuk

KinWaiCheuk commented Jul 29, 2021

I have figured out the cause. It is a permission problem with a mounted disk. If I git clone the fairseq repo onto the mounted disk and run pip install --editable ./, I get the above-mentioned error.

If I git clone the repo onto my system disk and run pip install --editable ./, the error is gone.

Solution 1:

According to this Stack Overflow question, you can try to fix the disk permissions by remounting the disk with the exec option:

sudo umount /data
sudo mount -o exec /dev/sda4 /data

Replace /data and /dev/sda4 with your own mount point and device respectively.
If that works, great. For me, it was not that easy.
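
The remount with -o exec above suggests the disk was mounted noexec, which prevents shared objects on it from being mapped as executable. As a hedged sketch for checking this before remounting anything, the script below asks the operating system for the mount flags of the filesystem that holds a path. It assumes Linux, where os.ST_NOEXEC is available; the function names are illustrative only.

```python
import os


def is_mounted_noexec(path):
    """True if the filesystem holding `path` is mounted with the noexec flag."""
    return bool(os.statvfs(path).f_flag & os.ST_NOEXEC)


def mount_options(mounts_line):
    """Parse one /proc/mounts line into (mount_point, set_of_options)."""
    fields = mounts_line.split()
    return fields[1], set(fields[3].split(","))


if __name__ == "__main__":
    # Run from the directory that contains the fairseq checkout.
    target = os.getcwd()
    if is_mounted_noexec(target):
        print(target, "is on a noexec mount; compiled extensions there cannot be loaded")
    else:
        print(target, "allows execution")
```

The second helper can be applied to lines read from /proc/mounts to see the options of every mounted disk, for example to confirm that a remount took effect.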

Solution 2:

BE CAREFUL, this solution will wipe out all data on your mounted disk. Please backup all your files before you proceed!

In my case, the mounted disk itself had a problem, so I had to reformat it with the following command:
mkfs.ext4 /dev/sda4

After reformatting, I could mount it with the following command:
sudo mount -o exec /dev/sda4 /data
