Segmentation fault with Levenshtein Transformer (build issue with libnat) #1346
Comments
Oh, thank you.
This should be fixed now, can you please try again?
Hi, thanks for your reply. I downloaded the latest version of fairseq and preprocessed the data with `--joined-dictionary`, but when I trained a new model, the problem is: Traceback (most recent call last): Could you help me?
Seems to be a multi-GPU error? Can you try running on a single GPU for debugging?

Hi, I have tried a single GPU: after the model loads the training data, the process just exits without printing anything, and I can see the GPU is not being used.
@xiaoshengjun just following up. Is the problem still there in the recent refactored codebase? |
@MultiPath this still occurs in recent codebase. |
I just ran the example command in the README and it works fine. Can you confirm that libnat is built properly? Please try running the following:

```
python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))'
```

If libnat is built correctly, this should print the suggested edit path without raising an ImportError.
Thanks for your reply. I ran the command you gave above and got:

So something is wrong in my environment?
Yes, I am hitting the same problem.
Yes, libnat has not been built properly. Can you please run:

```
python setup.py build_ext --inplace
```
Hi, the output is:

```
$ python setup.py build_ext --inplace
which: no nvcc in (/root/anaconda3/envs/fq09py12/bin:/root/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/ibutils/bin:/usr/bin:/usr/local/bin:/usr/libexec/git-core:/root/bin)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
```
Please follow the instructions in the warning message :)
My output and the snippet are above. I re-cloned the repo and tried installing, and it gives the following error. Steps:
Can you install gcc from the main channel instead? There should be versions for other platforms/architectures too. Here's what I just ran, and it works as expected.
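The concrete install command was not preserved in this thread; a hedged sketch for a Linux x86-64 conda environment follows. The package and compiler binary names are assumptions about conda's toolchain packages and vary by platform and conda version:

```shell
# Install a newer GCC toolchain into the active conda environment
# (Linux x86-64 package names; other platforms use different suffixes).
conda install gcc_linux-64 gxx_linux-64

# Point the extension build at the conda compilers before rebuilding;
# the exact binary prefix depends on the toolchain package version.
export CC="$CONDA_PREFIX/bin/x86_64-conda-linux-gnu-gcc"
export CXX="$CONDA_PREFIX/bin/x86_64-conda-linux-gnu-g++"
python setup.py build_ext --inplace
```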
The problem was solved after I updated my version of gcc, thanks.

Solved, thanks.
@myleott when I run the scaling-nmt translation example (https://github.com/pytorch/fairseq/blame/master/examples/scaling_nmt/README.md#L40) in the above conda environment along with --encoder-layerdrop 0.2, GPU utilisation sits at 100% and training never starts, but with --ddp-backend no_c10d it works fine.
Yes, no_c10d is required when some of the model parameters are not used in the forward pass, as is the case with LayerDrop. |
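A sketch of how the two flags fit together in a fairseq-train invocation. The data path, architecture, and hyperparameters below are placeholders, not the exact command from the scaling NMT example:

```shell
# LayerDrop randomly skips layers in the forward pass, so some parameters
# go unused; the no_c10d DDP backend tolerates this, c10d does not.
fairseq-train data-bin/wmt16_en_de_bpe32k \
    --arch transformer_vaswani_wmt_en_de_big \
    --encoder-layerdrop 0.2 \
    --ddp-backend no_c10d \
    --max-tokens 3584 --fp16
```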
Great! It works!
Summary: Pull Request resolved: fairinternal/fairseq-py#1346 Reviewed By: xianxl Differential Revision: D24306363 Pulled By: myleott fbshipit-source-id: 90c4b59031f04b925ad12a13a96d9225ab0a09b4
…cebookresearch#1346)" This reverts commit c4d322a.
For me, after finishing steps 1-5, I get the following error at step 6: Traceback (most recent call last):
I have figured out the solution. It is due to a permission problem on a mounted disk: if I git clone the fairseq repo inside the mounted disk and build there, it fails; if I git clone the repo inside my system disk and build there, it works.

Solution 1: According to this Stack Overflow question, you can check whether you can fix the disk permissions, and you should change them accordingly.

Solution 2: BE CAREFUL, this solution will wipe out all data on your mounted disk. Please back up all your files before you proceed! In my case there was some problem with the mounted disk, and I needed to reformat it. After reformatting, I could mount it again.
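A generic sketch of such a permission check. The device and mount point below are assumptions for illustration, not taken from the original report:

```shell
# Inspect how the disk is mounted; options such as noexec or ro can break
# building or loading compiled extensions from that disk.
mount | grep /mnt/data

# Check ownership and permissions on the cloned repository.
ls -ld /mnt/data/fairseq

# Try remounting with write and exec enabled (adjust the mount point).
sudo mount -o remount,rw,exec /mnt/data
```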
Hi, when I trained a new Levenshtein Transformer on my datasets (preprocessed by preprocess.py with '--joined-dictionary'), it showed this error. I trained the model on two GPUs. Could you help me find the reason? Thanks very much.
```
Traceback (most recent call last):
  File "train.py", line 343, in <module>
    cli_main()
  File "train.py", line 335, in cli_main
    nprocs=args.distributed_world_size,
  File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 1 terminated with signal SIGSEGV
```
And when I trained the model on 3 GPUs, the error is:
```
170848, 512
Traceback (most recent call last):
  File "train.py", line 343, in <module>
    cli_main()
  File "train.py", line 335, in cli_main
    nprocs=args.distributed_world_size,
  File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 2 terminated with signal SIGSEGV
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/anaconda3/envs/len_transform/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/root/anaconda3/envs/len_transform/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
```