process 0 terminated with signal SIGSEGV #1608

Closed
duyvuleo opened this issue Jan 10, 2020 · 5 comments

@duyvuleo

duyvuleo commented Jan 10, 2020

Hi,

I encountered the following error when trying to train RoBERTa from scratch.

| model roberta_base, criterion MaskedLmLoss
| num. model params: 124899681 (num. trained: 124899681)
| training on 2 GPUs
| max tokens per GPU = None and max sentences per GPU = 8
| no existing checkpoint found checkpoints/checkpoint_last.pt
| loading train data for epoch 0
| loaded 171332193 examples from: /exp/fairseq/data/train
| loaded 12317550 blocks from: /exp/fairseq/data/train
| WARNING: 10358 samples have invalid sizes and will be skipped, max_positions=512, first few sample ids=[11863898, 11942383, 12142578, 7900756, 11859939, 11939476, 11783489, 11889611, 6580617, 5529364]
| using FusedAdam
Traceback (most recent call last):
File "/tools/pyvenv3-gpu-torch/bin/fairseq-train", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
File "/code/fairseq/fairseq_cli/train.py", line 355, in cli_main
nprocs=args.distributed_world_size,
File "/tools/pyvenv3-gpu-torch/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/tools/pyvenv3-gpu-torch/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 0 terminated with signal SIGSEGV

Please advise what the error is. Thanks!

My environment:

  • fairseq Version: master
  • PyTorch Version: 1.3.1
  • OS: Linux
  • How you installed fairseq (pip, source): pip install -e .
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: 10.0
  • GPU models and configuration: Tesla P100
  • Any other relevant information:
@myleott
Contributor

myleott commented Jan 10, 2020

  1. Does other PyTorch CUDA code work in your environment?
  2. Can you try single GPU training? CUDA_VISIBLE_DEVICES=0 fairseq-train (...)
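
For the first question, a minimal sketch of a CUDA sanity check might look like the following. The helper name cuda_smoke_test and its status strings are hypothetical (they are not part of fairseq or PyTorch); the sketch only assumes a PyTorch install and runs a small matrix multiply on each visible GPU.

```python
# Hypothetical helper (not part of fairseq or PyTorch): a quick CUDA sanity check.
try:
    import torch
except ImportError:  # keep the script usable on machines without PyTorch
    torch = None


def cuda_smoke_test():
    """Return a short status string describing whether basic CUDA work succeeds."""
    if torch is None:
        return "torch-not-installed"
    if not torch.cuda.is_available():
        return "cuda-not-available"
    # Exercise every visible GPU with a small matrix multiply.
    for index in range(torch.cuda.device_count()):
        device = torch.device("cuda", index)
        x = torch.randn(256, 256, device=device)
        y = x @ x
        torch.cuda.synchronize(device)  # wait for the kernel so errors surface here
        if not torch.isfinite(y).all():
            return "bad-result-on-gpu-%d" % index
    return "ok"


if __name__ == "__main__":
    print(cuda_smoke_test())
```

If this prints ok with all GPUs visible and again with CUDA_VISIBLE_DEVICES=0, basic PyTorch CUDA code works and the crash is more likely tied to something fairseq-specific, such as a compiled extension.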

@kalyangvs
Contributor

If this python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))' gives an error, the following might be a solution.

Please refer to this issue and follow these steps.
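
A probe like the one-liner above can take down the interpreter that runs it if a compiled extension segfaults. One way to keep that contained is to run the probe in a child interpreter and report how it exited. This is only a sketch under stated assumptions: run_isolated is a hypothetical helper name, it relies on the POSIX convention that a negative return code means the child was killed by a signal, and it uses Python's standard faulthandler option to get a traceback on stderr.

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run `code` in a child interpreter and describe how it exited.

    Hypothetical helper for illustration; POSIX-only signal reporting.
    """
    proc = subprocess.run(
        # -X faulthandler makes the child dump a Python traceback on a fatal signal.
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        status = signal.Signals(-proc.returncode).name
    else:
        status = "exit %d" % proc.returncode
    return status, proc.stderr


if __name__ == "__main__":
    # Simulate a crash by having the child send SIGSEGV to itself.
    status, stderr = run_isolated(
        "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    )
    print(status)
    print(stderr)
```

Passing the import under suspicion as the code string (for example, the libnat import from the one-liner above) would show whether that import alone crashes the child, with the faulthandler traceback in the returned stderr pointing at where it died.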

@duyvuleo
Author

Thanks, guys, for your advice. I managed to run it.

@kalyangvs
Contributor

@duyvuleo, please help figure this out: #1720

@brando90

Thanks, guys, for your advice. I managed to run it.

How did you manage to run it? What did you do? @duyvuleo
