process 0 terminated with signal SIGSEGV #1608

duyvuleo · 2020-01-10T04:51:48Z

Hi,

I encountered the following error when trying to train RoBERTa from scratch.

| model roberta_base, criterion MaskedLmLoss
| num. model params: 124899681 (num. trained: 124899681)
| training on 2 GPUs
| max tokens per GPU = None and max sentences per GPU = 8
| no existing checkpoint found checkpoints/checkpoint_last.pt
| loading train data for epoch 0
| loaded 171332193 examples from: /exp/fairseq/data/train
| loaded 12317550 blocks from: /exp/fairseq/data/train
| WARNING: 10358 samples have invalid sizes and will be skipped, max_positions=512, first few sample ids=[11863898, 11942383, 12142578, 7900756, 11859939, 11939476, 11783489, 11889611, 6580617, 5529364]
| using FusedAdam
Traceback (most recent call last):
File "/tools/pyvenv3-gpu-torch/bin/fairseq-train", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
File "/code/fairseq/fairseq_cli/train.py", line 355, in cli_main
nprocs=args.distributed_world_size,
File "/tools/pyvenv3-gpu-torch/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/tools/pyvenv3-gpu-torch/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 0 terminated with signal SIGSEGV

Please advise what the error is. Thanks!

My environment:

fairseq Version: master
PyTorch Version: 1.3.1
OS: Linux
How you installed fairseq (pip, source): pip install -e .
Build command you used (if compiling from source):
Python version: 3.6
CUDA/cuDNN version: 10.0
GPU models and configuration: Tesla P100
Any other relevant information:

myleott · 2020-01-10T14:57:35Z

Does other PyTorch CUDA code work in your environment?
Can you try single-GPU training, e.g. CUDA_VISIBLE_DEVICES=0 fairseq-train (...)?
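
One way to answer the first question is a small smoke test that is independent of fairseq. The sketch below is only an illustration: the helper name cuda_smoke_test is made up for this example, and it assumes the first visible GPU is the one to test. If this script itself crashes hard, the problem more likely lies in the PyTorch/CUDA installation than in fairseq.

```python
import importlib


def cuda_smoke_test():
    """Return a short status string describing whether basic CUDA work succeeds.

    Hypothetical helper for illustration; not part of fairseq or PyTorch.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "torch imported, but CUDA is not available"
    # Allocate on the first visible GPU and force a kernel launch plus a
    # device-to-host copy, so driver or library problems surface here.
    x = torch.randn(256, 256, device="cuda:0")
    total = (x @ x).sum().item()
    return "CUDA ok on %s (checksum %.3f)" % (torch.cuda.get_device_name(0), total)


if __name__ == "__main__":
    print(cuda_smoke_test())
```

Running it with CUDA_VISIBLE_DEVICES=0 set restricts the test to a single GPU, matching the single-GPU suggestion above.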

kalyangvs · 2020-01-11T07:41:39Z

If running python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))' gives an error, the following might be a solution.

Please refer to this issue.
Follow these steps.
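
The libnat check above can also be run in a child interpreter, so that a segmentation fault is reported as a status instead of killing the shell session or script that launched it. This is a minimal sketch, not something taken from fairseq: describe_exit and run_isolated are hypothetical helper names, and the interpretation of negative return codes as signal numbers assumes a POSIX system such as the Linux machine in this report.

```python
import signal
import subprocess
import sys

# The one-liner quoted in the comment above, unchanged.
LIBNAT_CHECK = (
    "from fairseq import libnat; "
    "print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))"
)


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a return code of -N means the child was killed by signal N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode


def run_isolated(snippet):
    """Run a Python snippet in a child interpreter and report how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    return describe_exit(proc.returncode), proc.stdout, proc.stderr


if __name__ == "__main__":
    status, out, err = run_isolated(LIBNAT_CHECK)
    print(status)
    print(out or err)
```

A status of "killed by SIGSEGV" would point at the compiled libnat extension, while a non-zero exit code with a Python traceback in the output would indicate an ordinary import or runtime error.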

duyvuleo · 2020-01-17T20:17:40Z

Thanks guys for your advice. I managed to run it.

kalyangvs · 2020-02-20T03:42:30Z

@duyvuleo please help to figure this out, #1720

brando90 · 2021-02-18T20:54:30Z

> Thanks guys for your advice. I managed to run it.

How did you manage to run it? What did you do? @duyvuleo

duyvuleo added needs triage question labels Jan 10, 2020

myleott removed the needs triage label Jan 10, 2020

myleott self-assigned this Jan 10, 2020

duyvuleo closed this as completed Jan 17, 2020

lucidrains mentioned this issue Jul 9, 2020

Error when running the example code lucidrains/byol-pytorch#15

Open
