SIGSEGV while running train.py on a multi GPU setup #12

Closed
chandraka opened this issue Oct 25, 2019 · 15 comments

@chandraka

chandraka commented Oct 25, 2019

I have set up an Ubuntu 18.04 environment with 4 CPUs and 4 GPUs to run the LibriSpeech dataset training.

The prepare step went through fine.

But when I launch the training using:
python train.py ./librispeech-workdir/preprocessed-data/ --save-dir ./librispeech-workdir/train-output/ --max-epoch 80 --task speech_recognition_e --arch vggtransformer_2 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0 --max-tokens 5000 --log-format json --log-interval 1 --criterion cross_entropy_acc --user-dir examples/speech_recognition/

I get the following error right at the outset:

| model vggtransformer_2, criterion CrossEntropyWithAccCriterion
| num. model params: 315190057 (num. trained: 315190057)
| training on 4 GPUs
| max tokens per GPU = 5000 and max sentences per GPU = None
| no existing checkpoint found ./librispeech-workdir/train-output/checkpoint_last.pt
| loading train data for epoch 0
Traceback (most recent call last):
File "train.py", line 343, in
cli_main()
File "train.py", line 335, in cli_main
nprocs=args.distributed_world_size,
File "/home/chandraka/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/chandraka/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 0 terminated with signal SIGSEGV

Unable to proceed in the absence of any clues as to what might be causing it.

Please help

It starts out with


| distributed init (rank 3): tcp://localhost:15160
| distributed init (rank 0): tcp://localhost:15160
| distributed init (rank 2): tcp://localhost:15160
| distributed init (rank 1): tcp://localhost:15160
| initialized host espresso-2 as rank 2
| initialized host espresso-2 as rank 1
| initialized host espresso-2 as rank 3
| initialized host espresso-2 as rank 0
Namespace(adadelta_eps=1e-08, adadelta_rho=0.95, anneal_eps=False, arch='vggtransformer_2', best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=10.0, conv_dec_config='((256, 3, True),) * 4', cpu=False, criterion='cross_entropy_acc', curriculum=0, data='./librispeech-workdir/preprocessed-data/', dataset_impl=None, ddp_backend='c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:15160', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=4, empty_cache_freq=0, enc_output_dim=1024, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, input_feat_per_channel=80, keep_interval_updates=-1, keep_last_epochs=-1, log_format='json', log_interval=1, lr=[1.0], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=80, max_sentences=None, max_sentences_valid=None, max_tokens=5000, max_tokens_valid=5000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, num_workers=1, optimizer='adadelta', optimizer_overrides='{}', required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./librispeech-workdir/train-output/', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, silence_token='▁', skip_invalid_size_inputs_valid_test=False, task='speech_recognition_e', tbmf_wrapper=False, tensorboard_logdir='', tgt_embed_dim=512, threshold_loss_scale=None, tokenizer=None, train_subset='train', transformer_dec_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 6', transformer_enc_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 16', update_freq=[1], use_bmuf=False, user_dir='examples/speech_recognition/', valid_subset='valid', validate_interval=1, vggblock_enc_config='[(64, 3, 2, 2, True), (128, 3, 2, 2, True)]', warmup_updates=0, weight_decay=0.0)
| dictionary: 5001 types


(I had to rename the speech_recognition task to speech_recognition_e because there is an identically named task in the fairseq directory as well.)

@chandraka
Author

I have confirmed that this happens ONLY when two or four GPUs are involved. When I run with one GPU it works fine.

So it is probably in the multiprocessing parts. Could you please throw some light on how to fix this?
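For reference, one way to double-check the single-GPU case (a sketch, not taken from this thread; --distributed-world-size is the flag behind the distributed_world_size entry in the Namespace dump above) is to pin the job to one device and force a world size of 1, keeping the remaining flags from the original command:

CUDA_VISIBLE_DEVICES=0 python train.py ./librispeech-workdir/preprocessed-data/ --distributed-world-size 1 --save-dir ./librispeech-workdir/train-output/ --max-epoch 80 --task speech_recognition_e --arch vggtransformer_2 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0 --max-tokens 5000 --log-format json --log-interval 1 --criterion cross_entropy_acc --user-dir examples/speech_recognition/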

@freewym
Owner

freewym commented Oct 25, 2019

You are using the ASR recipe from fairseq (the one in examples/speech_recognition). If you have questions regarding that, you'd probably be better off asking the fairseq maintainers.

@chandraka
Author

I will post this on that forum. Thank you.

But otherwise, any hints on where to look or what to fix?

@freewym
Owner

freewym commented Oct 26, 2019 via email

@chandraka
Author

I am getting some other error with that example:

https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip successfully downloaded.
Archive: wikitext-103-v1.zip
creating: wikitext-103/
inflating: wikitext-103/wiki.test.tokens
inflating: wikitext-103/wiki.valid.tokens
inflating: wikitext-103/wiki.train.tokens
(base) chandraka@espresso-2:~/espresso/examples/language_model$ cd ../..
(base) chandraka@espresso-2:~/espresso$ find . -name 'speech_recognition.py'^C
(base) chandraka@espresso-2:~/espresso$ TEXT=examples/language_model/wikitext-103
(base) chandraka@espresso-2:~/espresso$ fairseq-preprocess \

--only-source \
--trainpref $TEXT/wiki.train.tokens \
--validpref $TEXT/wiki.valid.tokens \
--testpref $TEXT/wiki.test.tokens \
--destdir data-bin/wikitext-103 \
--workers 20

Namespace(align_suffix=None, alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/wikitext-103', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer='nag', padding_factor=8, seed=1, source_lang=None, srcdict=None, target_lang=None, task='translation', tbmf_wrapper=False, tensorboard_logdir='', testpref='examples/language_model/wikitext-103/wiki.test.tokens', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='examples/language_model/wikitext-103/wiki.train.tokens', user_dir=None, validpref='examples/language_model/wikitext-103/wiki.valid.tokens', workers=20)
| [None] Dictionary: 267743 types
| [None] examples/language_model/wikitext-103/wiki.train.tokens: 1801350 sents, 103227021 tokens, 0.0% replaced by <unk>
| [None] Dictionary: 267743 types
| [None] examples/language_model/wikitext-103/wiki.valid.tokens: 3760 sents, 217646 tokens, 0.0% replaced by <unk>
| [None] Dictionary: 267743 types
| [None] examples/language_model/wikitext-103/wiki.test.tokens: 4358 sents, 245569 tokens, 0.0% replaced by <unk>
| Wrote preprocessed data to data-bin/wikitext-103
(base) chandraka@espresso-2:~/espresso$
(base) chandraka@espresso-2:~/espresso$
(base) chandraka@espresso-2:~/espresso$
(base) chandraka@espresso-2:~/espresso$ fairseq-eval-lm data-bin/wikitext-103 \

--path checkpoints/transformer_wiki103/checkpoint_best.pt \
--sample-break-mode complete --max-tokens 3072 \
--context-window 2560 --softmax-batch 1024

Namespace(add_bos_token=False, bpe=None, context_window=2560, cpu=False, criterion='cross_entropy', data='data-bin/wikitext-103', dataset_impl=None, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, future_target=False, gen_subset='test', lazy_load=False, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, max_sentences=None, max_target_positions=None, max_tokens=3072, memory_efficient_fp16=False, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, no_progress_bar=False, num_shards=1, num_workers=1, optimizer='nag', output_dictionary_size=-1, output_word_probs=False, output_word_stats=False, past_target=False, path='checkpoints/transformer_wiki103/checkpoint_best.pt', quiet=False, raw_text=False, remove_bpe=None, required_batch_size_multiple=8, results_path=None, sample_break_mode='complete', seed=1, self_target=False, shard_id=0, skip_invalid_size_inputs_valid_test=False, softmax_batch=1024, task='language_modeling', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, tokens_per_sample=1024, user_dir=None, warmup_updates=0, weight_decay=0.0)
| dictionary: 267744 types
| loading model(s) from checkpoints/transformer_wiki103/checkpoint_best.pt
Traceback (most recent call last):
File "/home/chandraka/anaconda3/bin/fairseq-eval-lm", line 11, in
load_entry_point('fairseq', 'console_scripts', 'fairseq-eval-lm')()
File "/home/chandraka/espresso/fairseq_cli/eval_lm.py", line 223, in cli_main
main(args)
File "/home/chandraka/espresso/fairseq_cli/eval_lm.py", line 62, in main
task=task,
File "/home/chandraka/espresso/fairseq/checkpoint_utils.py", line 167, in load_model_ensemble
ensemble, args, _task = load_model_ensemble_and_task(filenames, arg_overrides, task)
File "/home/chandraka/espresso/fairseq/checkpoint_utils.py", line 177, in load_model_ensemble_and_task
raise IOError('Model file not found: {}'.format(filename))
OSError: Model file not found: checkpoints/transformer_wiki103/checkpoint_best.pt
(base) chandraka@espresso-2:~/espresso$

@freewym
Owner

freewym commented Oct 26, 2019

You are running eval. You should run training first, so that checkpoints/transformer_wiki103/checkpoint_best.pt exists.
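A minimal training invocation would look something like the following (a sketch assuming the flags from fairseq's language-model example; the hyperparameter values here are illustrative, not from this thread, and the save directory matches the checkpoint path used in the eval command above):

fairseq-train data-bin/wikitext-103 --task language_modeling --save-dir checkpoints/transformer_wiki103 --arch transformer_lm --max-tokens 2048 --optimizer nag --lr 0.0001 --max-update 50000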

@chandraka
Author

chandraka commented Oct 26, 2019

Sorry, my bad. I skipped a section in the README :).

No SIGSEGV. But I keep getting:


| WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 11.17 GiB total capacity; 10.32 GiB already allocated; 267.06 MiB free; 10.63 GiB reserved in total by PyTorch);
Skipping batch


I have already done the following (based on Google searches):


sudo mount -o remount,size=42949672960 /dev/shm
ulimit -n 65535
ulimit -s 81920


@freewym
Owner

freewym commented Oct 26, 2019

Reduce --max-tokens to a smaller value
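That is, the same command as before with a smaller --max-tokens value, for example (2500 is just an illustrative starting point; pick whatever fits your GPU memory):

python train.py ./librispeech-workdir/preprocessed-data/ --save-dir ./librispeech-workdir/train-output/ --max-epoch 80 --task speech_recognition_e --arch vggtransformer_2 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0 --max-tokens 2500 --log-format json --log-interval 1 --criterion cross_entropy_acc --user-dir examples/speech_recognition/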

@chandraka
Author

Now it is running fine. I don't see any output apart from some initial reporting, but I see that the Python instances are keeping the server busy. No SIGSEGV.

@freewym
Owner

freewym commented Oct 26, 2019

OK. I guess the SIGSEGV may be related to the way they load the data. Not sure though. Maybe ask them whether someone has run their speech recognition task with 2 or 4 GPUs.

@chandraka
Author

Ok. Thank you.


@phtephanx

phtephanx commented Nov 3, 2019

@chandraka
I'm trying out examples/speech_recognition as well. While the preparation step worked, starting the training with the specified command like so

python train.py ./librispeech_processed/ --save-dir examples/speech_recognition/build/model/ --max-epoch 80 --task speech_recognition --arch vggtransformer_2 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0  --max-tokens 5000 --log-format json --log-interval 1 --criterion cross_entropy_acc --user-dir espresso/examples/speech_recognition/

results in

ValueError: Cannot register task with duplicate class name (SpeechRecognitionTask)

The problem is that the task "speech_recognition" is registered twice under the same name: once under fairseq.tasks and once under examples.speech_recognition.tasks. How did you circumvent that?

@chandraka
Author

chandraka commented Nov 5, 2019

@phtephanx

I encountered this as well. Based on a grep I found that there were TWO speech_recognition.py files with the same class. So I renamed the one in examples/speech_recognition/tasks to speech_recognition_e.py (e for espresso), added an _e suffix to the @register_task decorator above the class declaration in that file, and renamed the class with an E suffix as well.

And I modified my command's --task argument to use the _e name as well.
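A sketch of what the renamed registration might look like in examples/speech_recognition/tasks/speech_recognition_e.py (the names below follow the rename described above; the rest of the task body stays as it is in the original fairseq file):

# speech_recognition_e.py -- sketch of the rename; only the registered name and class name change
from fairseq.tasks import register_task
from fairseq.tasks.fairseq_task import FairseqTask


@register_task('speech_recognition_e')  # was 'speech_recognition'
class SpeechRecognitionETask(FairseqTask):  # was SpeechRecognitionTask
    # ... original task body unchanged ...
    pass

The training command then passes --task speech_recognition_e.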

@phtephanx

@chandraka
Thank you, this does the job!

@freewym freewym closed this as completed Dec 26, 2019