SIGSEGV while running train.py on a multi-GPU setup #12
I have confirmed that this happens ONLY when two or four GPUs are involved. When I run with one GPU it works fine, so it is probably the multiprocessing parts. Could you please shed some light on fixing this? |
You are using the ASR recipe from fairseq (the one in examples/speech_recognition). If you have questions regarding that, you'd probably be better off asking there. |
I will post this on that forum, thank you. But do you have any hints otherwise on where to look or what to fix? |
Can you run an LM training using fairseq's examples with the same number of GPUs?
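(For reference, a rough sketch of the fairseq LM training invocation; the arch and flag names are from fairseq's language_model example around the time of this thread and may differ by version, so check examples/language_model/README.md in fairseq for the exact command. The point is to run it with the same GPU count that triggers the SIGSEGV.)

# Sketch only: flags from fairseq's examples/language_model README;
# verify against the README of your fairseq version.
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py data-bin/wikitext-103 \
    --task language_modeling --arch transformer_lm_wiki103 \
    --save-dir checkpoints/transformer_wiki103 --max-tokens 3072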
|
I am getting some other error with that example:
https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip successfully downloaded.
Namespace(align_suffix=None, alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/wikitext-103', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer='nag', padding_factor=8, seed=1, source_lang=None, srcdict=None, target_lang=None, task='translation', tbmf_wrapper=False, tensorboard_logdir='', testpref='examples/language_model/wikitext-103/wiki.test.tokens', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='examples/language_model/wikitext-103/wiki.train.tokens', user_dir=None, validpref='examples/language_model/wikitext-103/wiki.valid.tokens', workers=20)
Namespace(add_bos_token=False, bpe=None, context_window=2560, cpu=False, criterion='cross_entropy', data='data-bin/wikitext-103', dataset_impl=None, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, future_target=False, gen_subset='test', lazy_load=False, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, max_sentences=None, max_target_positions=None, max_tokens=3072, memory_efficient_fp16=False, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, no_progress_bar=False, num_shards=1, num_workers=1, optimizer='nag', output_dictionary_size=-1, output_word_probs=False, output_word_stats=False, past_target=False, path='checkpoints/transformer_wiki103/checkpoint_best.pt', quiet=False, raw_text=False, remove_bpe=None, required_batch_size_multiple=8, results_path=None, sample_break_mode='complete', seed=1, self_target=False, shard_id=0, skip_invalid_size_inputs_valid_test=False, softmax_batch=1024, task='language_modeling', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, tokens_per_sample=1024, user_dir=None, warmup_updates=0, weight_decay=0.0) |
You are running eval. You should run training. |
Sorry, my bad. I skipped a section in the README :). No SIGSEGV. But I keep getting: | WARNING: ran out of memory with exception: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 11.17 GiB total capacity; 10.32 GiB already allocated; 267.06 MiB free; 10.63 GiB reserved in total by PyTorch). I have already done (based on Google searches): sudo mount -o remount,size=42949672960 /dev/shm |
Reduce |
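(The reply above is truncated; it presumably means reducing the per-GPU batch size. Assuming the flag in question is --max-tokens, a minimal sketch:)

# Assumption: the truncated advice refers to --max-tokens. Lowering it
# shrinks the per-GPU batch, the usual fix for this kind of CUDA OOM.
python train.py data-bin/wikitext-103 --task language_modeling \
    --max-tokens 1536  # other flags unchanged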
Now it is running fine. I don't see any output apart from some initial reporting, but I see that the python instances are keeping the server busy. No SIGSEGV. |
OK. I guess the SIGSEGV may be related to the way they load the data. Not sure though. Maybe ask them if someone has done it with 2 or 4 GPUs for their speech recognition task. |
Ok. Thank you. |
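(A generic way to test the data-loading theory above, not suggested in the thread itself: rerun with dataloader worker processes disabled, since crashing DataLoader workers are a classic source of SIGSEGV in multi-GPU runs.)

# Assumption: generic debugging step, not from the thread. --num-workers 0
# keeps data loading in the main process of each rank; if the SIGSEGV
# disappears, the DataLoader worker processes are the likely culprit.
python train.py ./librispeech-workdir/preprocessed-data/ \
    --task speech_recognition --user-dir examples/speech_recognition/ \
    --num-workers 0  # remaining flags as in the original command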
@chandraka Running it results in an error. The problem is that the task "speech_recognition" is registered twice (once in fairseq and once in Espresso's examples/speech_recognition), so fairseq refuses the duplicate registration. |
I encountered this as well. Based on a grep, I found that there were TWO speech_recognition.py files defining the same task class. So I renamed the one in examples/speech_recognition/tasks to speech_recognition_e.py (e for Espresso), added the same _e suffix to the task name in the directive above the class declaration in that file (and renamed the class with an E suffix as well), and modified my command to pass --task speech_recognition_e. A sketch of the rename follows. |
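(A minimal sketch of the rename described above; the class name here is illustrative, and the real definitions live in examples/speech_recognition/tasks/speech_recognition.py:)

# speech_recognition_e.py (renamed copy). The string passed to
# @register_task must differ from the built-in "speech_recognition"
# task name, otherwise fairseq raises a duplicate-registration error.
from fairseq.tasks import FairseqTask, register_task

@register_task("speech_recognition_e")      # was "speech_recognition"
class SpeechRecognitionETask(FairseqTask):  # was SpeechRecognitionTask
    pass  # body unchanged from the original task class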
@chandraka |
I have set up an Ubuntu 18.04 environment with 4 CPUs and 4 GPUs to run the LibriSpeech dataset training.
The prepare step went through fine.
But when I launch the training using:
python train.py ./librispeech-workdir/preprocessed-data/ --save-dir ./librispeech-workdir/train-output/ --max-epoch 80 --task speech_recognition_e --arch vggtransformer_2 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0 --max-tokens 5000 --log-format json --log-interval 1 --criterion cross_entropy_acc --user-dir examples/speech_recognition/
I get the following error right at the outset:
| model vggtransformer_2, criterion CrossEntropyWithAccCriterion
| num. model params: 315190057 (num. trained: 315190057)
| training on 4 GPUs
| max tokens per GPU = 5000 and max sentences per GPU = None
| no existing checkpoint found ./librispeech-workdir/train-output/checkpoint_last.pt
| loading train data for epoch 0
Traceback (most recent call last):
  File "train.py", line 343, in <module>
    cli_main()
  File "train.py", line 335, in cli_main
    nprocs=args.distributed_world_size,
  File "/home/chandraka/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/chandraka/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGSEGV
I am unable to proceed in the absence of any clues as to what might be causing it. Please help.
It starts out with:
| distributed init (rank 3): tcp://localhost:15160
| distributed init (rank 0): tcp://localhost:15160
| distributed init (rank 2): tcp://localhost:15160
| distributed init (rank 1): tcp://localhost:15160
| initialized host espresso-2 as rank 2
| initialized host espresso-2 as rank 1
| initialized host espresso-2 as rank 3
| initialized host espresso-2 as rank 0
Namespace(adadelta_eps=1e-08, adadelta_rho=0.95, anneal_eps=False, arch='vggtransformer_2', best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=10.0, conv_dec_config='((256, 3, True),) * 4', cpu=False, criterion='cross_entropy_acc', curriculum=0, data='./librispeech-workdir/preprocessed-data/', dataset_impl=None, ddp_backend='c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:15160', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=4, empty_cache_freq=0, enc_output_dim=1024, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, input_feat_per_channel=80, keep_interval_updates=-1, keep_last_epochs=-1, log_format='json', log_interval=1, lr=[1.0], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=80, max_sentences=None, max_sentences_valid=None, max_tokens=5000, max_tokens_valid=5000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, num_workers=1, optimizer='adadelta', optimizer_overrides='{}', required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./librispeech-workdir/train-output/', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, silence_token='▁', skip_invalid_size_inputs_valid_test=False, task='speech_recognition_e', tbmf_wrapper=False, tensorboard_logdir='', tgt_embed_dim=512, threshold_loss_scale=None, tokenizer=None, train_subset='train', transformer_dec_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 6', transformer_enc_config='((1024, 16, 4096, True, 0.15, 0.15, 0.15),) * 16', update_freq=[1], use_bmuf=False, user_dir='examples/speech_recognition/', valid_subset='valid', validate_interval=1, vggblock_enc_config='[(64, 3, 2, 2, True), (128, 3, 2, 2, True)]', warmup_updates=0, weight_decay=0.0)
| dictionary: 5001 types
(I have had to rename the speech_recognition task to speech_recognition_e, as there is a similarly named task in the fairseq directory as well.)
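(A general debugging note, not from this thread: when torch.multiprocessing.spawn only reports "process 0 terminated with signal SIGSEGV", enabling faulthandler inside the spawned worker makes the crashing rank print its own Python stack before dying:)

# Assumption: generic debugging aid, not part of Espresso. Place this at
# the top of the function each spawned rank runs (e.g. fairseq's
# distributed main) so a SIGSEGV dumps that rank's traceback to stderr.
import faulthandler
faulthandler.enable(all_threads=True)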