
Training with multiple GPUs #164

Merged
merged 10 commits into master from multi-gpu
Nov 28, 2023

Conversation

lfoppiano
Copy link
Collaborator

This PR adds support for multi-GPU training. It has been tested on multiple GPUs on the same node (4 x 16 GB), which allows using a larger batch size.

I wanted to implement it in the trainer, but the processing related to data preparation needs to be under the with strategy.scope():.

I implemented it only for sequence labelling; once it's reviewed, I will update classification as well.
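The pattern described above can be sketched as follows. This is a minimal illustration assuming TensorFlow's tf.distribute.MirroredStrategy, not DeLFT's actual trainer code; the function name and the toy model/data are hypothetical placeholders.

```python
# Sketch of the multi-GPU pattern: model creation and data preparation
# both live under strategy.scope(), which is why moving this into the
# trainer was not straightforward. Hypothetical, simplified example.
import tensorflow as tf

def train_multi_gpu(per_replica_batch_size=20, epochs=1):
    # MirroredStrategy replicates the model on every visible GPU
    # (and falls back to a single CPU replica when none is present).
    strategy = tf.distribute.MirroredStrategy()
    # The batch passed to fit() is the *global* batch, split across replicas.
    global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

    with strategy.scope():
        # Toy model and data stand in for the real sequence-labelling setup.
        model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
        model.compile(optimizer="adam", loss="mse")
        features = tf.random.normal((64, 4))
        labels = tf.random.normal((64, 2))
        dataset = tf.data.Dataset.from_tensor_slices((features, labels))
        dataset = dataset.batch(global_batch_size)

    model.fit(dataset, epochs=epochs, verbose=0)
    return global_batch_size
```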

@lfoppiano lfoppiano marked this pull request as ready for review August 22, 2023 05:41
@lfoppiano lfoppiano added the enhancement New feature or request label Aug 22, 2023
@lfoppiano lfoppiano changed the title Training using multiple GPU Training with multiple GPUs Aug 22, 2023
@lfoppiano
Copy link
Collaborator Author

lfoppiano commented Aug 24, 2023

I've extended the support to the other scripts of the sequence labelling.
Overall, I'm not sure how useful this feature is for increasing the batch_size, because performance does not improve when increasing it during fine-tuning 😭
On the other hand, it's definitely nice to have when testing big BERT models, because it allows keeping the same parameters (e.g. batch_size=20) that would not be possible without multi-GPU.

For example, in my previous tests, batteryonlybert https://huggingface.co/batterydata/batteryonlybert-cased had better results, but that was due to using batch_size=10 to avoid the OOM that occurred with batch_size=20. Fine-tuning with --multi-gpu and batch_size=20 resulted in lower scores.
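The reason the same global batch fits with multi-GPU is that each device only sees its share of the batch. A quick sanity check of that arithmetic (plain Python, numbers taken from the discussion above):

```python
def per_replica_batch(global_batch_size, num_gpus):
    """Each GPU sees global_batch_size / num_gpus examples per step."""
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must divide evenly across GPUs")
    return global_batch_size // num_gpus

# With 4 GPUs, a global batch of 20 that OOMs on a single 16 GB device
# becomes a manageable 5 examples per device.
print(per_replica_batch(20, 4))  # → 5
```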

@kermitt2
Copy link
Owner

Thank you @lfoppiano! I was not able to test with a multi-GPU setup, so I just tested with a normal single GPU, which works fine as expected.

I think this is useful, as you say, for larger models (keeping the same batch size), but also for prediction, because we can increase the batch size and process texts more rapidly.

@kermitt2
Copy link
Owner

kermitt2 commented Aug 31, 2023

Doing more tests: training is fine, but there is a failure when writing a model with the --multi-gpu option while having a single GPU:

python3 delft/applications/grobidTagger.py date train_eval --architecture BidLSTM_CRF --embedding glove-840B --multi-gpu

....

_________________________________________________________________
    f1 (micro): 95.78
                  precision    recall  f1-score   support

           <day>     0.9091    0.9524    0.9302        42
         <month>     0.9344    0.9661    0.9500        59
          <year>     1.0000    0.9688    0.9841        64

all (micro avg.)     0.9521    0.9636    0.9578       165

model config file saved
preprocessor save
model saved
Exception ignored in: <function Pool.__del__ at 0x7f3fa42ccb80>
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Without --multi-gpu this error does not appear and the model is saved.

Of course, using --multi-gpu while having a single GPU might not be a very sensible user action!

@lfoppiano
Copy link
Collaborator Author

lfoppiano commented Sep 1, 2023

@kermitt2 thanks for testing it. I will add the option for the inference too.

@lfoppiano
Copy link
Collaborator Author

The OSError: [Errno 9] Bad file descriptor seems to be due to tensorflow/tensorflow#50487 and should be fixed in e169867
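The traceback above shows a multiprocessing.Pool being garbage-collected at interpreter shutdown, after its file descriptors are already gone. The general remedy is to close and join pools explicitly before teardown; a minimal stdlib illustration of that pattern (not the actual fix applied in the linked commit):

```python
# Close/join a multiprocessing.Pool explicitly so it is not cleaned up
# by __del__ at interpreter shutdown, which triggers the
# "OSError: [Errno 9] Bad file descriptor" noise seen above.
import multiprocessing

def square(x):
    return x * x

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)
    try:
        results = pool.map(square, range(5))
    finally:
        pool.close()  # stop accepting new work
        pool.join()   # wait for workers to exit before teardown
    print(results)  # → [0, 1, 4, 9, 16]
```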

@kermitt2 kermitt2 merged commit f8adbf5 into master Nov 28, 2023
1 check passed
@lfoppiano lfoppiano deleted the multi-gpu branch December 17, 2023 22:33