Consultation about code and details. #20
Hi, thank you for your interest in our work!
I hope this answers your questions! Please don’t hesitate to reach out if you have further inquiries, and feel free to star the repository if you find it helpful!
Thanks for your valuable reply! It has resolved my previous confusion. However, have you conducted any ablation studies regarding batch size? When using conventional batch sizes such as 2048 or 8192 in CLIP training, do the projection and loss designs in this paper still work?
Yes, the conclusion applies to smaller batches as well. We did ablate on the batch size and concluded that larger batch sizes yield better results, but once bs > 32k, the performance saturates and then starts to decline. That's why we chose bs = 32768 as the optimal hyperparameter.
As you mentioned, the embeddings of DINOv2 consist of 2 tokens, namely 1 CLS token and 1 averaged token of patch tokens. So when calculating the modified SigLIP loss, do the embeddings of DINOv2 need to be merged into a single token through some method before participating in the loss calculation? I did not find any part in the code that merges the embeddings of DINOv2. Thanks for your patient answer😊
Sure, for DINOv2 we simply take the average of all patch tokens and then concatenate the two tokens (CLS and averaged patch token) along the d dimension; the code is at https://github.com/lezhang7/SAIL/blob/main/model/vision_model.py#L203
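For reference, here is a minimal sketch of that merge (not copied from vision_model.py; the function name and shapes are assumptions):

```python
# Hypothetical sketch: average the DINOv2 patch tokens and concatenate the
# result with the CLS token along the feature (d) dimension, yielding one
# 2*d-dimensional embedding per image.
import torch

def merge_dinov2_tokens(cls_token: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
    # cls_token: (B, d), patch_tokens: (B, N, d)
    avg_patch = patch_tokens.mean(dim=1)               # (B, d)
    return torch.cat([cls_token, avg_patch], dim=-1)   # (B, 2*d)

# Example with DINOv2-L-like shapes (d = 1024, N = 256 patches)
emb = merge_dinov2_tokens(torch.randn(2, 1024), torch.randn(2, 256, 1024))
print(emb.shape)  # torch.Size([2, 2048])
```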
I noticed that you used nn.ReLU6(), which is one type of Hardtanh activation function, in the alignment module. When training with BFloat16, I encountered an error with it.
I haven't tried BF16 training; I use FP16 instead. There seem to be many ways to avoid this error. Let me know if you need further clarification! 🚀
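One common workaround (an assumption on my part, not the repo's fix) is to compute the problematic activation in FP32 and cast the result back to the input dtype:

```python
# Hypothetical workaround: if nn.ReLU6 / Hardtanh raises a dtype error under
# BFloat16, run the activation in FP32 and cast the result back.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLU6FP32(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu6(x.float()).to(x.dtype)

align = nn.Sequential(nn.Linear(1024, 1024).to(torch.bfloat16), ReLU6FP32())
x = torch.randn(4, 1024, dtype=torch.bfloat16)
y = align(x)  # activation computed in FP32, output stays bfloat16
```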
Thank you for your reply! I would like to ask about the range of the loss when training with a batch size of 32,768 on a single GPU. Considering that you are using a modified SigLIP loss, where the pairwise losses are averaged over the square of the batch size, is the loss expected to be a very small value?
Yes, it is. Here's the training curve: https://api.wandb.ai/links/le-zhang/gzd362g8. The loss is very small.
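To make the scale concrete, here is a rough sketch of a SigLIP-style sigmoid loss averaged over all B×B pairs (a simplified version for illustration, not the repo's exact implementation; the temperature and bias values are assumptions):

```python
# Simplified sigmoid pairwise loss: averaging over B*B pairs (rather than B)
# makes the reported value shrink as the batch size grows.
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b                               # (B, B)
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 on diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()                      # mean over B**2 pairs

loss = sigmoid_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```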
Thank you for your help; it's very thoughtful! I am reproducing the gte-large-en-v1.5 version of SAIL in my own codebase (training on a single GPU like you) and would like to confirm some training details with you:
There are quite a few questions. If you could provide the training curve of the GTE version of SAIL with a batch size of 6,192, it would resolve most of my doubts. Thank you very much!
You are welcome. Yes, we only optimize the contrastive_loss; we have tried to optimize additional losses, but they didn't work out. Here's the training curve of GTE: https://api.wandb.ai/links/le-zhang/isn70xbd. The loss will surge up initially, then go down after some steps. May I ask which dataset you are training on? If you use a smaller batch size, the loss might go down over more steps.
You can also manually verify that the sentences match the images while encoding them into embeddings.
I just train on CC3M with raw captions. It is the setting of row 5 in Tab. 1, right? And I observe that, although you trained with FP16 precision, the optimizer is updated with AMP because of Line 113 in e7ca112.
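For context, a typical FP16 AMP update along those lines (a generic sketch assuming torch.cuda.amp.GradScaler, not the repo's exact training loop) looks like:

```python
# Generic FP16 AMP training step: the GradScaler scales the loss to avoid
# underflow, unscales the gradients, and then performs the optimizer step.
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

def train_step(model, batch, optimizer):
    optimizer.zero_grad()
    with autocast(dtype=torch.float16):
        loss = model(batch)            # forward pass runs in FP16 where safe
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales grads, then optimizer.step()
    scaler.update()                    # adjusts the scale factor for the next step
    return loss.detach()
```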
Thank you for your excellent work, but I have some questions.