
Consultation about code and details. #20

Open

LuFan31 opened this issue Dec 15, 2024 · 13 comments

@LuFan31

LuFan31 commented Dec 15, 2024

Thank you for your excellent work, but I have some questions.

  1. From Tab. 1 of the paper, it seems that the method applies a single linear layer with GLU activation, but there appears to be no mention of a GLU activation function in the code. Have I missed something or misunderstood?
  2. Have you tried other language models, such as T5, as the text encoder? Were GTE-en-large-v1.5 and NV-Embed-v2 selected because they can output a CLS token for alignment with the image's CLS token?
  3. At the Alignment Tuning stage, are only the pre-encoded image and text CLS tokens loaded onto the GPU?
@lezhang7
Owner

Hi, thank you for your interest in our work!

  1. GLU stands for Gated Linear Unit. It's not an activation function. You can find its implementation in our codebase here: [GLU Implementation](https://github.com/lezhang7/SAIL/blob/main/model/linear.py#L7); a generic sketch is also shown after this list.

  2. In the paper, we mainly tested language models trained to produce text embeddings as text encoders. T5, while it includes an encoder, is primarily used as a text decoder model and is rarely used to produce text embeddings. Regarding how global sentence embeddings are represented, we follow the specific recipe of each model. You can refer to the pooling strategies for each model here: [Pooling Strategy](https://github.com/lezhang7/SAIL/blob/main/model/language_model.py#L33); a generic sketch of common pooling choices is also shown after this list.

  3. We extract both vision and text embeddings. For some models, like DINOv2, embeddings are formed by concatenating the [CLS] token with the average of patch tokens. However, this may vary depending on the model used. You can check the exact implementation for each model here: [Vision Embedding Extraction](https://github.com/lezhang7/SAIL/blob/main/model/vision_model.py#L147).
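
For reference, here are two generic, illustrative sketches of the ideas in points 1 and 2 above. They are rough approximations, not the exact code in linear.py or language_model.py.

A minimal GLU-style projection (the module name and the choice of sigmoid gating are assumptions):

    import torch
    import torch.nn as nn

    class GatedLinearProjection(nn.Module):
        """Generic gated linear unit: one linear layer produces a value half
        and a gate half, which are combined multiplicatively."""
        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.proj = nn.Linear(in_dim, 2 * out_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            value, gate = self.proj(x).chunk(2, dim=-1)
            return value * torch.sigmoid(gate)

And common pooling recipes for turning a HuggingFace-style encoder output into a single sentence embedding (the function name and strategy set are illustrative; the per-model recipes in language_model.py may differ):

    import torch

    def pool_sentence_embedding(last_hidden_state, attention_mask, strategy="cls"):
        # last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len)
        if strategy == "cls":
            # Use the first ([CLS]) token as the sentence embedding
            return last_hidden_state[:, 0]
        if strategy == "mean":
            # Average over non-padding tokens only
            mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
            return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        if strategy == "last":
            # Use the last non-padding token (common for decoder-style embedders)
            idx = attention_mask.sum(dim=1).long() - 1
            return last_hidden_state[torch.arange(last_hidden_state.size(0)), idx]
        raise ValueError(f"Unknown pooling strategy: {strategy}")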

I hope this answers your questions! Please don’t hesitate to reach out if you have further inquiries, and feel free to star the repository if you find it helpful!

@LuFan31
Author

LuFan31 commented Dec 17, 2024

Thanks for your valuable reply! It has resolved my previous confusion. However, have you conducted any ablation studies on batch size? When using conventional batch sizes such as 2048 or 8192 from CLIP training, do the projection and loss designs in this paper still work?

@lezhang7
Copy link
Owner

Yes, the conclusion applies to smaller batches as well. We did ablate the batch size and found that larger batch sizes yield better results, but once the batch size exceeds 32k, performance saturates and starts to decline. That's why we chose bs=32768 as the optimal hyperparameter.

@LuFan31
Author

LuFan31 commented Dec 22, 2024

As you mentioned, the embedding from DINOv2 consists of two parts, namely the [CLS] token and the average of the patch tokens. So when calculating the modified SigLIP loss, do the DINOv2 embeddings need to be merged into a single vector in some way before participating in the loss calculation? I did not find any part of the code that merges the DINOv2 embeddings. Thanks for your patient answer😊

@lezhang7
Owner

lezhang7 commented Dec 22, 2024

Sure, for DINOv2 we simply take the average of all the patch tokens and then concatenate it with the [CLS] token along the feature dimension. The code is at https://github.com/lezhang7/SAIL/blob/main/model/vision_model.py#L203

embedding = torch.cat([cls_token, patch_tokens.mean(dim=1)], dim=1)
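
As a small self-contained illustration of that line (the shapes and variable names here are assumptions, not the repo's exact code):

    import torch

    # Illustrative shapes: cls_token is (B, D), patch_tokens is (B, N, D)
    B, N, D = 2, 256, 1024
    cls_token = torch.randn(B, D)
    patch_tokens = torch.randn(B, N, D)

    # Concatenate the [CLS] token with the mean of the patch tokens along
    # the feature dimension, giving a single (B, 2 * D) image embedding.
    embedding = torch.cat([cls_token, patch_tokens.mean(dim=1)], dim=1)
    assert embedding.shape == (B, 2 * D)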

@LuFan31
Author

LuFan31 commented Jan 7, 2025

I noticed that you used nn.ReLU6(), which is implemented via the Hardtanh activation function, in the alignment module. When training with BFloat16, I encountered a RuntimeError: "hardtanh_backward_cuda" not implemented for 'BFloat16'. Have you tried training with BFloat16 or encountered such an error?

@lezhang7
Owner

lezhang7 commented Jan 7, 2025

I haven't tried BF16 training; I use FP16 instead. There seem to be many ways to avoid this error. Here's a reply from ChatGPT:

It looks like the issue arises because nn.ReLU6() internally uses the Hardtanh activation function, whose backward pass does not support BFloat16 on CUDA. This causes the "hardtanh_backward_cuda" not implemented for 'BFloat16' error when training with mixed or low precision.

Potential Solutions:

  1. Replace ReLU6() with ReLU()
    ReLU6(x) = min(max(0, x), 6), while ReLU() only clips at zero; in many cases you may not need the upper bound of 6. Try:

    alignment_layer = nn.ReLU()

    ReLU is fully supported in BFloat16 and should avoid the error.

  2. Cast to float32 Around F.hardtanh()
    If you need ReLU6, explicitly cast the input to float32 before the activation:

    import torch.nn.functional as F
    
    x = x.float()  # Convert to float32 before activation
    x = F.hardtanh(x, min_val=0, max_val=6)
    x = x.to(torch.bfloat16)  # Convert back to BFloat16 if needed
  3. Manually Implement ReLU6 Without Hardtanh
    If you require ReLU6 but want to avoid Hardtanh, define a custom function:

    import torch
    import torch.nn as nn

    class ReLU6Function(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return x.clamp(0, 6)

        @staticmethod
        def backward(ctx, grad_output):
            # Gradient is 1 only where the input was strictly inside (0, 6)
            (x,) = ctx.saved_tensors
            return grad_output * ((x > 0) & (x < 6)).to(grad_output.dtype)

    class CustomReLU6(nn.Module):
        def forward(self, x):
            return ReLU6Function.apply(x)
  4. Locally Disable Autocast Around ReLU6
    If your model trains with automatic mixed precision (AMP), you can run just this activation in float32 by disabling autocast locally:

    with torch.cuda.amp.autocast(enabled=False):
        x = F.relu6(x.float())

    This forces ReLU6 (and its backward pass) to run in float32 while the rest of the model stays in BFloat16.

Recommendation:

If ReLU() works fine for your alignment module, replacing ReLU6() with ReLU() is the simplest solution. If you need the 6 cap, consider using F.hardtanh() with casting or a custom function.

Let me know if you need further clarification! 🚀

@LuFan31
Author

LuFan31 commented Jan 7, 2025

Thank you for your reply! I would like to ask about the range of the loss when training with a batch size of 32,768 on a single GPU. Considering that you are using a modified SigLIP loss, where the sum is divided by the square of the batch size, is the loss expected to be a very small value?

@lezhang7
Owner

lezhang7 commented Jan 7, 2025

Yes, it is. Here's the training curve: https://api.wandb.ai/links/le-zhang/gzd362g8. The loss is very small.
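
For intuition, here is a minimal sketch of a sigmoid (SigLIP-style) contrastive loss averaged over all n × n pairs, which is what makes the reported values so small at large batch sizes. This is a generic sketch based on the description in this thread, not the repo's exact implementation:

    import torch
    import torch.nn.functional as F

    def sigmoid_contrastive_loss(img_emb, txt_emb, logit_scale, logit_bias):
        # img_emb, txt_emb: (n, d) embeddings, assumed L2-normalized
        n = img_emb.size(0)
        logits = logit_scale * img_emb @ txt_emb.t() + logit_bias  # (n, n)
        # +1 on the diagonal (matched pairs), -1 everywhere else
        labels = 2 * torch.eye(n, device=logits.device) - 1
        # Dividing by n * n (rather than n) shrinks the value as n grows
        return -F.logsigmoid(labels * logits).sum() / (n * n)

At n = 32,768 the sum is spread over roughly a billion pair terms, so very small per-step values are expected.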

@LuFan31
Author

LuFan31 commented Jan 14, 2025

Thank you for your help; it's very thoughtful! I am reproducing the gte-large-en-v1.5 version of SAIL in my own codebase (training on a single GPU like you) and would like to confirm some training details with you:

  1. Did you use only losses['contrastive_loss'] to optimize the alignment layer, i.e., total_loss = losses['contrastive_loss']?

  2. I noticed that losses['contrastive_loss'] at the first step is inversely proportional to the batch size during training. Is this correct?
    Since I did not pre-extract the image and text embeddings, I am training with a batch size of 6,192 using FP16 precision (approximately one-fifth of 32,768). In the initial stage, my loss is roughly five times the loss shown in the curve you provided. Although the curve you provided is for the NV version, considering that only the parameters of the alignment layer are being trained, the initial loss values for different versions of SAIL should be approximately the same, right?

  3. Does the GTE version of SAIL converge more slowly? In my current setting (apart from the batch size of 6,192, all other hyperparameters and the optimizer follow your settings), the model has not converged by the second epoch.

That's quite a few questions. If you could provide the training curve of the GTE version of SAIL with a batch size of 6,192, it would resolve most of my doubts. Thank you very much!

@lezhang7
Owner

You are welcome. Yes, we only optimize the contrastive_loss; we tried optimizing additional losses, but they didn't work out.

Here's the training curve of GTE: https://api.wandb.ai/links/le-zhang/isn70xbd. The loss surges initially, then goes down after some steps. May I ask which dataset you are training on? With a smaller batch size, the loss might take more steps to go down.

@lezhang7
Owner

You can also manually verify that the sentences match the images when encoding them into embeddings.

@LuFan31
Author

LuFan31 commented Jan 14, 2025

I train on CC3M with raw captions. That is the setting of row 5 in Tab. 1, right?
[screenshot of Tab. 1 from the paper]

And I observe that, although you trained with FP16 precision, the optimizer is updated with AMP because args.precision defaults to 'amp' in the code, which causes the scaler to be set to torch.amp.GradScaler() in:

scaler.step(optimizer)
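
For context, the surrounding GradScaler update pattern usually looks roughly like this (a generic sketch; model, optimizer, and the losses dict are placeholders inferred from this thread, not the repo's exact training loop):

    import torch

    def train_step(model, optimizer, scaler, images, texts):
        optimizer.zero_grad()
        # Forward pass runs under autocast when args.precision == 'amp'
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            losses = model(images, texts)
            total_loss = losses['contrastive_loss']
        scaler.scale(total_loss).backward()  # scale the loss before backward
        scaler.step(optimizer)               # unscales grads; skips the step on inf/NaN
        scaler.update()                      # adjusts the loss scale for the next step

    # scaler = torch.amp.GradScaler()  # what the default 'amp' precision setting creates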
