Thanks for your awesome work!
After configuring the environment according to the repository, I ran the clipdino336 script:
accelerate launch --num_processes=1 -m lmms_eval --model llava --model_args pretrained="checkpoint/llava_clipdino336_stage2",device_map="cuda" --tasks ok_vqa --batch_size 1 --log_samples --log_samples_suffix llava_clipdino336_stage2 --output_path ./logs/llava_clipdino336_stage2
but an error occurred:
[lmms_eval/models/llava.py:528] ERROR Error Sizes of tensors must match except in dimension 2. Expected size 576 but got size 256 for tensor number 1 in the list. in generating
After debugging, I found that the CLIP feature shape is [1, 576, 1024] while the DINO feature shape is [1, 256, 1024], so the two features cannot be concatenated because their spatial dimensions differ. Is there an error in this part of the code, and could you provide the correct version?
def encode_images(self, images):
    if type(images) is not list:
        image_features = self.get_model().get_vision_tower()(images)
        image_features = self.get_model().mm_projector(image_features)
    else:
        vision_tower = self.get_model().get_vision_tower()
        if type(vision_tower) is nn.ModuleList:
            f_list = []
            for i, v in enumerate(vision_tower):
                image_features = v(images[i])
                f_list.append(image_features)
            import pdb; pdb.set_trace()  # breakpoint added while debugging
            image_features = torch.cat(f_list, dim=-1)
            image_features = self.get_model().mm_projector(image_features)
    return image_features
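For context, the mismatch follows directly from ViT patch arithmetic: CLIP ViT-L/14 at a 336 px input yields a 24×24 = 576 token grid, while DINOv2 at its default 224 px input yields 16×16 = 256 tokens, so torch.cat along the channel dimension fails on dimension 1. A minimal sketch of the arithmetic (num_patches is an illustrative helper, not part of the repo):

```python
def num_patches(image_size: int, patch_size: int = 14) -> int:
    """Number of ViT patch tokens for a square input image."""
    return (image_size // patch_size) ** 2

# CLIP ViT-L/14 at 336 px vs. DINOv2 at its default 224 px input
clip_tokens = num_patches(336)  # 24 * 24 = 576
dino_tokens = num_patches(224)  # 16 * 16 = 256
print(clip_tokens, dino_tokens)  # 576 256
```

This is why both towers must see the same effective resolution (or have their token grids resampled) before the features can be concatenated channel-wise.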
The shapes of the CLIP and DINO features look weird to me. They should be something like [224, 1024] or [576, 1024], not that large. This shouldn't be a concat bug; look into the image encoding or preprocessing instead. Let me know if you solved it, because I think I replied a bit too late.
Hi, I encountered the same issue before. You can manually modify the input size of the DINOv2 model. Simply uncomment these two lines to resolve the problem.
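If changing the DINOv2 input size isn't convenient, another workaround (not the fix suggested above, just an alternative) is to bilinearly resample the DINO token grid onto CLIP's grid before concatenation. A minimal sketch, assuming square token grids; align_token_grid is a hypothetical helper, not part of the repo:

```python
import torch
import torch.nn.functional as F

def align_token_grid(feats: torch.Tensor, target_tokens: int) -> torch.Tensor:
    """Bilinearly resample ViT patch features [B, N, C] to a new square grid of target_tokens."""
    b, n, c = feats.shape
    side = int(n ** 0.5)                  # e.g. 256 tokens -> 16x16 grid
    target_side = int(target_tokens ** 0.5)
    x = feats.transpose(1, 2).reshape(b, c, side, side)       # [B, C, 16, 16]
    x = F.interpolate(x, size=(target_side, target_side),
                      mode="bilinear", align_corners=False)   # [B, C, 24, 24]
    return x.reshape(b, c, target_tokens).transpose(1, 2)     # [B, 576, C]

dino_feats = torch.randn(1, 256, 1024)
print(align_token_grid(dino_feats, 576).shape)  # torch.Size([1, 576, 1024])
```

After this, both feature maps are [1, 576, 1024] and torch.cat(..., dim=-1) produces [1, 576, 2048] as the projector expects. Resampling features this way loses some spatial fidelity compared with actually feeding DINOv2 a 336 px input, so the input-size fix is preferable when available.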