About the evaluation with clipdino336 #7

Open
XpracticeYSKM opened this issue Oct 9, 2024 · 2 comments

Comments

@XpracticeYSKM

XpracticeYSKM commented Oct 9, 2024

Thanks for your awesome work!
After configuring the environment according to the repository, I ran the clipdino336 evaluation script:
accelerate launch --num_processes=1 -m lmms_eval --model llava --model_args pretrained="checkpoint/llava_clipdino336_stage2",device_map="cuda" --tasks ok_vqa --batch_size 1 --log_samples --log_samples_suffix llava_clipdino336_stage2 --output_path ./logs/llava_clipdino336_stage2,

but the following error occurred:
[lmms_eval/models/llava.py:528] ERROR Error Sizes of tensors must match except in dimension 2. Expected size 576 but got size 256 for tensor number 1 in the list. in generating

After debugging, I found that the CLIP feature shape is [1, 576, 1024] while the DINO feature shape is [1, 256, 1024]. The two features cannot be concatenated because their spatial (token) dimensions differ. Is there an error in this part of the code, and could you provide the correct version?

    def encode_images(self, images):
        if type(images) is not list:
            image_features = self.get_model().get_vision_tower()(images)
            image_features = self.get_model().mm_projector(image_features)
        else:
            vision_tower = self.get_model().get_vision_tower()
            if type(vision_tower) is nn.ModuleList:
                f_list = []
                for i, v in enumerate(vision_tower):
                    image_features = v(images[i])  # CLIP: [1, 576, 1024]; DINO: [1, 256, 1024]
                    f_list.append(image_features)
                import pdb; pdb.set_trace()  # breakpoint added while debugging
                image_features = torch.cat(f_list, dim=-1)  # fails here: token counts (dim 1) differ
                image_features = self.get_model().mm_projector(image_features)
        return image_features
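
For reference, the mismatch comes from the patch grids: CLIP ViT-L/14 at 336x336 input yields (336 / 14)^2 = 576 patch tokens, while DINOv2 ViT-L/14 at its default 224x224 input yields (224 / 14)^2 = 256, so the concatenation along the hidden dimension cannot line up. A minimal standalone sketch (not repo code) that reproduces the error:

    import torch

    # CLIP ViT-L/14 @ 336x336  -> (336 // 14) ** 2 = 576 patch tokens
    # DINOv2 ViT-L/14 @ 224x224 -> (224 // 14) ** 2 = 256 patch tokens
    clip_feat = torch.randn(1, 576, 1024)  # [batch, tokens, hidden]
    dino_feat = torch.randn(1, 256, 1024)

    try:
        torch.cat([clip_feat, dino_feat], dim=-1)  # requires matching token counts in dim 1
    except RuntimeError as e:
        print(e)  # Sizes of tensors must match except in dimension 2. Expected size 576 but got size 256 ...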

@bronyayang
Owner

The shapes of the CLIP and DINO features look weird to me. They should be something like [224, 1024] or [576, 1024]; they should not be that large. This should not be a concat bug; you should look into the image encoding or preprocessing instead. Let me know if you solved it, because I think I replied a bit too late.

@DogNeverSleep

Hi, I encountered the same issue before. You can manually modify the input size of the DINOv2 model. Simply uncomment these two lines to resolve the problem:

# setattr(self.image_processor, 'crop_size', {'width': 336, 'height': 336})
# setattr(self.image_processor, 'size', {'width': 336, 'height': 336})
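
For context, here is a standalone sketch of what those two lines do (the checkpoint name and surrounding code are assumptions, not taken from this repo): they force the DINOv2 image processor to resize and crop to 336x336, so DINOv2 ViT-L/14 also produces (336 / 14)^2 = 576 patch tokens and its features can be concatenated with the CLIP-336 features.

    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    # Checkpoint name is an assumption for illustration only.
    processor = AutoImageProcessor.from_pretrained("facebook/dinov2-large")
    setattr(processor, 'crop_size', {'width': 336, 'height': 336})
    setattr(processor, 'size', {'width': 336, 'height': 336})

    model = AutoModel.from_pretrained("facebook/dinov2-large")
    image = Image.new("RGB", (640, 480))
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    print(out.last_hidden_state.shape)  # [1, 1 + 576, 1024]: CLS token + 24x24 patch grid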
