From ViT models to audio #45
Comments
Hi Antoine, Yes, the code should support more architectures. If the number of input channels is different, the input channels are averaged here and here. If the input size is different (for example,
Great, I'll have a look at all that!
Regarding that question, I see in the code that there are different lists of architectures available between
Thanks a lot.
Hi,
Hi Khaled, Thanks for the answer. Regarding your first reply on this thread, concerning the adaptation/averaging of input channels, why is there a sum over dim=1 in the code instead of a mean? In adapt_input_conv here
I think you are right
Great, thanks for the confirmation!
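To make the sum-versus-mean question above concrete, here is a minimal, framework-free sketch of the arithmetic. It is not the repository's adapt_input_conv; the helper names conv_response and collapse_channels and the toy kernel values are made up for illustration. It only shows that, when a 3-channel kernel is collapsed to 1 channel, summing over the channel axis reproduces the response the original filter gives to a grayscale image copied into all three channels, while averaging scales that response by 1/3.

```python
def conv_response(kernel, patch):
    """Response of one conv filter at one location.

    kernel and patch are nested lists shaped [channels][height][width].
    """
    return sum(
        k * p
        for k_ch, p_ch in zip(kernel, patch)
        for k_row, p_row in zip(k_ch, p_ch)
        for k, p in zip(k_row, p_row)
    )


def collapse_channels(kernel, reduce="sum"):
    """Collapse a [C][h][w] kernel to [1][h][w] by summing or averaging over C."""
    n = len(kernel)
    h, w = len(kernel[0]), len(kernel[0][0])
    scale = 1.0 if reduce == "sum" else 1.0 / n
    return [[
        [scale * sum(kernel[c][i][j] for c in range(n)) for j in range(w)]
        for i in range(h)
    ]]


# Toy 3-channel 2x2 kernel (hypothetical values, not real pretrained weights).
rgb_kernel = [
    [[1.0, 2.0], [0.0, -1.0]],
    [[0.5, 0.0], [1.0, 1.0]],
    [[-1.0, 1.0], [2.0, 0.0]],
]
gray = [[1.0, 2.0], [3.0, 4.0]]
rgb_patch = [gray, gray, gray]  # the same grayscale patch copied to R, G and B

reference = conv_response(rgb_kernel, rgb_patch)
summed = conv_response(collapse_channels(rgb_kernel, "sum"), [gray])
averaged = conv_response(collapse_channels(rgb_kernel, "mean"), [gray])

print(reference, summed, averaged)
```

Which of the two is preferable depends on how the single-channel input is scaled relative to the original RGB data, so this sketch should be read as an illustration of the difference rather than a recommendation.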
Hi Khaled,
In your code, there is the possibility to create a ViT architecture and load the corresponding pretrained weights (like "vit_tiny_patch16_224").
Do we agree that such architectures only work with similar size inputs (224*224 for example)? If so, how did you finetune a model on Audioset that was initially trained on Imagenet (going from 224*224 to 128*998 for example)? Is this procedure in some code in your repo?
I read the AST paper, which I guess you took inspiration from, and they talk about it in some detail.
I was just wondering how I would do the whole process (ImageNet -> AudioSet -> ESC50) on my end.
Thanks a lot.
Antoine
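As background for the input-size question, one common way to reuse ImageNet position embeddings on a different input shape is to resize the 2D grid of patch position embeddings by bilinear interpolation. The sketch below is a plain-Python illustration of that idea only and is not necessarily what this repository does. The function name resize_pos_embed is hypothetical, the grid sizes assume non-overlapping 16x16 patches (14x14 for a 224*224 image, 8x62 for a 128*998 spectrogram), and the class-token embedding, which is usually kept separately, is ignored.

```python
def resize_pos_embed(grid, new_h, new_w):
    """Bilinearly resize a position-embedding grid.

    grid is a nested list shaped [old_h][old_w][dim]; the result is shaped
    [new_h][new_w][dim]. Corner positions of the old and new grids coincide.
    """
    old_h, old_w = len(grid), len(grid[0])
    dim = len(grid[0][0])

    def src_coord(i, new_n, old_n):
        # Map an output index to a fractional input index (corner-aligned).
        return 0.0 if new_n == 1 else i * (old_n - 1) / (new_n - 1)

    out = []
    for i in range(new_h):
        y = src_coord(i, new_h, old_h)
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            x = src_coord(j, new_w, old_w)
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            row.append([
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ])
        out.append(row)
    return out


# Toy 14x14 grid whose 2-dim "embedding" is just the (row, col) position,
# standing in for real pretrained embeddings.
old_grid = [[[float(i), float(j)] for j in range(14)] for i in range(14)]

# Assumed target grid: 128 // 16 = 8 rows (frequency), 998 // 16 = 62 columns (time).
new_grid = resize_pos_embed(old_grid, 8, 62)
print(len(new_grid), len(new_grid[0]), len(new_grid[0][0]))
```

With real weights the same operation would be applied to the pretrained embedding tensor before finetuning; overlapping patches or a different patch size would change the target grid shape.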