
From ViT models to audio #45

Open
Antoine101 opened this issue Mar 21, 2024 · 7 comments

Comments

@Antoine101

Hi Khaled,

In your code, there is the possibility to create a ViT architecture and load the corresponding pretrained weights (like "vit_tiny_patch16_224").

Do we agree that such architectures only work with similarly sized inputs (224×224, for example)? If so, how did you fine-tune a model on AudioSet that was initially trained on ImageNet (going from 224×224 to 128×998, for example)? Is this procedure somewhere in the code in your repo?

I read the AST paper, which I guess you took inspiration from, and they discuss this in some detail.
I was just wondering how I would carry out the whole process (ImageNet -> AudioSet -> ESC50) on my end.

Thanks a lot.

Antoine

@kkoutini
Owner

Hi Antoine,

Yes, the code should support more architectures.

If the number of input channels differs, the pretrained input channels are averaged here and here.

If the input size differs (for example, 224×224 to 128×998), the only thing that changes is the positional embeddings; this is done here.
In short, the positional embeddings are interpolated to match the new size (similar to AST). After that, they are averaged over time/frequency to produce frequency/time positional embeddings.
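The interpolation step described above can be sketched in a few lines of PyTorch. This is a standalone illustration of the idea, not code from the repo; the grid sizes (a 14×14 patch grid from a 224×224 image, an 8×62 grid standing in for a 128×998 spectrogram) are illustrative.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly interpolate ViT positional embeddings from one patch grid
    to another. pos_embed: (1, old_h * old_w, dim); any cls-token embedding
    is assumed to be handled separately by the caller."""
    oh, ow = old_grid
    nh, nw = new_grid
    dim = pos_embed.shape[-1]
    pe = pos_embed.reshape(1, oh, ow, dim).permute(0, 3, 1, 2)   # (1, dim, oh, ow)
    pe = F.interpolate(pe, size=(nh, nw), mode="bilinear", align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, nh * nw, dim)

# 224x224 image, 16x16 patches -> 14x14 grid of 192-dim embeddings (ViT-Tiny).
pe = torch.randn(1, 14 * 14, 192)
pe_new = resize_pos_embed(pe, (14, 14), (8, 62))
print(pe_new.shape)  # torch.Size([1, 496, 192])

# Decomposition as described in the reply: average the interpolated grid over
# time to get a frequency embedding, and over frequency to get a time embedding.
grid = pe_new.reshape(1, 8, 62, 192)       # (batch, freq, time, dim)
freq_embed = grid.mean(dim=2)              # (1, 8, 192)
time_embed = grid.mean(dim=1)              # (1, 62, 192)
```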

@Antoine101
Author

Great, I'll have a look at all that!
Thanks a lot.

@Antoine101 Antoine101 reopened this Apr 4, 2024
@Antoine101
Author

Regarding that question, I see in the code that the lists of available architectures differ between the get_model function, the default_cfg dictionary, and the architecture functions.

default_cfg seems to be the most exhaustive, but not every architecture in this dict is covered in get_model or has a dedicated function that calls _create_vision_transformer.
Is it just that you didn't test them all or didn't have time to implement everything, or is there another specific reason?

See below:
[Screenshots of get_model, the default_cfg dictionary, and the architecture functions]

Thanks a lot.

@kkoutini
Owner

kkoutini commented Apr 4, 2024

Hi,
I took the base code from the timm library, which has links for the different models. I then added the models that I trained one by one, in the same fashion, with a link to download the weights. The missing ones are the ones that I didn't use. However, I believe it should work if you add more ViTs in the same way.
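Following the pattern described above, wiring up another architecture would mean registering a config with a weight URL and adding a matching factory function. A schematic sketch (all names, the URL, and the stub are illustrative; in the repo, _create_vision_transformer does the real work of building the model and loading weights):

```python
# Stub standing in for the repo's _create_vision_transformer; here it just
# echoes its inputs so the registration pattern can be shown end to end.
def _create_vision_transformer(variant, pretrained=False, **kwargs):
    return {"variant": variant, "pretrained": pretrained, **kwargs}

# 1) Register the config with a link to the pretrained weights (illustrative URL).
default_cfg = {
    "vit_small_patch16_224": {
        "url": "https://example.com/vit_small_patch16_224.pth",
        "input_size": (3, 224, 224),
    },
}

# 2) Add a factory function in the same style as the existing architecture
# functions, passing the variant's hyperparameters through.
def vit_small_patch16_224(pretrained=False, **kwargs):
    model_kwargs = dict(patch_size=16, embed_dim=384, depth=12, num_heads=6, **kwargs)
    return _create_vision_transformer(
        "vit_small_patch16_224", pretrained=pretrained, **model_kwargs
    )

model = vit_small_patch16_224(pretrained=False)
print(model["embed_dim"])  # 384
```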

@Antoine101
Author

Hi Khaled,

Thanks for the answer.

Regarding your first reply in this thread, concerning the adaptation/averaging of input channels: why is there a sum over dim=1 in the code instead of a mean? (In adapt_input_conv, here.)

@kkoutini
Owner

I think you are right; a mean should work better.
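The difference between the two choices can be seen concretely. In this standalone sketch (not code from the repo), summing over dim=1 reproduces exactly what the original 3-channel conv would compute on a grayscale image replicated across R, G, B, while taking the mean keeps the collapsed weights in the pretrained magnitude range, which is a factor of 3 smaller:

```python
import torch
import torch.nn.functional as F

# A ViT patch-embed-like projection: 192 filters over 3-channel 16x16 patches.
w = torch.randn(192, 3, 16, 16)
x_gray = torch.randn(1, 1, 16, 16)  # single-channel input patch

# Sum over dim=1: for a grayscale image replicated across the RGB channels,
# the summed kernel gives the same response as the original 3-channel conv.
w_sum = w.sum(dim=1, keepdim=True)
ref = F.conv2d(x_gray.repeat(1, 3, 1, 1), w)
out_sum = F.conv2d(x_gray, w_sum)
assert torch.allclose(ref, out_sum, atol=1e-4)

# Mean over dim=1: same direction, but scaled by 1/3, so weight magnitudes
# stay in the pretrained range for inputs (like spectrograms) that were
# never replicated across three channels.
w_mean = w.mean(dim=1, keepdim=True)
out_mean = F.conv2d(x_gray, w_mean)
assert torch.allclose(out_mean, out_sum / 3, atol=1e-4)
```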

@Antoine101
Author

Great, thanks for the confirmation!
