Where is input normalization applied? #49
Comments
Up @kkoutini How should ImageNet normalization statistics be cascaded down to MelSpec 1 channel inputs for downstream finetuning? Where is this applied in the code? Many thanks
Hi Antoine,
Thanks for getting back to me Khaled! Ok I see! So the first training is done on ImageNet with ImageNet statistics, then the model pretrained on ImageNet is finetuned on Audioset, using Audioset statistics, correct? Are the two statistics (ImageNet's and Audioset's) not related in any way? Shouldn't ImageNet statistics have been propagated all the way down, aggregated from 3 channels to 1? Finally, I see that you normalize after applying masks. Is this the correct way to do it? I noted in your paper the following augmentations:
I struggle to understand the order in which everything happens. I would expect the following steps: waveform loading -> waveform mixup -> mel (feature computation) -> augmentations. How is it really done? Many thanks
Yes, I think you can keep the same values, mean=4.5 and std=5, if you're using the same spectrogram module.
Ah, I guess you may get improvements if you do the masking after normalizing. Yes, the correct order is loading -> waveform mixup -> mel (feature computation) -> augmentations. The waveform mixing is done first in the dataset here with the waveform augmentations.
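For readers following the ordering discussed above, here is a minimal NumPy sketch of that pipeline: waveform mixup, then feature computation, then normalization, then masking. It is not this repository's code. All function names are hypothetical, `log_spectrogram` is a plain log power spectrogram standing in for the real mel front end, and `mean`/`std` are passed in by the caller rather than taken from any particular dataset. It also illustrates one possible reason the order of the last two steps could matter: if masking fills with 0 after normalization, the masked regions sit at the dataset mean.

```python
import numpy as np


def mixup_waveforms(wave_a, wave_b, lam):
    """Blend two equal-length waveforms; lam is the weight given to wave_a."""
    return lam * wave_a + (1.0 - lam) * wave_b


def log_spectrogram(wave, n_fft=256, hop=128):
    """Stand-in for the mel front end (assumption): log power spectrogram, shape (freq, time)."""
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    return np.log(power + 1e-5).T


def normalize(spec, mean, std):
    """Single-channel normalization with scalar dataset statistics."""
    return (spec - mean) / std


def mask_bands(spec, rng, max_freq=8, max_time=8, fill=0.0):
    """Zero out one random frequency band and one random time band."""
    out = spec.copy()
    n_freq, n_time = out.shape
    f = rng.integers(0, max_freq + 1)      # band height, 0..max_freq
    f0 = rng.integers(0, n_freq - f + 1)   # band start
    out[f0:f0 + f, :] = fill
    t = rng.integers(0, max_time + 1)      # band width, 0..max_time
    t0 = rng.integers(0, n_time - t + 1)
    out[:, t0:t0 + t] = fill
    return out


def prepare_example(wave_a, wave_b, lam, mean, std, rng):
    """loading -> waveform mixup -> features -> normalize -> mask."""
    mixed = mixup_waveforms(wave_a, wave_b, lam)
    spec = log_spectrogram(mixed)
    spec = normalize(spec, mean, std)
    return mask_bands(spec, rng)  # fill=0.0 now corresponds to the mean


# Toy usage with random noise in place of real audio.
rng = np.random.default_rng(0)
wave_a = rng.standard_normal(4000)
wave_b = rng.standard_normal(4000)
ref = log_spectrogram(mixup_waveforms(wave_a, wave_b, 0.7))
out = prepare_example(wave_a, wave_b, lam=0.7, mean=ref.mean(), std=ref.std(), rng=rng)
print(out.shape)
```

In the toy usage the statistics are computed from the example itself only to keep the sketch self-contained; in practice they would be fixed values estimated once over the training set.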
Hi Khaled,
Could you please point me to where normalization is applied to inputs? (for the esc50 case or any other cases)
I am talking about channels mean and std such as written in the code below:
If the first training was done on ImageNet, then I guess the ImageNet channel means and stds are applied to Audioset inputs when finetuning on that dataset, and also to ESC50 inputs when finetuning further on that one. Am I correct?
Again, I am trying to refactor your code so that only the portion that interests us fits into our existing training scripts. But I don't see where those means and standard deviations are applied, whether in the dataset or in AugmentMel.
Thanks a lot (again)
Antoine