
Where is input normalization applied? #49

Open
Antoine101 opened this issue Apr 19, 2024 · 4 comments

Comments

@Antoine101

Hi Khaled,

Could you please point me to where normalization is applied to inputs? (for the esc50 case or any other cases)

I am talking about channel means and stds, such as in the code below:

IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)
IMAGENET_INCEPTION_MEAN = (0.5, 0.5, 0.5)
IMAGENET_INCEPTION_STD = (0.5, 0.5, 0.5)


def _cfg(url='', **kwargs):
    return {
        'url': url,
        'num_classes': 1000, 'input_size': (3, 224, 224), 'pool_size': None,
        'crop_pct': .9, 'interpolation': 'bicubic', 'fixed_input_size': True,
        'mean': IMAGENET_INCEPTION_MEAN, 'std': IMAGENET_INCEPTION_STD,
        'first_conv': 'patch_embed.proj', 'classifier': 'head',
        **kwargs
    }

If the first training was done on ImageNet, then I guess ImageNet channel means and stds are applied to AudioSet inputs when finetuning on that dataset, and also to ESC50 inputs when further finetuning on that one. Am I correct?
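For reference, this is roughly what the per-channel normalization described by that timm config does. A minimal sketch, assuming NumPy and an image already scaled to [0, 1]; the function name and shapes are illustrative, not from timm or this repo:

```python
import numpy as np

# Inception-style stats from the _cfg excerpt above.
IMAGENET_INCEPTION_MEAN = np.array([0.5, 0.5, 0.5])
IMAGENET_INCEPTION_STD = np.array([0.5, 0.5, 0.5])

def normalize_image(img):
    """Per-channel normalization: img is (H, W, 3), values in [0, 1]."""
    return (img - IMAGENET_INCEPTION_MEAN) / IMAGENET_INCEPTION_STD

# A mid-gray image maps to all zeros under these stats.
img = np.full((224, 224, 3), 0.5)
out = normalize_image(img)
```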

Again, I am trying to refactor your code to fit only the portion we need into our existing training scripts. But I don't see where those means and standard deviations are applied, whether in the dataset or in AugmentMel.

Thanks a lot (again)

Antoine

@Antoine101 (Author)

Bumping this, @kkoutini — not sure if you saw it.

How should ImageNet normalization statistics be cascaded down to MelSpec 1 channel inputs for downstream finetuning? Where is this applied in the code?

Many thanks

@kkoutini (Owner)

Hi Antoine,
I'm sorry I missed this issue.
The normalization is applied (hard-coded) here.
I think the stats were calculated on a subset of AudioSet. In my runs, I used the same spectrogram preprocessor for all datasets when fine-tuning.
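In case it helps to see the operation spelled out: a minimal sketch of the hard-coded spectrogram normalization described here, using the AudioSet-derived stats quoted later in this thread (mean = 4.5, std = 5). The function name is illustrative, not the repo's:

```python
import numpy as np

def normalize_mel(mel):
    # Hard-coded stats estimated on a subset of AudioSet (per this thread).
    return (mel - 4.5) / 5.0

# A spectrogram at the mean value normalizes to zero.
mel = np.full((128, 100), 4.5)  # dummy (n_mels, n_frames) spectrogram
normed = normalize_mel(mel)
```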

@Antoine101 (Author)

Thanks for getting back to me Khaled!

Ok I see!

So the first training is done on ImageNet with ImageNet statistics, then the ImageNet-pretrained model is finetuned on AudioSet using AudioSet statistics, correct?
So if I later finetune the AudioSet-finetuned model on another dataset, I should use mean=4.5 and std=5.

Are the two statistics (ImageNet's and Audioset's) not related in any way? Shouldn't ImageNet statistics have been propagated all the way down, aggregated from 3 channels to 1?

Finally, I see that you normalize after applying masks. Is this the correct way to do it?

I noted in your paper the following augmentations:

  • Two level mixup
  • Specaugment
  • Rolling
  • Random Gain

I struggle to understand the order in which everything goes.
I see my_mixup called after mel_forward in ex_esc50.py, although your paper says that waveforms are mixed.

I would expect the following steps: waveform loading -> waveform mixup -> mel (feature computation) -> augmentations

How is it really?

Many thanks

@kkoutini (Owner)

Yes, I think you can keep the same mean=4.5 and std=5 if you're using the same spectrogram module.

Finally, I see that you normalize after applying masks. Is this the correct way to do it?

Ah, I guess you may get improvements if you do the masking after normalizing.

Yes, the correct order is loading -> waveform mixup -> mel (feature computation) -> augmentations. The waveform mixing is done first in the dataset here, along with the waveform augmentations.
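The confirmed order can be sketched end to end. This is a hedged illustration only: the mel "computation" is a dummy placeholder (not the repo's preprocessor module), and the function names are mine. It shows waveform-level mixup happening before feature extraction, then normalization, then a SpecAugment-style mask:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_waveforms(x1, x2, lam):
    # Waveform-level mixup: convex combination of two raw signals.
    return lam * x1 + (1.0 - lam) * x2

def dummy_mel(wave, n_mels=128, n_frames=100):
    # Placeholder for the real spectrogram preprocessor.
    return np.resize(np.abs(wave), (n_mels, n_frames))

def time_mask(spec, width=8):
    # SpecAugment-style time masking, applied after feature computation.
    t0 = rng.integers(0, spec.shape[1] - width)
    spec = spec.copy()
    spec[:, t0:t0 + width] = 0.0
    return spec

w1 = rng.standard_normal(16000)
w2 = rng.standard_normal(16000)
mixed = mixup_waveforms(w1, w2, lam=0.7)          # 1. mixup on waveforms
spec = dummy_mel(mixed)                            # 2. mel features
spec = time_mask((spec - 4.5) / 5.0)               # 3. normalize, then mask
```

Normalizing before masking follows the suggestion above that masking after normalization may work better.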
