Causal Conv subsampling causes 1 extra frame in length in outputs #8977
For example, audio_signal has shape [B, 2192, 80] (80 = fbank dim). After the Conformer pre_encode (causal_downsampling=True, subsampling="dw_striding"), the signal becomes [B, 275, 256] (256 = model_dim). With a model stride of 8, the length should be 2192/8 = 274 instead of 275. I see that the max length in the batch according to encoded_len is still calculated as 274 (since it's based on the input length and the stride). I wonder where this extra frame comes from (e.g. from left padding or right padding), and whether it affects results negatively. I'm assuming the output will be truncated on the right side using encoded_len, but what about truncating on the left? I also see that with causal_downsampling there's no left padding, as opposed to the non-causal conv2d layers.
Replies: 1 comment 3 replies
Thanks for your great observation. We also noticed this recently. It is due to the adjusted left padding and truncated right padding in the subsampling module, which are used to mimic causal convolution with conv2d layers. This adds one extra frame to the output length, which is a mismatch with the FastConformer 8x subsampling module.
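To make the length arithmetic concrete, here is a minimal sketch of how the padding choice could change the output length. It is not the actual subsampling code: the helper names are hypothetical, and the settings (three conv layers, kernel size 3, stride 2, causal padding of kernel_size - 1 on the left and stride - 1 on the right, symmetric padding of (kernel_size - 1) // 2 otherwise) are assumptions that are not stated in this thread. Only the standard convolution output-length formula is used.

```python
def conv_out_len(length, kernel_size, stride, pad_left, pad_right):
    # Standard output length of a strided convolution along one axis (floor mode).
    return (length + pad_left + pad_right - kernel_size) // stride + 1


def subsampled_len(length, causal, kernel_size=3, stride=2, num_layers=3):
    # Hypothetical helper: all padding values below are assumptions for illustration.
    if causal:
        # Assumed causal scheme: more padding on the left, less on the right.
        pad_left, pad_right = kernel_size - 1, stride - 1
    else:
        # Assumed non-causal scheme: symmetric padding.
        pad_left = pad_right = (kernel_size - 1) // 2
    for _ in range(num_layers):
        length = conv_out_len(length, kernel_size, stride, pad_left, pad_right)
    return length


print(subsampled_len(2192, causal=False))  # 274
print(subsampled_len(2192, causal=True))   # 275
```

Under these assumptions, each causal layer receives one more padded frame in total than a non-causal layer (3 versus 2), so every layer yields floor(L/2) + 1 frames. For an input of 2192 frames, that is where the 275 versus 274 difference would come from.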