You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Potential Inconsistencies in Spatial Resolution Tracking and Downsampling Assertions
Labels: bug module: Cosmos-Tokenizer
Description
I identified two potential inconsistencies in the EncoderFactorized module that may affect Cosmos-Tokenizer model behavior under specific configurations:
Observed Behavior:
The variable curr_res (tracking spatial resolution) is unconditionally halved after adding a downsampling layer, even when only temporal downsampling occurs. This might lead to:
Miscalculations of spatial resolution for subsequent attention modules
Attention mechanisms triggering at unintended resolutions
Example Scenario:
If a layer performs only temporal downsampling (no spatial compression), curr_res is still halved. This could cause later layers to incorrectly assume a lower spatial resolution than actually exists.
Observed Behavior:
The assertions num_spatial_downs <= num_resolutions and num_temporal_downs <= num_resolutions allow configurations that the implementation cannot fulfill.
Root Cause:
Downsampling layers are only added when i_level != num_resolutions - 1, making num_resolutions - 1 the actual maximum allowable downsamplings. The current assertions incorrectly permit num_resolutions, creating a silent failure risk.
Example Failure Case:
# Configurationnum_resolutions=3spatial_compression=8# Requires 3 spatial downsamples (log2(8)=3)patch_size=1# Current Assertion Passes (3 <= 3) but only 2 downsamples are possible
Thank you again @fenneishi . Could you please create a PR and confirm if these changes consistently support all video tokenizer configs? I.e. CV4x8x8, CV8x8x8, CV8x16x16, DV4x8x8, DV8x8x8, DV8x16x16?
Thank you again @fenneishi . Could you please create a PR and confirm if these changes consistently support all video tokenizer configs? I.e. CV4x8x8, CV8x8x8, CV8x16x16, DV4x8x8, DV8x8x8, DV8x16x16?
Thanks @fenneishi !! I will take a look. Does your PR also address the other issue created by you here: #103?
@fitsumreda This PR (#106) only addresses the issues discussed in #102. I will submit a separate PR for the issues described in #103 to keep the changes more focused and easier to review.
Potential Inconsistencies in Spatial Resolution Tracking and Downsampling Assertions
Labels: bug
module: Cosmos-Tokenizer
Description
I identified two potential inconsistencies in the
EncoderFactorized
module that may affect Cosmos-Tokenizer model behavior under specific configurations:1. Spatial Resolution Adjustment Logic
Code Location: Downsampling Loop Initialization
Observed Behavior:
The variable
curr_res
(tracking spatial resolution) is unconditionally halved after adding a downsampling layer, even when only temporal downsampling occurs. This might lead to:Example Scenario:
If a layer performs only temporal downsampling (no spatial compression),
curr_res
is still halved. This could cause later layers to incorrectly assume a lower spatial resolution than actually exists.Code Snippet:
Suggested Adjustment:
2. Mismatched Downsampling Assertion Limits
Code Location: Assertion Checks
Observed Behavior:
The assertions
num_spatial_downs <= num_resolutions
andnum_temporal_downs <= num_resolutions
allow configurations that the implementation cannot fulfill.Root Cause:
Downsampling layers are only added when
i_level != num_resolutions - 1
, makingnum_resolutions - 1
the actual maximum allowable downsamplings. The current assertions incorrectly permitnum_resolutions
, creating a silent failure risk.Example Failure Case:
Suggested Adjustment:
Why This Matters
These observations and suggested fixes need further verification and testing.
The text was updated successfully, but these errors were encountered: