
Potential Inconsistencies in Spatial Resolution Tracking and Downsampling Assertions #102

Closed · fenneishi opened this issue on Feb 12, 2025 · 5 comments
Labels: question (Further information is requested)


fenneishi commented Feb 12, 2025


Labels: bug
module: Cosmos-Tokenizer

Description

I identified two potential inconsistencies in the EncoderFactorized module that may affect Cosmos-Tokenizer model behavior under specific configurations:

1. Spatial Resolution Adjustment Logic

Code Location: Downsampling Loop Initialization

Observed Behavior:
The variable curr_res (tracking spatial resolution) is unconditionally halved after adding a downsampling layer, even when only temporal downsampling occurs. This might lead to:

  • Miscalculations of spatial resolution for subsequent attention modules
  • Attention mechanisms triggering at unintended resolutions

Example Scenario:
If a layer performs only temporal downsampling (no spatial compression), curr_res is still halved. This could cause later layers to incorrectly assume a lower spatial resolution than actually exists.
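For context, encoders of this family typically gate attention on the tracked resolution. A minimal sketch of the resulting drift, assuming a check along the lines of curr_res in attn_resolutions (the names and values here are illustrative, not taken from the repo):

attn_resolutions = [16]   # hypothetical config: add attention at 16x16
curr_res = 32             # tracked spatial resolution

# A temporal-only level (spatial_down=False) still halves curr_res today:
curr_res = curr_res // 2  # tracked: 16; true spatial resolution: still 32

# Later levels consult the tracked value, so attention fires one level early:
if curr_res in attn_resolutions:
    print("attention inserted while the real feature map is still 32x32")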

Code Snippet:

if i_level != self.num_resolutions - 1:
    spatial_down = i_level < self.num_spatial_downs
    temporal_down = i_level < self.num_temporal_downs
    down.downsample = CausalHybridDownsample3d(
        block_in,
        spatial_down=spatial_down,
        temporal_down=temporal_down,
    )
    curr_res = curr_res // 2  # Halved regardless of spatial_down

Suggested Adjustment:

if i_level != self.num_resolutions - 1:
    spatial_down = i_level < self.num_spatial_downs
    temporal_down = i_level < self.num_temporal_downs
    down.downsample = CausalHybridDownsample3d(
        block_in,
        spatial_down=spatial_down,
        temporal_down=temporal_down,
    )
    # Only adjust resolution when spatially downsampling
    if spatial_down:
        curr_res = curr_res // 2
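As a sanity check, a standalone trace of the fixed logic under an assumed configuration (the values here are illustrative) shows curr_res moving only when the feature map actually shrinks:

num_resolutions = 3
num_spatial_downs = 1
num_temporal_downs = 2
curr_res = 64

for i_level in range(num_resolutions):
    if i_level != num_resolutions - 1:
        spatial_down = i_level < num_spatial_downs
        if spatial_down:
            curr_res = curr_res // 2
    print(i_level, curr_res)
# i_level=0: spatial_down=True  -> 64 -> 32
# i_level=1: spatial_down=False -> stays 32 (temporal-only level)
# i_level=2: last level, no downsample layer is added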

2. Mismatched Downsampling Assertion Limits

Code Location: Assertion Checks

Observed Behavior:
The assertions num_spatial_downs <= num_resolutions and num_temporal_downs <= num_resolutions allow configurations that the implementation cannot fulfill.

Root Cause:
Downsampling layers are only added when i_level != num_resolutions - 1, so the loop can create at most num_resolutions - 1 of them. The current assertions permit num_resolutions, creating a silent-failure risk: a configuration can pass the check yet be impossible to realize.

Example Failure Case:

# Configuration
num_resolutions = 3
spatial_compression = 8  # Requires 3 spatial downsamples (log2(8)=3)
patch_size = 1

# Current assertion passes (3 <= 3), but only 2 downsamples are possible
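The gap can be reproduced with a standalone check (num_spatial_downs is derived via log2 of the spatial compression over the patch size, matching the comment above; nothing here imports the repo):

import math

num_resolutions = 3
spatial_compression = 8
patch_size = 1

# Spatial downsamples the config asks for: log2(8 / 1) = 3
num_spatial_downs = int(math.log2(spatial_compression // patch_size))

# The current assertion accepts this configuration:
assert num_spatial_downs <= num_resolutions      # 3 <= 3 passes

# But downsample layers exist only for i_level != num_resolutions - 1,
# so at most num_resolutions - 1 of them are ever built:
achievable = num_resolutions - 1                 # = 2
print(f"requested {num_spatial_downs}, achievable {achievable}")
# -> requested 3, achievable 2: the target compression is silently missed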

Suggested Adjustment:

assert self.num_spatial_downs <= self.num_resolutions - 1
assert self.num_temporal_downs <= self.num_resolutions - 1

Why This Matters

  • Model Accuracy: Incorrect spatial resolution tracking may degrade attention module performance.
  • Compression Reliability: Overly permissive assertions could silently fail to achieve target compression ratios.

These observations and suggested fixes need further verification and testing.

fenneishi (Author) commented:

The first issue (1. Spatial Resolution Adjustment Logic) also seems to exist in DecoderFactorized, but the potential issues there appear to be more complex. See "Potential Incorrect Upsampling Logic and Mismatched Assert Conditions in DecoderFactorized Module" for details.

@sophiahhuang added the question (Further information is requested) label on Feb 12, 2025
fitsumreda (Collaborator) commented:

Thank you again @fenneishi . Could you please create a PR and confirm if these changes consistently support all video tokenizer configs? I.e. CV4x8x8, CV8x8x8, CV8x16x16, DV4x8x8, DV8x8x8, DV8x16x16?

fenneishi (Author) commented:

> Thank you again @fenneishi. Could you please create a PR and confirm if these changes consistently support all video tokenizer configs? I.e. CV4x8x8, CV8x8x8, CV8x16x16, DV4x8x8, DV8x8x8, DV8x16x16?

PR submitted: #106
Ready for review 👋

fitsumreda (Collaborator) commented:

Thanks @fenneishi !! I will take a look. Does your PR also address the other issue created by you here: #103?

fenneishi (Author) commented:

> Thanks @fenneishi !! I will take a look. Does your PR also address the other issue created by you here: #103?

@fitsumreda This PR (#106) only addresses the issues discussed in #102. I will submit a separate PR for the issues described in #103 to keep the changes more focused and easier to review.
