
Potential Inconsistencies in Spatial Resolution Tracking and Downsampling Assertions #102

Closed · fenneishi opened this issue on Feb 12, 2025 · 5 comments
Labels: question (Further information is requested)


fenneishi commented Feb 12, 2025


Labels: bug
module: Cosmos-Tokenizer

Description

I identified two potential inconsistencies in the EncoderFactorized module that may affect Cosmos-Tokenizer model behavior under specific configurations:

1. Spatial Resolution Adjustment Logic

Code Location: Downsampling Loop Initialization

Observed Behavior:
The variable curr_res (tracking spatial resolution) is unconditionally halved after adding a downsampling layer, even when only temporal downsampling occurs. This might lead to:

  • Miscalculations of spatial resolution for subsequent attention modules
  • Attention mechanisms triggering at unintended resolutions

Example Scenario:
If a layer performs only temporal downsampling (no spatial compression), curr_res is still halved. This could cause later layers to incorrectly assume a lower spatial resolution than actually exists.
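For context, encoders of this family typically gate attention on the tracked resolution. A minimal sketch of the resulting drift, assuming a check along the lines of curr_res in attn_resolutions (the names and values here are illustrative, not taken from the repo):

attn_resolutions = [16]   # hypothetical config: add attention at 16x16
curr_res = 32             # tracked spatial resolution

# A temporal-only level (spatial_down=False) still halves curr_res today:
curr_res = curr_res // 2  # tracked: 16; true spatial resolution: still 32

# Later levels consult the tracked value, so attention fires one level early:
if curr_res in attn_resolutions:
    print("attention inserted while the real feature map is still 32x32")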

Code Snippet:

if i_level != self.num_resolutions - 1:
    spatial_down = i_level < self.num_spatial_downs
    temporal_down = i_level < self.num_temporal_downs
    down.downsample = CausalHybridDownsample3d(
        block_in,
        spatial_down=spatial_down,
        temporal_down=temporal_down,
    )
    curr_res = curr_res // 2  # Halved regardless of spatial_down

Suggested Adjustment:

if i_level != self.num_resolutions - 1:
    spatial_down = i_level < self.num_spatial_downs
    temporal_down = i_level < self.num_temporal_downs
    down.downsample = CausalHybridDownsample3d(
        block_in,
        spatial_down=spatial_down,
        temporal_down=temporal_down,
    )
    # Only adjust resolution when spatially downsampling
    if spatial_down:
        curr_res = curr_res // 2
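As a sanity check, a standalone trace of the fixed logic under an assumed configuration (the values here are illustrative) shows curr_res moving only when the feature map actually shrinks:

num_resolutions = 3
num_spatial_downs = 1
num_temporal_downs = 2
curr_res = 64

for i_level in range(num_resolutions):
    if i_level != num_resolutions - 1:
        spatial_down = i_level < num_spatial_downs
        if spatial_down:
            curr_res = curr_res // 2
    print(i_level, curr_res)
# i_level=0: spatial_down=True  -> 64 -> 32
# i_level=1: spatial_down=False -> stays 32 (temporal-only level)
# i_level=2: last level, no downsample layer is added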

2. Mismatched Downsampling Assertion Limits

Code Location: Assertion Checks

Observed Behavior:
The assertions num_spatial_downs <= num_resolutions and num_temporal_downs <= num_resolutions allow configurations that the implementation cannot fulfill.

Root Cause:
Downsampling layers are only added when i_level != num_resolutions - 1, so the loop can create at most num_resolutions - 1 of them. The current assertions permit num_resolutions, creating a silent-failure risk: a configuration can pass the check yet be impossible to realize.

Example Failure Case:

# Configuration
num_resolutions = 3
spatial_compression = 8  # Requires 3 spatial downsamples (log2(8)=3)
patch_size = 1

# Current assertion passes (3 <= 3), but only 2 downsamples are possible
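The gap can be reproduced with a standalone check (num_spatial_downs is derived via log2 of the spatial compression over the patch size, matching the comment above; nothing here imports the repo):

import math

num_resolutions = 3
spatial_compression = 8
patch_size = 1

# Spatial downsamples the config asks for: log2(8 / 1) = 3
num_spatial_downs = int(math.log2(spatial_compression // patch_size))

# The current assertion accepts this configuration:
assert num_spatial_downs <= num_resolutions      # 3 <= 3 passes

# But downsample layers exist only for i_level != num_resolutions - 1,
# so at most num_resolutions - 1 of them are ever built:
achievable = num_resolutions - 1                 # = 2
print(f"requested {num_spatial_downs}, achievable {achievable}")
# -> requested 3, achievable 2: the target compression is silently missed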

Suggested Adjustment:

assert self.num_spatial_downs <= self.num_resolutions - 1
assert self.num_temporal_downs <= self.num_resolutions - 1

Why This Matters

  • Model Accuracy: Incorrect spatial resolution tracking may degrade attention module performance.
  • Compression Reliability: Overly permissive assertions could silently fail to achieve target compression ratios.

These observations and suggested fixes need further verification and testing.

fenneishi (Author) commented:

The first issue (1. Spatial Resolution Adjustment Logic) also seems to exist in DecoderFactorized, but the potential issues there appear to be more complex. See "Potential Incorrect Upsampling Logic and Mismatched Assert Conditions in DecoderFactorized Module" for details.

@sophiahhuang added the question (Further information is requested) label on Feb 12, 2025
fitsumreda (Collaborator) commented:

Thank you again @fenneishi . Could you please create a PR and confirm if these changes consistently support all video tokenizer configs? I.e. CV4x8x8, CV8x8x8, CV8x16x16, DV4x8x8, DV8x8x8, DV8x16x16?

fenneishi (Author) commented:

> Thank you again @fenneishi. Could you please create a PR and confirm if these changes consistently support all video tokenizer configs? I.e. CV4x8x8, CV8x8x8, CV8x16x16, DV4x8x8, DV8x8x8, DV8x16x16?

PR submitted: #106
Ready for review 👋

fitsumreda (Collaborator) commented:

Thanks @fenneishi !! I will take a look. Does your PR also address the other issue created by you here: #103?

fenneishi (Author) commented:

> Thanks @fenneishi !! I will take a look. Does your PR also address the other issue created by you here: #103?

@fitsumreda This PR (#106) only addresses the issues discussed in #102. I will submit a separate PR for the issues described in #103 to keep the changes more focused and easier to review.
