[BH] Resnet block in SD 1.4 on BH pcc issues #18650

Open
s-jovic opened this issue Mar 5, 2025 · 9 comments

Comments

@s-jovic
Contributor

s-jovic commented Mar 5, 2025

Parent issue: #17725

For the first parametrization in the resnet block test, we get 0.0xx PCC. I localized the issue to the call of the reshard op. I caught the issue with comparison mode.
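For context, PCC in these tests is the Pearson correlation coefficient between the device output and a reference output. A minimal sketch in plain Python, assuming both tensors have already been flattened to lists of floats; the helper name `pcc` and the sample values are illustrative and are not taken from the test code:

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(expected) != len(actual) or not expected:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(expected)
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        # Correlation is undefined when either input is constant.
        return float("nan")
    return cov / math.sqrt(var_e * var_a)


# Illustrative data only (assumption, not real model output).
reference = [0.1, 0.4, -0.3, 0.9, -0.7, 0.2]
# Small alternating numeric error on top of the reference.
close = [x + (0.001 if i % 2 else -0.001) for i, x in enumerate(reference)]
# Same values as the reference, but in the wrong positions.
scrambled = [0.9, -0.7, 0.2, 0.1, 0.4, -0.3]

print(pcc(reference, close))
print(pcc(reference, scrambled))
```

A value close to 1.0 means the outputs agree up to small numeric error. A value near zero, like the 0.0xx reported here, means the two tensors are essentially uncorrelated, which points to wrong data rather than precision loss.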

I am adding a workaround to my branch sjovic/sd-bh-patches that moves the tensor to host and then pushes it back to the device in the required memory config, since using a DRAM memory config as an intermediate step on device doesn't help.

I am working on getting a smaller repro for the TM team to look at.

@ejouretTT

@cmaryanTT we need support from the Ops team to look at this.

@ntarafdar
Contributor

@s-jovic I'm assuming that if it's blocking resnet, this is a P0?

@cmaryanTT

@ntarafdar this is blocking SD, but yes, P0

@s-jovic
Contributor Author

s-jovic commented Mar 5, 2025

Yeah, but I don't have a smaller repro yet; I will try to get one tomorrow.

@s-jovic
Contributor Author

s-jovic commented Mar 6, 2025

Hey @ntarafdar, just an update for now: I am not able to reproduce this in a simple reshard unit test. It turns out that something corrupts the memory, so whichever way I try to access the tensor, I get PCC issues. I am trying to localize which op is responsible. For now I suspect the bcast op here, since there are no issues if I replace it with a host version.

s-jovic changed the title from "[BH] Resnet block in SD 1.4 on BH pcc issues due to reshard op" to "[BH] Resnet block in SD 1.4 on BH pcc issues" on Mar 6, 2025

@llongTT
Contributor

llongTT commented Mar 6, 2025

@s-jovic I was able to pass the test on first param:
PASSED tests/nightly/single_card/stable_diffusion/test_resnet_block_2d.py::test_resnet_block_2d_512x512[2-320-64-64-memory_layout0-buffer_type0-shard_end_core0-shard_shape0-320-False-down-0-0-device_params0]
============================================================================================= 1 passed, 2 warnings in 107.29s (0:01:47) ==============================================================================================
Device | INFO | Closing user mode device drivers

@s-jovic
Contributor Author

s-jovic commented Mar 7, 2025

I am currently using the sjovic/sd-bh-patches branch for BH SD, since I still need some workarounds that are not on main yet. The test on main uses different configs somewhere in the middle of the resnet block, so this issue is not hit there.

I did some more digging, and it turns out the tensor that needs to be resharded is corrupted after the ttnn.silu operation. I have a repro on the sjovic/sd-resnet-pcc branch, which only has changes in the Python model code, to rule out any unrelated workaround used on sjovic/sd-bh-patches. Repro:

$ pytest models/demos/wormhole/stable_diffusion/tests/test_resnet_block_2d.py

The test asserts when checking the input tensor after the unrelated op ttnn.silu runs on a different tensor. If we check the input tensor before this op, it's fine. If we swap the ttnn op for the torch.nn.silu op, input_tensor remains fine. This looks like memory corruption, but I will need to narrow it down further, since the whole memory footprint built up before silu affects the issue: commenting out random ops that run before it makes the issue go away.
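The check-before and check-after approach described above can be sketched as a generic guard pattern: snapshot the tensor that should stay untouched, run each op in order, and re-verify after every step to find the first op that clobbers it. The sketch below is a host-side simulation in plain Python. The flat `memory` list, the `make_op` helper, and the deliberately buggy op are all hypothetical stand-ins; none of this uses the ttnn API.

```python
import hashlib
import struct


def checksum(memory, start, size):
    """Hash a region of the simulated memory so it can be compared later."""
    packed = struct.pack(f"{size}d", *memory[start:start + size])
    return hashlib.sha256(packed).hexdigest()


def first_corrupting_op(memory, guarded, ops):
    """Run ops in order and return the name of the first op that changes
    the guarded region, or None if the region survives every op."""
    start, size = guarded
    baseline = checksum(memory, start, size)
    for name, op in ops:
        op(memory)
        if checksum(memory, start, size) != baseline:
            return name
    return None


def make_op(out_start, out_size, value):
    """Hypothetical op that writes `value` into its output region."""
    def op(memory):
        for i in range(out_start, out_start + out_size):
            memory[i] = value
    return op


# Simulated memory with 32 slots; the guarded tensor lives in slots 8..15.
memory = [0.0] * 32
for i in range(8, 16):
    memory[i] = float(i)

ops = [
    ("op_a", make_op(0, 8, 1.0)),      # writes slots 0..7, in bounds
    ("op_b", make_op(16, 8, 2.0)),     # writes slots 16..23, in bounds
    ("buggy_op", make_op(4, 8, 3.0)),  # writes slots 4..11, hits the guarded tensor
    ("op_c", make_op(24, 8, 4.0)),     # never reached in this run
]

culprit = first_corrupting_op(memory, guarded=(8, 8), ops=ops)
print(culprit)
```

In the real model the guarded region would be the input tensor read back to host after each op. Because removing earlier ops changes the memory footprint and can hide the problem, a check like this is most useful when the op sequence is left intact.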

@cmaryanTT

@umadevimcw @KalaivaniMCW @Aswinmcw @patrickroberts @dchenTT - FYI - there may be an issue with the unary infra for BH. @KalaivaniMCW - can you please coordinate with Uma and Aswin to test this out and see if you can identify the issue?

@llongTT - please advise if you have any further insight into this.

@s-jovic
Contributor Author

s-jovic commented Mar 7, 2025

The silu unit test has been added to the sjovic/sd-resnet-pcc branch so that the dimensions and configs are easier to fetch, but it passes on its own. The issue in the resnet block is that the silu op overwrites another tensor in memory.

$ pytest models/demos/wormhole/stable_diffusion/tests/test_silu.py
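One way to confirm that an op's output lands on top of another live tensor is to compare the address ranges of the buffers involved. A minimal sketch, assuming each buffer can be described by a start address and a size in bytes. The `Buffer` type, the names, and the addresses are made up for illustration; per-core sharding is ignored, and how to obtain real buffer addresses from ttnn is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    name: str
    address: int  # start address in bytes (hypothetical value)
    size: int     # size in bytes

    @property
    def end(self):
        # One past the last byte owned by this buffer.
        return self.address + self.size


def find_overlaps(buffers):
    """Return (name, name) pairs for every two buffers whose ranges intersect."""
    ordered = sorted(buffers, key=lambda b: b.address)
    hits = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.address >= a.end:
                # Sorted by address, so no later buffer can overlap `a`.
                break
            hits.append((a.name, b.name))
    return hits


# Illustrative layout only (assumption, not measured on hardware).
buffers = [
    Buffer("input_tensor", 0x1000, 0x800),  # occupies 0x1000..0x17FF
    Buffer("silu_output", 0x1400, 0x800),   # occupies 0x1400..0x1BFF
    Buffer("weights", 0x2000, 0x400),       # occupies 0x2000..0x23FF
]

overlapping = find_overlaps(buffers)
print(overlapping)
```

An overlap between a live input and a freshly allocated output would be consistent with a unit test that passes in isolation but corrupts data inside the full block, where more tensors are resident at the same time.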


6 participants