[BH] Resnet block in SD 1.4 on BH pcc issues #18650
Comments
@cmaryanTT we need support from the Ops team to look at this.
@s-jovic I'm assuming that if it's blocking resnet, this is a P0?
@ntarafdar this is blocking SD, but yes, P0.
Yeah, but I don't have a smaller repro yet. I will try to get one tomorrow.
Hey @ntarafdar, just an update for now: I am not able to reproduce this in a simple reshard unit test. It turns out that something corrupts the memory, so whichever way I try to access the tensor, I get PCC issues. I am trying to localize which op is responsible. For now I suspect the bcast op, since there are no issues if I replace it with the host version.
@s-jovic I was able to pass the test on the first param:
I am using I did some more digging, and it turns out the tensor that needs to be resharded is corrupted after the ttnn.silu operation. I have repro on
The test asserts when checking the input tensor after an unrelated op, ttnn.silu, runs on a different tensor. If we check the input tensor before this op, it's fine. If we swap the ttnn op for the torch.nn.silu op, input_tensor remains fine. This looks like memory corruption, but I will need to narrow it down further, since the whole memory footprint before silu affects the issue: commenting out random ops that run before it makes the issue go away.
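For anyone trying to narrow this down, the check described above (snapshot the watched tensor, run each op, re-read the tensor) can be sketched as a small host-side harness. This is only an illustration under assumptions: `find_first_corrupting_step`, `steps`, and `read_watched` are hypothetical names, NumPy arrays stand in for device tensors, and in a real repro `read_watched` would read the tensor back from device.

```python
import numpy as np


def find_first_corrupting_step(steps, read_watched):
    """Return the name of the first step after which the watched buffer
    no longer matches its initial snapshot, or None if it never changes.

    steps:        list of (name, callable) pairs, run in order
    read_watched: callable returning the current contents of the watched
                  buffer (hypothetical stand-in for a device readback)
    """
    # Copy the initial contents so later writes cannot alter the reference.
    reference = np.array(read_watched(), copy=True)
    for name, step in steps:
        step()
        current = np.asarray(read_watched())
        # Bit-exact comparison: none of these steps should touch the buffer.
        if not np.array_equal(current, reference, equal_nan=True):
            return name
    return None


if __name__ == "__main__":
    watched = np.ones(8, dtype=np.float32)
    other = np.zeros(8, dtype=np.float32)

    def well_behaved():
        # Writes only to its own buffer.
        other[:] = 2.0

    def stomping():
        # Simulates an op that overwrites memory it does not own.
        watched[3] = -5.0

    steps = [("op_a", well_behaved), ("op_b", stomping), ("op_c", well_behaved)]
    print(find_first_corrupting_step(steps, lambda: watched))
```

Since the issue here depends on the whole memory footprint, removing steps from the list can hide the corruption, so the full op sequence should stay in place while checking.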
@umadevimcw @KalaivaniMCW @Aswinmcw @patrickroberts @dchenTT - FYI - there may be an issue with the unary infra for BH. @KalaivaniMCW - can you please coordinate with Uma and Aswin to test this out and see if you can identify the issue? @llongTT - please advise if you have any further insight into this.
The silu unit test is added to the branch
Parent issue: #17725
For the first parametrization in the resnet block test, we get 0.0xx PCC. I localized the issue to the call of the reshard op. I caught the issue with comparison mode.
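For context on the number above: PCC here is the Pearson correlation coefficient between the reference output and the device output, so a value near 0 means the two are essentially unrelated, while a passing test sits very close to 1. A minimal NumPy sketch of the metric follows. The helper name `pcc` and the handling of constant inputs are assumptions for illustration, not the comparison helper the test suite actually uses.

```python
import numpy as np


def pcc(expected, actual):
    """Pearson correlation coefficient between two same-shaped arrays."""
    a = np.asarray(expected, dtype=np.float64).ravel()
    b = np.asarray(actual, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError("expected and actual must have the same number of elements")

    # Center both signals, then normalize their dot product.
    a_c = a - a.mean()
    b_c = b - b.mean()
    denom = np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())
    if denom == 0.0:
        # Assumption: treat identical constant arrays as perfectly correlated
        # and any other zero-variance case as uncorrelated.
        return 1.0 if np.array_equal(a, b) else 0.0
    return float((a_c * b_c).sum() / denom)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    golden = rng.standard_normal(10_000)
    print(pcc(golden, golden))                            # identical data
    print(pcc(golden, rng.standard_normal(10_000)))       # unrelated data
```

A tensor overwritten with unrelated data produces a PCC close to zero, which is consistent with the 0.0xx reading reported here.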
I am adding a workaround to my branch sjovic/sd-bh-patches, which basically gets the tensor to host and then pushes it to device in the needed memory config, since using a DRAM memory config as an intermediate step on device doesn't help. I am working on getting a smaller repro for the TM team to look at.