[Release v0.56 blocking] Investigate Resnet50 regressions in both TG + TGG model perf pipelines #18724

Open
tt-rkim opened this issue Mar 6, 2025 · 3 comments
Assignees
Labels
ci-bug bugs found in CI CNN_bug P0

Comments

@tt-rkim
Collaborator

tt-rkim commented Mar 6, 2025

We were good until...

The resnet model perf test was stable up to and including this run (LKG, last known good): https://github.com/tenstorrent/tt-metal/actions/runs/13456349605

Occasionally the test_perf version with no trace or cqs would fail, but it was non-deterministic. Example: https://github.com/tenstorrent/tt-metal/actions/runs/13447053336/job/37574706436#step:10:113

But then... (2 weeks ago - afternoon of March 5)

Starting around the night of February 21st, we saw the perf trace 2cqs version hanging:

https://github.com/tenstorrent/tt-metal/actions/runs/13467326473/job/37635776939#step:10:119

This seems deterministic.

In fact, here are some runs which show the same deterministic hang:

The commit range is:

But even more! ... (Afternoon of March 5 - now)

At 12:15pm, March 6, I did a reboot of the modules on the perf TGG machine via the Galaxy Management UI. Gonna see if that helps stamp out any additional issues so we can continue looking at the previous issue.

Doing another run here: https://github.com/tenstorrent/tt-metal/actions/runs/13704106394

cc: @tenstorrent/metalium-developers-infra

@mywoodstock
Contributor

OK, confirmed that the hang starts with this commit: 1eef336
The commit before it (a7fffd259566503e5de2fdbaa335dc4c5ed524ce) is good.
cc: @yugaoTT
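
For reference, narrowing a deterministic hang down to a first bad commit like this can be automated. Below is a minimal sketch, not the procedure actually used in this thread: it assumes a hang can be detected as "the test command did not finish within a timeout", and it binary-searches an ordered commit list whose first entry is known good and whose last entry is known bad. The names `run_hangs` and `first_bad_index` and the placeholder commit labels are hypothetical.

```python
import subprocess
import sys


def run_hangs(cmd, timeout_s):
    """Return True if cmd is still running after timeout_s seconds.

    Assumption: a run that exceeds the timeout is treated as a hang.
    subprocess.run kills the child before raising TimeoutExpired.
    """
    try:
        subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return True
    return False


def first_bad_index(commits, is_bad):
    """Binary-search for the first bad commit.

    commits is ordered oldest to newest, commits[0] is known good and
    commits[-1] is known bad. is_bad(commit) must be deterministic,
    which matches the deterministic hang described above.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Placeholder commit labels and a fake predicate, for illustration only.
    commits = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
    idx = first_bad_index(commits, lambda c: int(c[1:]) >= 5)
    print("first bad commit:", commits[idx])
```

In practice `is_bad` would check out the commit, rebuild, and call `run_hangs` on the perf test command with a generous timeout. `git bisect run` does the same search natively, as long as the test script turns a hang into a non-zero exit code.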

@yugaoTT
Contributor

yugaoTT commented Mar 6, 2025

@mywoodstock do we have an equivalent resnet50 on N150 CI? If that one doesn't hang, I don't see why TG hangs, since it's just a change within the matmul op?

@mywoodstock
Contributor

mywoodstock commented Mar 6, 2025

@yugaoTT Yes, we have other tests with the same version of Resnet50; only this specific case of trace + 2cq on TG and TGG hangs deterministically. All others are good.
I tried the current main with your commit reverted, and that does not hang.
So something in the commit is definitely causing the hang, though I'm not sure what.
