OK, confirmed that the run started hanging with this commit: 1eef336
The commit before this (a7fffd259566503e5de2fdbaa335dc4c5ed524ce) is good.
cc: @yugaoTT
@mywoodstock do we have an equivalent resnet50 on N150 CI? If that one doesn't hang, I don't see why TG hangs, since it's just a change within the matmul op?
@yugaoTT Yes, we have other tests with the same version of Resnet50; only this specific case of trace + 2cq on TG and TGG hangs deterministically. All others are good.
I tried the current main with your commit reverted, and that does not hang.
So something in the commit is definitely causing the hang, though I'm not sure what.
We were good until...
Up till this (LKG): https://github.com/tenstorrent/tt-metal/actions/runs/13456349605
the resnet model perf test was stable.
Occasionally the test_perf version with no trace or cqs would fail, but it was non-deterministic. Example: https://github.com/tenstorrent/tt-metal/actions/runs/13447053336/job/37574706436#step:10:113
But then... (2 weeks ago - afternoon of March 5)
Starting around the night of February 21st, we saw the perf trace 2cqs version hanging:
https://github.com/tenstorrent/tt-metal/actions/runs/13467326473/job/37635776939#step:10:119
This seems deterministic.
In fact, here are some runs which show the same deterministic hang:
The commit range is:
But even more! ... (Afternoon of March 5 - now)
At 12:15pm, March 6, I did a reboot of the modules on the perf TGG machine via the Galaxy Management UI. Gonna see if that helps stamp out any additional issues so we can continue looking at the previous issue.
Doing another run here: https://github.com/tenstorrent/tt-metal/actions/runs/13704106394
cc: @tenstorrent/metalium-developers-infra