Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#0: Improve error messages in Tensor storage classes #18321

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

jjiangTT
Copy link
Contributor

@jjiangTT jjiangTT commented Feb 25, 2025

Ticket

Link to Github Issue: N/A

Problem description

An error was a little unclear and potentially misleading and some breaking errors were marked as assert rather than fatal.

What's changed

Reworded and changed a spec assert to a TT_FATAL making the problem more clear for users who run into this error in the future.
Changed some buffer checks to FATAL.

Checklist

@@ -200,7 +200,7 @@ struct MultiDeviceHostStorage {

TensorSpec get_tensor_spec(int spec_index) const {
std::lock_guard<std::mutex> lock(mtx);
TT_ASSERT(spec_index < specs.size(), "Buffer not found for device {}", spec_index);
TT_ASSERT(spec_index < specs.size(), "Spec for device {} not found in spec list, were buffer and spec added?", spec_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be the exception?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be TT_FATAL. I'd also argue "were buffer and spec added?" is more confusing for the end user to see than just a failing assertion.

Raises more questions unnecessarily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming by exception Denys means a TT_THROW, I personally think a FATAL would be more clear. Both would be preferable to the existing assert though.

@jjiangTT jjiangTT changed the title jjiang/spec error improvements jjiang/clarify-spec-assert Feb 26, 2025
@jjiangTT jjiangTT changed the title jjiang/clarify-spec-assert jjiang/clarify_spec_assert Feb 26, 2025
@jjiangTT jjiangTT changed the title jjiang/clarify_spec_assert jjiang/spec_error_improvements Feb 26, 2025
@jjiangTT jjiangTT force-pushed the jjiang/spec_error_improvements branch from f87408a to 3b83b09 Compare February 26, 2025 19:25
@@ -200,7 +200,7 @@ struct MultiDeviceHostStorage {

TensorSpec get_tensor_spec(int spec_index) const {
std::lock_guard<std::mutex> lock(mtx);
TT_ASSERT(spec_index < specs.size(), "Buffer not found for device {}", spec_index);
TT_FATAL(spec_index < specs.size(), "Spec for device {} not found in spec list", spec_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can make the same change in other places in the file, for consistency

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good!

@jjiangTT jjiangTT force-pushed the jjiang/spec_error_improvements branch from 3b83b09 to 58e542a Compare February 27, 2025 15:47
@@ -325,7 +325,7 @@ struct MultiDeviceStorage {

inline std::shared_ptr<Buffer> get_buffer_for_device(IDevice* device) const {
std::lock_guard<std::mutex> lock(buffer_mtx);
TT_ASSERT(buffers.find(device->id()) != buffers.end(), "Buffer not found for device {}", device->id());
TT_FATAL(buffers.find(device->id()) != buffers.end(), "Buffer not found for device {}", device->id());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this one and below, you are performing redundant lookups in an unordered map. This involves hashing the key, and potentially iterating over a linked list of keys that hashes to the same value.

To avoid this, you can do:

auto buffer_it = buffers.find(device->id();
TT_FATAL(buffer_it != buffers.end(), ...)
// Perhaps we can make this TT_FATAL as well, but let's leave for now.
TT_ASSERT(...)
return buffer_it->second;

@omilyutin-tt
Copy link
Contributor

Also jjiang/spec_error_improvements can be renamed to #0: Improve error messages in Tensor storage classes (#0 as there is no issue number associated with this clean up)

@jjiangTT jjiangTT changed the title jjiang/spec_error_improvements #0: Improve error messages in Tensor storage classes Feb 27, 2025
@jjiangTT jjiangTT force-pushed the jjiang/spec_error_improvements branch from 0b7f743 to 8a9e4b1 Compare March 3, 2025 17:26
@jjiangTT jjiangTT requested a review from omilyutin-tt March 3, 2025 17:27
}

inline std::shared_ptr<Buffer>& get_buffer_for_device(IDevice* device) {
std::lock_guard<std::mutex> lock(buffer_mtx);
TT_ASSERT(buffers.find(device->id()) != buffers.end(), "Buffer not found for device {}", device->id());
TT_FATAL(buffers.find(device->id()) != buffers.end(), "Buffer not found for device {}", device->id());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the same here and below in get_tensor_spec_for_device

TT_ASSERT(
buffers.at(device->id())->device() == device,
buffer_it->device() == device,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer_it->second->device() I believe? TT_ASSERT is only compiled in debug mode, try to build with --debug flag.

@jjiangTT jjiangTT force-pushed the jjiang/spec_error_improvements branch 2 times, most recently from cf9d42e to 4c9ed77 Compare March 3, 2025 19:20
@jjiangTT jjiangTT requested a review from omilyutin-tt March 3, 2025 22:45
@jjiangTT jjiangTT force-pushed the jjiang/spec_error_improvements branch 5 times, most recently from 8eea513 to 81d6f70 Compare March 6, 2025 18:47
@jjiangTT jjiangTT force-pushed the jjiang/spec_error_improvements branch from 81d6f70 to 4eeb969 Compare March 6, 2025 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants