Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1D Fabric launch via device init #18712

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

aagarwalTT
Copy link
Contributor

@aagarwalTT aagarwalTT commented Mar 6, 2025

Ticket

Link to Github Issue

Problem description

1D Fabric (EDM Fabric) is currently launched as a subdevice. This change moves the initialization of 1D Fabric to device init.

What's changed

  • Moved 1D Fabric implementation and relevant headers/files to metal from ttnn
  • Added the logic for launching 1D fabric during device init and corresponding sync and teardown sequence
  • Moved the 1D fabric namespace to the existing tt::tt_fabric namespace and got rid of tt::fabric

Things to be covered in follow-up PR(s):

  • Migrating the tests to use the new init flow and not subdevice
  • Renaming from EDM to 1D fabric wherever applicable

Checklist

@aagarwalTT aagarwalTT changed the title Aagarwal/1d fabric device init 1d fabric device init Mar 6, 2025
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question: do we need dedicated/unique sender and receiver test kernels between 1D and 2D fabric if they aren't directly interacting with fabric? Can the 1D just be a simplified version (e.g. some compile time args to disable mcasting in 4 directions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can have common kernels for testing once we have some abstraction layer that unifies the two? currently the setup is bit different, so I created a new file to avoid confusion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a git mv

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was initially, but then I split it to move the LineOpInterface into a helper file in ttnn itself

Copy link
Contributor

@SeanNijjar SeanNijjar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some initial comments. I'm not familiar with topology so I'll spend a little time later looking at it but would obviously defer to the other experts on the team for that design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git mv

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to omit all the reformatting from this PR? It'll simplify review and keep it more incremental

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didnt update the clang format file, so I think this is the result of that. I will see if this can be reverted.

FORCE_INLINE bool all_eth_packets_completed() const {
return this->local_rdptr.is_caught_up_to(this->local_wrptr);
[[nodiscard]] FORCE_INLINE bool connection_is_live() const {
return *connection_live_semaphore == tt::fabric::EdmToEdmSender<0>::open_connection_value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this grabbed from a branch of mine or something? Unsure why it's needed in the PR but I think I did make this change elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how this change made it here, will have to dig a bit. I certainly did not pull anything from your branch though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that the order of functions was changed probably because of clang formatting

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share the reason we need this here? What's the purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping this in ttnn would keep the current tests intact. We do not want this in metal since the fabric init has moved to device init. So until we port the tests, this can exist in ttnn as a helper to launch 1d fabric using sub device

@aagarwalTT aagarwalTT changed the title 1d fabric device init 1D Fabric launch via device init Mar 7, 2025
@aagarwalTT aagarwalTT marked this pull request as ready for review March 7, 2025 01:15
@aagarwalTT aagarwalTT force-pushed the aagarwal/1d-fabric-device-init branch from 87e09c3 to 265a94d Compare March 7, 2025 01:52

// Device ID -> EDM Builders
std::unordered_map<size_t, std::vector<tt::tt_fabric::FabricEriscDatamoverBuilder>> edm_builders_forward_direction;
std::unordered_map<size_t, std::vector<tt::tt_fabric::FabricEriscDatamoverBuilder>> edm_builders_backward_direction;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible to make private?

Copy link
Contributor Author

@aagarwalTT aagarwalTT Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a good chance that this file will be deprecated soon once the tests are migrated to the new init flow

Comment on lines +108 to +118
// Device ID -> link index
std::unordered_map<size_t, size_t> next_forward_direction_edm_available;
std::unordered_map<size_t, size_t> next_backward_direction_edm_available;

std::vector<tt::tt_metal::IDevice*> device_sequence;
std::vector<tt::tt_metal::Program*> programs;

size_t num_links;
size_t buffer_size_bytes;
size_t firmware_context_switch_interval =
tt::tt_fabric::FabricEriscDatamoverBuilder::default_firmware_context_switch_interval;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add suffix _ for data members

Comment on lines +115 to +116
size_t num_links;
size_t buffer_size_bytes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initialize to 0 to avoid UB

Comment on lines +83 to +88
for (size_t i = 0; i < device_sequence.size(); i++) {
if (device_sequence[i] == device) {
return i;
}
}
TT_THROW("Device {} not found in device sequence of line fabric", device->id());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible to move to.cpp?

@aagarwalTT aagarwalTT force-pushed the aagarwal/1d-fabric-device-init branch from 95af811 to a97630c Compare March 7, 2025 14:51
if (routing_direction == direction) {
return eth_chans;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be safer to assert here instead? Or are callers of get_active_fabric_eth_channels_in_direction responsible for checking for no eth channels in direction

@@ -83,4 +85,12 @@ class FabricFixture : public ::testing::Test {
}
};

class FabricFixture : public BaseFabricFixture {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we call this Fabric2DFixture?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants