Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#18006: Add push router support to tt-fabric. #18564

Merged
merged 1 commit into from
Mar 7, 2025

Conversation

ubcheema
Copy link
Contributor

@ubcheema ubcheema commented Mar 3, 2025

#18006

Adding support for push buffers in TT-Fabric routers.

@nhuang-tt
Copy link
Member

How does the programming model change when you use the push method in this? Do we need need to allocate space for the headers?
Right now i need to put 48B before all my data to make space for the header

@ubcheema ubcheema force-pushed the ucheema/tt-fabric-push-router-support branch from 6394ef9 to 865a5cd Compare March 3, 2025 19:00
@ubcheema
Copy link
Contributor Author

ubcheema commented Mar 3, 2025

How does the programming model change when you use the push method in this? Do we need need to allocate space for the headers? Right now i need to put 48B before all my data to make space for the header

It is the same in this PR. You need to allocate packet header space in the data buffer.
I will add "no inline header" support soon. This will allow you to setup a header memory separately from the payload.

@@ -37,7 +40,649 @@ inline uint64_t get_timestamp() {
return (((uint64_t)timestamp_high) << 32) | timestamp_low;
}

typedef struct fvc_consumer_state {
typedef struct fvc_outbound_push_state {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about having base struct and inherit for both push and pull. There are same variables and copy&paste. and also #ifdef around code could be removed by having big #ifdef in this file.

#ifdef FVC_MODE_PULL
fvc_outbound_pull_state_t fvc_outbound_state __attribute__((aligned(16))); // replicate for each fvc
#else
// fvc_inbound_push_state_t fvc_inbound_state;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

FABRIC_ROUTER_DATA_BUF_START + (fvc_data_buf_size_words * PACKET_WORD_SIZE_BYTES / 2),
fvc_data_buf_size_words / 2);
#else
fvc_outbound_state[0].init(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be for loop

Comment on lines 184 to 188
router_addr += direction * 8;
// stream register to receive router buffer space available updates.
uint64_t xy_local_addr = get_noc_addr(0);
noc_inline_dw_write(router_addr, (STREAM_REG_ADDR(0, STREAM_REMOTE_DEST_BUF_SPACE_AVAILABLE_UPDATE_REG_INDEX)));
noc_inline_dw_write(router_addr + 4, xy_local_addr >> 32);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these magic number. 8 and 4

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update comments. NOC addresses are 64-bits, 8 bytes. To calculate respective offset for a direction, its direction * 8.
line 187 is writing lower/upper 4 bytes of the 8 bytes with two noc inline dword writes.

// set the stream auto increment to use on remote router according to
// this router's direction.
router_push_addr =
(STREAM_REG_ADDR(6 + my_direction, STREAM_REMOTE_DEST_BUF_SPACE_AVAILABLE_UPDATE_REG_INDEX));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

magic number 6?

@ubcheema ubcheema force-pushed the ucheema/tt-fabric-push-router-support branch 2 times, most recently from df3fff8 to 532011d Compare March 3, 2025 20:29
@@ -1410,6 +1413,7 @@ int main(int argc, char **argv) {
bool fixed_async_wr_notif_addr = test_args::has_command_option(input_args, "--fixed_async_wr_notif_addr");

benchmark_mode = test_args::has_command_option(input_args, "--benchmark");
push_mode = test_args::has_command_option(input_args, "--push_router");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add tests for the push model to CI?
We have:
Post Commit on N300: tests/scripts/run_cpp_fabric_tests.sh
T3k unit tests: tests/scripts/t3000/run_t3000_unit_tests.sh
TG unit tests: tests/scripts/tg/run_tg_unit_tests.sh

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests to CI.

Copy link
Contributor

@abhullar-tt abhullar-tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made some comments to ease uplift to BH

@ubcheema ubcheema force-pushed the ucheema/tt-fabric-push-router-support branch from 532011d to 691bada Compare March 3, 2025 22:20
packet_word_0 = (volatile uint32_t*)get_local_buffer_read_addr();
packet_word_1 =
reinterpret_cast<tt_l1_ptr uint32_t*>(buffer_start + (out_rdptr_inc<fvc_mode>(1) * PACKET_WORD_SIZE_BYTES));
packet_word_2 =
reinterpret_cast<tt_l1_ptr uint32_t*>(buffer_start + (out_rdptr_inc<fvc_mode>(2) * PACKET_WORD_SIZE_BYTES));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these 0,1,2 for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are pointers to first 3 packet words which point to packet header. It accounts for the case where the header might be wrapping at the end of the buffer.

Comment on lines 498 to 501
uint32_t temp = packet_word_0[0];
sender_buffer_index = temp >> 30;
temp &= 0x3FFFFFFF;
packet_words_remaining = (temp + PACKET_WORD_SIZE_BYTES - 1) >> 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this 'temp' looks having more meaning than temporal value. What is this?

}

uint32_t get_next_hop_router_noc_xy() {
uint32_t dst_mesh_id = packet_word_0[1] & 0xFFFF;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this indexing and masking?
Looks like this packet_word_* is accessing header values? If so I want more explicit accessor. getter or casting

loop_count = 0;
} else if (
*fvc_outbound_state[next_outbound_buffer].noc_word_credits ||
*fvc_outbound_state[next_outbound_buffer].sender_words_cleared) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use getter? get_num_words_cleared
or wrap these two check in one method?

Comment on lines 215 to 221
} else if (
*fvc_outbound_state[next_outbound_buffer].noc_word_credits ||
*fvc_outbound_state[next_outbound_buffer].sender_words_cleared) {
curr_outbound_buffer = next_outbound_buffer;
next_outbound_buffer = (next_outbound_buffer + 1) & 0x3;
} else {
next_outbound_buffer = (next_outbound_buffer + 1) & 0x3;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else {
    if (*fvc_outbound_state[next_outbound_buffer].noc_word_credits ||
        *fvc_outbound_state[next_outbound_buffer].sender_words_cleared) {
        curr_outbound_buffer = next_outbound_buffer;
    }
    next_outbound_buffer = (next_outbound_buffer + 1) & 0x3;
}

}
}

FORCE_INLINE void get_words_pushed() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getter should return something.

@ubcheema ubcheema force-pushed the ucheema/tt-fabric-push-router-support branch from e295ef9 to 2cc6974 Compare March 4, 2025 00:03
if constexpr (fvc_mode == FVC_MODE_ROUTER) {
tt::tt_fabric::chan_id_t my_chan = routing_table->intra_mesh_table.dest_entry[device_id];
uint32_t my_direction;
if (routing_table->port_direction.directions[eth_chan_directions::EAST] == my_chan) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are duplicate code. How about having function

next_port = routing_table->intra_mesh_table.dest_entry[dst_dev_id];
}

if (routing_table->port_direction.directions[eth_chan_directions::EAST] == next_port) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dup

next_port = routing_table[routing_plane].intra_mesh_table.dest_entry[dst_dev_id];
}

if (routing_table[routing_plane].port_direction.directions[eth_chan_directions::EAST] == next_port) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dup

@ubcheema ubcheema requested a review from a team as a code owner March 4, 2025 01:54
inline void init(uint32_t buffer_id, uint32_t data_buf_start, uint32_t data_buf_size_words) {
uint32_t words = sizeof(fvc_outbound_push_state) / 4;
uint32_t* ptr = (uint32_t*)this;
for (uint32_t i = 0; i < words; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be removed? This is zeroing the struct but we are reassigning what was just set to zero below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its initially clearing all the structure data to 0.
Later members are initialized to respective values.
Not all the members end up with non-zero values.

inline void init(uint32_t data_buf_start, uint32_t data_buf_size_words) {
uint32_t words = sizeof(fvc_inbound_push_state) / 4;
uint32_t* ptr = (uint32_t*)this;
for (uint32_t i = 0; i < words; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not all structure data members get set later in the init code.

continue;
}
uint32_t forwarding_channel = routing_table->port_direction.directions[i];
if (forwarding_channel == INVALID_DIRECTION) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check for this INVALID_DIRECTION?
If the direction was truly invalid as the name implies, a corrupt value like 123, wouldn't be caught because the forwarding channel is a uint32_t and it won't match to any of the enum values.
If INVALID_DIRECTION means to skip this direction then consider renaming it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of this commit. So we can update this later.
control plane sets the port directions when its initialized.
In this code, we are only consuming the entries in the routing table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the initializations will eventually move to host code and removed from kernel.

template <uint8_t fvc_mode = FVC_MODE_ROUTER>
void register_with_routers(uint32_t device_id, uint32_t mesh_id = 0) {
if constexpr (fvc_mode == FVC_MODE_ROUTER) {
tt::tt_fabric::chan_id_t my_chan = routing_table->intra_mesh_table.dest_entry[device_id];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where & how is this routing_table initialized?
globally declared and defined by each includer?

STREAM_ID_NOC_WORDS_RECEIVED + my_direction, STREAM_REMOTE_DEST_BUF_SPACE_AVAILABLE_UPDATE_REG_INDEX));
remote_buffer_start = FABRIC_ROUTER_DATA_BUF_START + my_direction * FABRIC_ROUTER_OUTBOUND_BUF_SIZE;

for (uint32_t i = 0; i < eth_chan_directions::COUNT; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i = eth_chan_directions::EAST?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, yeah, meant to to that.

Comment on lines +509 to +521
if (dst_mesh_id != routing_table->my_mesh_id) {
uint32_t next_port = routing_table->inter_mesh_table.dest_entry[dst_mesh_id];
remote_wrptr_direction = port_direction_table[next_port];
return eth_chan_to_noc_xy[noc_index][next_port];
} else {
uint32_t dst_device_id = packet_word_0[1] >> 16;
uint32_t next_port = routing_table->intra_mesh_table.dest_entry[dst_device_id];
remote_wrptr_direction = port_direction_table[next_port];
return eth_chan_to_noc_xy[noc_index][next_port];
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        uint32_t next_port;
        if (dst_mesh_id != routing_table->my_mesh_id) {
            next_port = routing_table->inter_mesh_table.dest_entry[dst_mesh_id];
        } else {
            uint32_t dst_device_id = packet_word_0[1] >> 16;
            next_port = routing_table->intra_mesh_table.dest_entry[dst_device_id];
        }
        remote_wrptr_direction = port_direction_table[next_port];
        return eth_chan_to_noc_xy[noc_index][next_port];

Copy link
Contributor

@daminakaTT daminakaTT Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These if else looks really common. How about having private inlined get_next_port or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this code hurts performance.
It is nice to look at, but not good for performance.
local scope variable and return allow the code to not maintain register variables across the if/else block.

}

template <AsyncWriteMode mode = AsyncWriteMode::ALL>
inline void fabric_async_write(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the client_interface type to the template as well so that we have just one definition of the async write API?

client_interface->buffer_size = FABRIC_ROUTER_OUTBOUND_BUF_SIZE / PACKET_WORD_SIZE_BYTES;
client_interface->wr_ptr = 0;
client_interface->buffer_start = FABRIC_ROUTER_DATA_BUF_START + direction * FABRIC_ROUTER_OUTBOUND_BUF_SIZE;
client_interface->router_push_addr = (STREAM_REG_ADDR(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not in this PR, but should we add all the STREAM REG handling in a separate utility file? If we have all the handling in a single place we could also keep a track of what registers are in use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can have it in a separate stream register utility file in a different PR.

@ubcheema
Copy link
Contributor Author

ubcheema commented Mar 4, 2025

Guys, In general, if there are any hazards or missing functionality, please comment on that.
Don't use this PR to learn about the code.
I am not entertaining code beautification suggestions in this PR.
We need to get this feature pushed to main.
It is not in its final form anyway. So any code rearrangement at this point is not very beneficial.
As more functionality gets added, code will morph, and I will certainly incorporate code rearrangement suggestions.

@nhuang-tt nhuang-tt self-requested a review March 5, 2025 17:49
@ubcheema ubcheema force-pushed the ucheema/tt-fabric-push-router-support branch 2 times, most recently from 4dd37e4 to e5cd1af Compare March 6, 2025 20:22
Copy link
Contributor

@aliuTT aliuTT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new tests should work if you remove TT_METAL_SLOW_DISPATCH_MODE=1 and use fast dispatch?

@ubcheema
Copy link
Contributor Author

ubcheema commented Mar 6, 2025

These new tests should work if you remove TT_METAL_SLOW_DISPATCH_MODE=1 and use fast dispatch?

I'll check that.

@ubcheema ubcheema requested a review from davorchap as a code owner March 7, 2025 14:43
@ubcheema ubcheema force-pushed the ucheema/tt-fabric-push-router-support branch from 7a59971 to d8ec54a Compare March 7, 2025 16:25
@ubcheema
Copy link
Contributor Author

ubcheema commented Mar 7, 2025

Ran successful 1000 run loop for TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_metal/perf_microbenchmark/routing/test_tt_fabric_sanity_wormhole_b0 --fabric_command 1 --board_type t3k --data_kb_per_tx 10 --push_router --metal_fabric_init_level 1

https://github.com/tenstorrent/tt-metal/actions/runs/13724561170

@ubcheema ubcheema force-pushed the ucheema/tt-fabric-push-router-support branch from d8ec54a to c85e0fd Compare March 7, 2025 17:52
        Use get_noc_addr_helper( ) when making NOC address.
        Roll sender buffer space register address in to a loop.
        Add tests to CI
        Fix bug where ethernet receiver was returning extra credits to ethernet sender.
        Hook up device init for push router.
@ubcheema ubcheema force-pushed the ucheema/tt-fabric-push-router-support branch from c85e0fd to 192720c Compare March 7, 2025 17:56
@ubcheema ubcheema merged commit 0a995b8 into main Mar 7, 2025
34 checks passed
@ubcheema ubcheema deleted the ucheema/tt-fabric-push-router-support branch March 7, 2025 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants