Enable split reader for halo #18394

Open · nkpatel-tt wants to merge 15 commits into main from nkpatel/halo-kernel-opt

Conversation

@nkpatel-tt (Contributor) commented Feb 27, 2025

Ticket

#17982

Problem description

Currently, the work done by the two reader cores is imbalanced, since the pad and local configs are processed on one core while the remote config is processed on the other.

What's changed

Enable a split reader so that this work is load-balanced across both reader cores.

Checklist

@nkpatel-tt nkpatel-tt self-assigned this Feb 27, 2025
@nkpatel-tt nkpatel-tt requested a review from a team as a code owner February 27, 2025 06:41
@nkpatel-tt nkpatel-tt requested a review from mbezuljTT February 27, 2025 08:16
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
…or not for conv2d.

Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
@pavlejosipovic (Contributor)

Is there a reason to make the halo split reader a new option? We have too many already.
Can we make it run always?

@nkpatel-tt (Contributor, Author)

> Is there a reason to make the halo split reader a new option? We have too many already. Can we make it run always?

Since this will affect conv, max pool, and bilinear, I will need to check all of them, and the models that use them, to make sure this change does not regress any of them performance-wise. I might also have to update the perf numbers if required. I will do that in another PR (if the perf is good enough) and remove this option.
BTW, I have seen that in some cases with 2-3 reads per core the performance almost matches the existing code, so I need to be sure before switching it permanently.

@mywoodstock (Contributor)

> > Is there a reason to make the halo split reader a new option? We have too many already. Can we make it run always?
>
> Since this will affect conv, max pool, and bilinear, I will need to check all of them, and the models that use them, to make sure this change does not regress any of them performance-wise. I might also have to update the perf numbers if required. I will do that in another PR (if the perf is good enough) and remove this option. BTW, I have seen that in some cases with 2-3 reads per core the performance almost matches the existing code, so I need to be sure before switching it permanently.

Yeah, we also don't want to disrupt the ongoing work on in-place halo, so we will enable the split reader by default once all support is working.

Review thread on the kernel change in halo_gather.cpp:

// Skip every odd iteration for writer in local configuration
if (iteration % 2 == 1) {
    continue;
}

Contributor:

Instead of these if conditions, can't we just always increment the iteration by 2? The start value would then be 0 for one reader and 1 for the other.
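
As a rough sketch of that suggestion (a minimal illustration; process_config, reader_loop, and the parameter names are placeholders, not the actual kernel symbols):

#include <cstdint>

// Placeholder for the existing per-entry gather/copy work in the kernel.
inline void process_config(uint32_t /*iteration*/) { /* NOC reads would go here */ }

// Each reader core starts at its own parity (0 or 1) and always steps by 2,
// so the per-iteration modulo check is no longer needed.
inline void reader_loop(uint32_t reader_id, uint32_t num_configs) {
    for (uint32_t iteration = reader_id; iteration < num_configs; iteration += 2) {
        process_config(iteration);
    }
}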

@nkpatel-tt (Author):

This condition helps balance the load better when there is less work, e.g. a (2,2) kernel with 0 padding and dilation 1. The idea is that on one core the reader handles the even iterations of the local config and the odd iterations of the remote config, while on the other core it is the other way around. So with a single local read and a single remote read, both cores do a read instead of just one. In the example mentioned above (a test case on Marko's branch), I have seen a perf difference with and without this change.
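
A minimal sketch of the parity scheme described above, using plain vectors in place of the real flattened config tensors (read_configs and the variable names are illustrative only):

#include <cstdint>
#include <cstddef>
#include <vector>

// Reader 0 takes the even-indexed local entries and the odd-indexed remote
// entries; reader 1 takes the opposite parities. With one local and one remote
// entry (both at index 0), each core then performs one read instead of a
// single core doing both.
inline void read_configs(uint32_t reader_id,
                         const std::vector<int>& local_config,
                         const std::vector<int>& remote_config) {
    for (std::size_t i = reader_id; i < local_config.size(); i += 2) {
        (void)local_config[i];   // stand-in for issuing the local-config read for entry i
    }
    for (std::size_t i = 1u - reader_id; i < remote_config.size(); i += 2) {
        (void)remote_config[i];  // stand-in for issuing the remote-config read for entry i
    }
}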

@mbezuljTT (Contributor):

My view on this is that halo_gather.cpp shouldn't change at all. I know this was how it was done in my implementation, and that might have been misleading - I'm sorry about that.

There might be more strategies for how/what we divide between cores. With that in mind, sliding_window::generate_halo_kernel_config_tensors(..) could return two sets of flattened_pad_config, flattened_local_config, and flattened_remote_config, one per data movement processor.

@mywoodstock @pavlejosipovic what do you prefer?
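
Purely as an illustration of that idea (the struct, element types, split policy, and function name below are assumptions, not the actual sliding_window API):

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: host-side pre-processing returns one set of flattened
// pad/local/remote configs per data movement processor, so the kernel never
// has to decide which entries to skip.
struct HaloReaderConfig {
    std::vector<uint16_t> flattened_pad_config;
    std::vector<uint16_t> flattened_local_config;
    std::vector<uint16_t> flattened_remote_config;
};

// Example split policy: alternate entries between the two readers, with local
// and remote assigned opposite parities so a single local plus a single remote
// entry still lands on different cores.
inline std::array<HaloReaderConfig, 2> split_halo_configs(const HaloReaderConfig& full) {
    std::array<HaloReaderConfig, 2> per_reader;
    for (std::size_t i = 0; i < full.flattened_pad_config.size(); ++i) {
        per_reader[i % 2].flattened_pad_config.push_back(full.flattened_pad_config[i]);
    }
    for (std::size_t i = 0; i < full.flattened_local_config.size(); ++i) {
        per_reader[i % 2].flattened_local_config.push_back(full.flattened_local_config[i]);
    }
    for (std::size_t i = 0; i < full.flattened_remote_config.size(); ++i) {
        per_reader[1 - (i % 2)].flattened_remote_config.push_back(full.flattened_remote_config[i]);
    }
    return per_reader;
}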

Contributor:

Yeah, I agree 100% with what Marko has suggested; I was going to suggest the same thing.
Splitting the work in the sliding window infra seems more robust: it removes that burden from the runtime, it should play well with the in-place halo that Billy is doing, and it should play well with the overlapped untilize + pad/copy-sticks that Evan is doing.

Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
@nkpatel-tt nkpatel-tt requested a review from mywoodstock March 3, 2025 11:01
Review thread on the test parametrization:

[ttnn.bfloat16],
)
@pytest.mark.parametrize("math_fidelity", [ttnn.MathFidelity.LoFi])
@pytest.mark.parametrize(

Contributor:

please test for different strides as well?

@nkpatel-tt (Author):

Added one more stride, 4. Let me know if you want to test any specific configuration for your model.

@pytest.mark.parametrize("stride", [1])
@pytest.mark.parametrize("enable_halo_split_reader", [True])
@pytest.mark.parametrize("device_params", [{"l1_small_size": 16384*2}], indirect=True)
def test_halo_split_reader(

Contributor:

This might be more of a tactical question, but why fork test_conv_features instead of introducing enable_halo_split_reader in test_conv_features?

 @pytest.mark.parametrize("enable_halo_split_reader", [True, False])
def test_conv_features(
..

To me as an outsider, this file is becoming unreadable. Definitely a call for @mywoodstock and @pavlejosipovic to decide which approach you want to take.

@nkpatel-tt (Author):

I think test_conv_features is more for the different combinations of the convolution itself, and I did not want to clutter it with halo-related test cases. Besides, in case of a failure the current test cases seem more explicit. @mywoodstock @pavlejosipovic let me know your opinion.


Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
@pavlejosipovic (Contributor)

> Yeah, we also don't want to disrupt the ongoing work on in-place halo, so we will enable the split reader by default once all support is working.

I understand that this is the lower-risk option, but there are some benefits to not going with an opt-in flag:

  1. No need to complicate our test matrix.
  2. Higher confidence that the feature is robust (perf and correctness) and can be used freely by model writers, since it would be much better tested on both accounts. With the current version we would have no perf tests on CI.
  3. No need to change the Conv2d API and then break it if we later want to remove the flag.
  4. Not much more value compared to what we already have (an implementation on a branch that we can use for one customer).

@mywoodstock (Contributor)

  1. I completely agree with the approach @mbezuljTT mentions, where we split the config tensors for the two data movement cores and do not change the kernel itself. @nkpatel-tt Could you update to use this approach? It will be good to keep these split calculations in pre-processing instead of at kernel runtime, and keep the kernel lean.
  2. Regarding the extra flag: it makes sense to always keep it on. Let's go with that then.

Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
@nkpatel-tt nkpatel-tt force-pushed the nkpatel/halo-kernel-opt branch from 6309bb6 to f1dc5bc Compare March 6, 2025 11:01
@nkpatel-tt nkpatel-tt requested a review from mbezuljTT March 6, 2025 11:04
@mbezuljTT (Contributor)

Thanks, could you please share perf numbers after the update? 🙏
