Enable split reader for halo #18394
base: main
Conversation
…or not for conv2d. Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Force-pushed from bf2d051 to 1da69fe
Is there a reason to make the halo split reader a new option? We have too many already.
Since this will affect conv, max pool, and bilinear, I will need to verify for all of them (and the models that use them) that this change does not regress performance. I might also have to update perf numbers if required. I shall do that in another PR (if perf is good enough) and remove this option.
Yeah, we also don't want to disrupt the ongoing work on in-place halo, so we will enable the split reader by default once all support is working.
// Skip every odd iteration for writer in local configuration
if (iteration % 2 == 1) {
    continue;
}
Instead of these if conditions, can we just always increment the iteration by 2, with the start value being either 0 or 1 for the two readers, respectively?
This condition helps balance the load better when there is little work, say a (2,2) kernel with 0 padding and 1 dilation. The idea is that the reader handling the local config starts on even iterations and the reader handling the remote config starts on odd iterations on one core, and the other way around on the other core. So in the case of a single local and a single remote read, both cores will do a read instead of just one. In the above-mentioned example (a test case in Marko's branch), I have seen a perf difference with and without this change.
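For illustration, a minimal, self-contained sketch of the interleaving idea under discussion; the function and helper names are hypothetical and this is not the actual halo_gather.cpp code:

#include <cstdio>

// Hypothetical stand-ins for the reads done by the halo reader kernel.
static void read_local(unsigned i)  { std::printf("local iteration  %u\n", i); }
static void read_remote(unsigned i) { std::printf("remote iteration %u\n", i); }

// reader_id is 0 or 1. Reader 0 takes even iterations of the local config and
// odd iterations of the remote config; reader 1 does the opposite. With a
// single local and a single remote read, each reader then performs one read
// instead of one reader doing both.
static void process_configs(unsigned reader_id, unsigned num_local, unsigned num_remote) {
    for (unsigned i = reader_id; i < num_local; i += 2) {
        read_local(i);
    }
    for (unsigned i = 1u - reader_id; i < num_remote; i += 2) {
        read_remote(i);
    }
}

int main() {
    // One local and one remote iteration: reader 0 reads local 0, reader 1 reads remote 0.
    process_configs(/*reader_id=*/0, /*num_local=*/1, /*num_remote=*/1);
    process_configs(/*reader_id=*/1, /*num_local=*/1, /*num_remote=*/1);
}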
My view on this is that halo_gather.cpp shouldn't change at all. I know this is how it was done in my implementation, and that might have been misleading - I'm sorry about that.
There might be more strategies for how and what we divide between cores; with that in mind, sliding_window::generate_halo_kernel_config_tensors(..) could return two sets of flattened_pad_config, flattened_local_config, flattened_remote_config, one per data movement processor.
@mywoodstock @pavlejosipovic what do you prefer?
Yeah, I agree 100% with what Marko has suggested; I was going to suggest the same thing.
Splitting the work in the sliding window infra seems more robust: it removes that burden from the runtime,
it should play well with the in-place halo that Billy is doing, and it should play well with implementing the overlapped untilize + pad/copysticks that Evan is doing.
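To make that idea concrete, here is a rough, self-contained sketch of what returning one config set per data movement processor might look like; the struct, function names, and the naive interleaved split are assumptions for illustration, not the actual sliding_window API:

#include <array>
#include <cstdint>
#include <vector>

// Hypothetical container; the real configs are flattened tensors produced by
// sliding_window::generate_halo_kernel_config_tensors(..).
struct HaloReaderConfigs {
    std::vector<uint16_t> flattened_pad_config;
    std::vector<uint16_t> flattened_local_config;
    std::vector<uint16_t> flattened_remote_config;
};

// Toy splitter: take the single config set produced today and emit one set per
// data movement processor, so the kernel needs no split logic and the division
// strategy lives entirely in the sliding-window infrastructure.
static std::array<HaloReaderConfigs, 2> split_per_processor(const HaloReaderConfigs& full) {
    std::array<HaloReaderConfigs, 2> out{};
    auto interleave = [](const std::vector<uint16_t>& src,
                         std::vector<uint16_t>& even, std::vector<uint16_t>& odd) {
        bool to_even = true;
        for (uint16_t v : src) {
            (to_even ? even : odd).push_back(v);
            to_even = !to_even;
        }
    };
    interleave(full.flattened_pad_config, out[0].flattened_pad_config, out[1].flattened_pad_config);
    interleave(full.flattened_local_config, out[0].flattened_local_config, out[1].flattened_local_config);
    interleave(full.flattened_remote_config, out[0].flattened_remote_config, out[1].flattened_remote_config);
    return out;
}

int main() {
    HaloReaderConfigs full{{1, 2, 3, 4}, {5, 6, 7}, {8, 9}};
    auto per_processor = split_per_processor(full);
    // per_processor[0] and per_processor[1] would each feed one reader kernel.
    return static_cast<int>(per_processor[0].flattened_pad_config.size());
}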
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
[ttnn.bfloat16],
)
@pytest.mark.parametrize("math_fidelity", [ttnn.MathFidelity.LoFi])
@pytest.mark.parametrize(
please test for different strides as well?
Added one more stride, 4. Let me know if you want to test any specific configuration for your model.
@pytest.mark.parametrize("stride", [1]) | ||
@pytest.mark.parametrize("enable_halo_split_reader", [True]) | ||
@pytest.mark.parametrize("device_params", [{"l1_small_size": 16384*2}], indirect=True) | ||
def test_halo_split_reader( |
This might be more of a tactical question: why fork test_conv_features instead of introducing enable_halo_split_reader in test_conv_features?
@pytest.mark.parametrize("enable_halo_split_reader", [True, False])
def test_conv_features(
..
To me as an outsider, this file is becoming unreadable. Definitely a call for @mywoodstock and @pavlejosipovic to decide which approach you want to take.
I think test_conv_features is more for different combinations of the convolution itself, and I did not want to clutter it with halo-related test cases. Besides, in case of any failure, the current test cases seem more explicit. @mywoodstock @pavlejosipovic let me know your opinion.
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
I understand that this is a lower-risk option, but there are some benefits to doing it with an opt-in flag.
Signed-off-by: Nilaykumar Patel <nkpatel@tenstorrent.com>
Force-pushed from 6309bb6 to f1dc5bc
thanks, could you please share perf numbers after the update 🙏
Ticket
#17982
Problem description
Currently, there is an imbalance in the work done by the reader cores, since the pad and local configs are processed on one core and the remote config on the other core.
What's changed
Enable the split reader, which balances the load between the reader cores.
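For illustration only, a toy model of the reader load before and after the split; the iteration counts below are made-up assumptions, not measured numbers:

#include <algorithm>
#include <cstdio>

int main() {
    // Assumed iteration counts for the three configs (arbitrary, for illustration).
    unsigned pad = 4, local = 6, remote = 5;

    // Before: pad + local configs on one reader, remote config on the other.
    unsigned before_r0 = pad + local;   // 10
    unsigned before_r1 = remote;        // 5
    std::printf("before: slowest reader does %u reads\n", std::max(before_r0, before_r1));

    // After (split reader): each config is split roughly in half between the readers.
    unsigned after_r0 = (pad + 1) / 2 + (local + 1) / 2 + remote / 2;   // 7
    unsigned after_r1 = pad / 2 + local / 2 + (remote + 1) / 2;         // 8
    std::printf("after:  slowest reader does %u reads\n", std::max(after_r0, after_r1));
}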
Checklist