-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Multiplexed-round-robin scheduler #604
Conversation
The CI failure is a real issue. We need to figure out how this can happen. |
e12c9d9
to
d6b7909
Compare
d6b7909
to
63d7b1b
Compare
c849d64
to
e1d4db3
Compare
a9cc48c
to
959f979
Compare
5194acd
to
5efc6f1
Compare
5fcd9f9
to
be09cd9
Compare
1106127
1106127
to
69e058b
Compare
1c4daa5
to
948fd4e
Compare
b1a4cb2
to
2406077
Compare
after passing every other box, the runner lost network, couldn't query pypi, and timed out on
bot:aws:retest |
I rebased #627 on this and did a build: link
I was going to suggest making num_rails unsigned in the signature, but if you follow this all the way out, it ultimately comes from just matching the choice of the type on If this passes jenkins on this go around, please push and I'll rebase on it and amend the commit for #575 with a fixup. (I'd like to drop #576 anyway.) Otherwise, if you end up having to rebase on #627, please do |
1330257
to
5995682
Compare
6c157db
to
a5ade69
Compare
This commit modifies the scheduler algorithm to round-robin the payload_msg on each QP or betwen pairs of QP's or triplets of QP's or quadruplets of QP's based on the min_stripe_size.
b3db3e0
to
fa1f2d9
Compare
int ret = 0; | ||
*schedule = (nccl_net_ofi_schedule_t *)malloc(sizeof(nccl_net_ofi_xfer_info_t) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
went to fix some warnings that this introduced under C++, came across this. This is wrong and something we missed in review.
You meant to do:
*schedule = (nccl_net_ofi_schedule_t*)malloc(
sizeof(nccl_net_ofi_schedule_t) +
(num_xfer_infos * sizeof(nccl_net_ofi_xfer_info_t)));
Looks to me like this was just a typo, in which case, you can ignore the rest of this comment... but if you were confused and afraid to ask and just copy-pasted from somewhere, I wouldn't blame you, because unless you've come across flexible arrays at some point in the past it's a really weird hack that deserves some explanation.
Structs with flexible array members look like this:
struct foo {
int a;
int b;
uint64_t flex[];
}
Or like this:
struct foo {
int a;
int b;
uint64_t flex[0];
}
The way this works is that the sizeof()
operator on this struct does not account for the last member (the flexible member) in any way:
static_assert(sizeof(struct foo) == (sizeof(int)*2));
This means that if you call:
struct foo *my_foo = (struct foo*)malloc(sizeof(struct foo))`
You're only allocating space for a
and b
, but any access of my_foo->flex
would be illegal. But the power of this is that if you do:
struct foo *my_foo = (struct foo*)malloc(
sizeof(struct foo) +
(nelems * sizeof(uint64_t)))
You can then access up to my_foo->flex[nelems-1]
legally and in a well-defined way.
Some things to watch out for: the flexible array member must always be the last member, and from that rule it follows that there can only ever be one flexible array member in any given struct, and it also follows that if there is some other struct bar
containing a struct foo
, that member must come last (and recursively for any struct containing bar
, and so on):
struct bar {
// struct foo my_foo; // not ok
int c;
int d;
struct foo my_foo; // ok
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed explanation on flexible array members Nick.
I think I did copy from a previous reference in the code. The explanation really helped.
I believe in this case we should be good as rail_xfer_infos is the last member of the struct nccl_net_ofi_schedule
typedef struct nccl_net_ofi_schedule {
/* Number of transfer information entries set by the scheduler */
size_t num_xfer_infos;
/* Array of transfer information structs. The array has at
* least 'num_xfer_infos' entries. */
nccl_net_ofi_xfer_info_t rail_xfer_infos[];
} nccl_net_ofi_schedule_t;
I have posted another PR to fix the typo in the function declaration
Thank you
Description of changes:
This change modifies the current scheduling policy such that we either round-robin on each rail or a pair of rails or a triplet of rails or a quadruplet of rails based on the min_stripe_size. Within the group of rails the message is divided by the stripe_size and multiplexed on each rail.
The num_stripes to be multiplexed is calculated using the below formula,
num_stripes = min ( ceil ( msg_size / min_stripe_size ) , num_rails )
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.