Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Multiplexed-round-robin scheduler #604

Merged
1 commit merged into from
Oct 5, 2024
Merged

Conversation

arunkarthik-akkart
Copy link
Contributor

@arunkarthik-akkart arunkarthik-akkart commented Sep 16, 2024

Description of changes:
This change modifies the current scheduling policy such that we either round-robin on each rail or a pair of rails or a triplet of rails or a quadruplet of rails based on the min_stripe_size. Within the group of rails the message is divided by the stripe_size and multiplexed on each rail.

The num_stripes to be multiplexed is calculated using the below formula,
num_stripes = min ( ceil ( msg_size / min_stripe_size ) , num_rails )

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@arunkarthik-akkart arunkarthik-akkart requested review from rauteric and a team as code owners September 16, 2024 17:33
@AmedeoSapio
Copy link

AmedeoSapio commented Sep 16, 2024

The CI failure is a real issue. We need to figure out how this can happen.
gather_perf: nccl_ofi_rdma.c:223: get_send_comm_rail: Assertion rail_id < s_comm->num_init_rails failed.

@arunkarthik-akkart arunkarthik-akkart force-pushed the rrx2 branch 2 times, most recently from e12c9d9 to d6b7909 Compare September 18, 2024 17:48
@arunkarthik-akkart arunkarthik-akkart force-pushed the rrx2 branch 2 times, most recently from c849d64 to e1d4db3 Compare September 19, 2024 23:40
@arunkarthik-akkart arunkarthik-akkart force-pushed the rrx2 branch 3 times, most recently from a9cc48c to 959f979 Compare September 22, 2024 12:06
@ghost ghost added the previously-passed-ci label Sep 24, 2024
@arunkarthik-akkart arunkarthik-akkart force-pushed the rrx2 branch 4 times, most recently from 5194acd to 5efc6f1 Compare September 25, 2024 22:49
rajachan
rajachan previously approved these changes Sep 26, 2024
bwbarrett
bwbarrett previously approved these changes Sep 26, 2024
@arunkarthik-akkart arunkarthik-akkart force-pushed the rrx2 branch 3 times, most recently from 5fcd9f9 to be09cd9 Compare September 26, 2024 20:55
rauteric
rauteric previously approved these changes Sep 30, 2024
@ghost
Copy link

ghost commented Oct 1, 2024

after passing every other box, the runner lost network, couldn't query pypi, and timed out on pip install ./pf.

Command . /home/ubuntu/PortaFiducia/env/bin/activate; pip install -Ue /home/ubuntu/PortaFiducia failed with error:
  error: subprocess-exited-with-error
[...]
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7cf6edc4cdc0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/setuptools/
ERROR: Could not find a version that satisfies the requirement setuptools>=40.8.0 (from versions: none)
ERROR: No matching distribution found for setuptools>=40.8.0

bot:aws:retest

@ghost
Copy link

ghost commented Oct 1, 2024

I rebased #627 on this and did a build: link

-Wsign-compare does not like this line:

../include/nccl_ofi_math.h:29:46: error: signed and unsigned type in conditional expression [-Werror=sign-compare]
 #define  NCCL_OFI_MIN(x, y) ((x) < (y) ? (x) : (y))
                                              ^
../include/nccl_ofi_math.h:24:42: note: in definition of macro 'NCCL_OFI_MAX'
 #define NCCL_OFI_MAX(x, y) ((x) < (y) ? (y) : (x))
                                          ^
nccl_ofi_scheduler.c:45:42: note: in expansion of macro 'NCCL_OFI_MIN'
  int num_stripes = (int) NCCL_OFI_MAX(1, NCCL_OFI_MIN(NCCL_OFI_DIV_CEIL(size, scheduler->min_stripe_size), num_rails));

I was going to suggest making num_rails unsigned in the signature, but if you follow this all the way out, it ultimately comes from just matching the choice of the type on device->num_rails... int is probably the wrong type there (I don't know what it would mean for a device to have a negative amount of rails) but changing something so far away is a pretty intrusive change to ask for at this point in this PR.

If this passes jenkins on this go around, please push and I'll rebase on it and amend the commit for #575 with a fixup. (I'd like to drop #576 anyway.) Otherwise, if you end up having to rebase on #627, please do (unsigned)num_rails there or similar.

rauteric
rauteric previously approved these changes Oct 1, 2024
ghost
ghost previously approved these changes Oct 3, 2024
This commit modifies the scheduler algorithm to round-robin the
payload_msg on each QP or betwen pairs of QP's or triplets of
QP's or quadruplets of QP's based on the min_stripe_size.
@ghost ghost merged commit bad74fc into aws:master Oct 5, 2024
31 checks passed
int ret = 0;
*schedule = (nccl_net_ofi_schedule_t *)malloc(sizeof(nccl_net_ofi_xfer_info_t) +
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went to fix some warnings that this introduced under C++, came across this. This is wrong and something we missed in review.

You meant to do:

*schedule = (nccl_net_ofi_schedule_t*)malloc(
                sizeof(nccl_net_ofi_schedule_t) + 
                (num_xfer_infos * sizeof(nccl_net_ofi_xfer_info_t)));

Looks to me like this was just a typo, in which case, you can ignore the rest of this comment... but if you were confused and afraid to ask and just copy-pasted from somewhere, I wouldn't blame you, because unless you've come across flexible arrays at some point in the past it's a really weird hack that deserves some explanation.

Structs with flexible array members look like this:

struct foo {
  int a;
  int b;
  uint64_t flex[];
}

Or like this:

struct foo {
  int a;
  int b;
  uint64_t flex[0];
}

The way this works is that the sizeof() operator on this struct does not account for the last member (the flexible member) in any way:

static_assert(sizeof(struct foo) == (sizeof(int)*2));

This means that if you call:

struct foo *my_foo = (struct foo*)malloc(sizeof(struct foo))`

You're only allocating space for a and b, but any access of my_foo->flex would be illegal. But the power of this is that if you do:

struct foo *my_foo = (struct foo*)malloc(
        sizeof(struct foo) + 
        (nelems * sizeof(uint64_t)))

You can then access up to my_foo->flex[nelems-1] legally and in a well-defined way.

Some things to watch out for: the flexible array member must always be the last member, and from that rule it follows that there can only ever be one flexible array member in any given struct, and it also follows that if there is some other struct bar containing a struct foo, that member must come last (and recursively for any struct containing bar, and so on):

struct bar {
  // struct foo my_foo; // not ok
  int c;
  int d;
  struct foo my_foo; // ok
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation on flexible array members Nick.
I think I did copy from a previous reference in the code. The explanation really helped.
I believe in this case we should be good as rail_xfer_infos is the last member of the struct nccl_net_ofi_schedule

typedef struct nccl_net_ofi_schedule {
	/* Number of transfer information entries set by the scheduler */
	size_t num_xfer_infos;

	/* Array of transfer information structs. The array has at
	 * least 'num_xfer_infos' entries. */
	nccl_net_ofi_xfer_info_t rail_xfer_infos[];
} nccl_net_ofi_schedule_t;

I have posted another PR to fix the typo in the function declaration

Thank you

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants