Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.13.2 staging #740

Closed

Conversation

arunkarthik-akkart
Copy link
Contributor

Description of changes:
Add the below commit to v1.13.x-aws branch

8e3728e - Reapply "defaults: make dmabuf opt-in"

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

arunkarthik-akkart and others added 12 commits December 5, 2024 01:30
The extend function needs to take two points in a plane,
calculate the slope and then find the projected rank (y)
for TUNER_MAX_SIZE (x) based on the slope.
However, If the projected point(y) is greater than
TUNER_MAX_RANKS, it should instead calculate the projected msg_size(x)
for TUNER_MAX_RANKS. This will ensure that the polygon does not
exceed (TUNER_MAX_SIZE, TUNER_MAX_RANKS).
Crrently this condition is reversed and this commit inverts the
condition.

Signed-off-by: Arun Karthik <akkart@amazon.com>
(cherry picked from commit 4d06965)
Currently the P5en is calibrated from 2 to 128 nodes on EFA.
This corresponds to 16 to 1024 ranks on AR 0x0, AG 0x0 and
RS 0x0. But on AR 0x7, AG 0x7 and RS 0x7 the corresponding ranks are
2 to 128.
This commit divides all the 0x7 polygon points by 8 to re-calibrate
the tuner.

Signed-off-by: Arun Karthik <akkart@amazon.com>
(cherry picked from commit fbb2a45)
Currently there exists no unit test for the extend logic in the
region_based_tuner. This commit adds some unit tests to check for
points that the extrapolated by the extend_region function in the
below cases :
1. When the two points are on a horizontal line
2. When the two points are on a vertical line
3. When the point extended reaches the upper bound for TUNER_MAX_SIZE
4. When the point extended reaches the upper bound for TUNER_MAX_RANKS

Signed-off-by: Arun Karthik <akkart@amazon.com>
(cherry picked from commit 3c98f59)
Currently the intersection point is of the two lines is calculated as:

double a = (vcross(x0, dx) - vcross(y0, dx)) / d;
This division is reducing the accuracy when calculating the intersection
of long lines. Converting `a` to long double maintains the
accuracy.

Signed-off-by: Arun Karthik <akkart@amazon.com>
(cherry picked from commit b51e3d5)
Added unit tests to check if a point is either inside the polygon,
outside the polygon or on the edge of the polygon.

Signed-off-by: Arun Karthik <akkart@amazon.com>
(cherry picked from commit bcb2e96)
Add the recently added tuner region unit test binary to the list of
files that git should ignore.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit e66dc5e)
Creating a sane cache-key that works across branches and across pull
requests is more difficult than it appears from a quick glance at the
docs. We have been running this in a cron schedule, but not using the
cached artifact, and my efforts to actually use the cached tarball have
not been successful. Delete this job.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
(cherry picked from commit 2ad886c)
On all pushes to tags, generate and upload a make dist archive. This can
be potentially be extended in the future to create a github release on
every tag, and to upload the tarball as a release artifact, in a future
commit.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
(cherry picked from commit 56a4d79)
Adding the tuner regions for All Gather 0x0 and Reduce Scatter 0x0
These regions were removed from the region based tuner because of
the bug in the polygon extend logic. After the bug is fixed,
adding back the regions for AG0x0 and RS0x0.

Signed-off-by: Arun Karthik <akkart@amazon.com>
(cherry picked from commit 8f86911)
Currently the Ring-LL region for AG0x0 and RS0x0 starts with
the point {0, 16} and extends till {TUNER_MAX_SIZE, TUNER_MAX_RANKS}.
But the polygon was not completely closed without a point on the y-axis.
Added a points {0, TUNER_MAX_RANKS}. This ensures the polygon is closed
correctly.

Signed-off-by: Arun Karthik <akkart@amazon.com>
(cherry picked from commit fbae0f9)
The LL128 protocol causes an interesting pattern where there's a
long string of large-ish messages and then a "leftover", most
notable on power of 2 message sizes like are used in benchmarks.
If that leftover bit is eager sized, you end up with an eager
message queued behind a bunch of writes, which doesn't gain
you anything in latency hiding the receive request, but then
costs you the fixup on the remote side.

This patch implements an overly simplistic version of runting,
which skips eager if there is already a write outstanding on
the communicator.  This is wrong for a couple reasons.  First,
we should probably use a BDP or similar to determine how much
data should be in flight before we skip the eager protocol.
Second, we should really decrement the counter on cqe processing
time, but that is a locking mess.  And finally, it should really
be a domain counter instead of a communicator counter, but again,
see the locking mess comment.  The NCCL communication patterns
are generally simple enough that none of these really seem
to matter, so we go with simplicity for now.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 43b7bcd)
This reverts commit 224593f.

Our shared development cluster seems to have issues with dmabuf when
running NCCL tests, for a handful of niche situations, ie: two nodes,
with MPI_Comm_split equal to the number of GPUs, at 16GB+. Other
environments seem not to have issues with the same workload, but out of
an abundance of caution and due to a lack of root cause, this is being
reverted again.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
(cherry picked from commit 1a46a67)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants