-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
1.13.2 staging #740
Closed
arunkarthik-akkart
wants to merge
12
commits into
aws:v1.13.x-aws
from
arunkarthik-akkart:1.13.2-staging
Closed
1.13.2 staging #740
arunkarthik-akkart
wants to merge
12
commits into
aws:v1.13.x-aws
from
arunkarthik-akkart:1.13.2-staging
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The extend function needs to take two points in a plane, calculate the slope and then find the projected rank (y) for TUNER_MAX_SIZE (x) based on the slope. However, If the projected point(y) is greater than TUNER_MAX_RANKS, it should instead calculate the projected msg_size(x) for TUNER_MAX_RANKS. This will ensure that the polygon does not exceed (TUNER_MAX_SIZE, TUNER_MAX_RANKS). Crrently this condition is reversed and this commit inverts the condition. Signed-off-by: Arun Karthik <akkart@amazon.com> (cherry picked from commit 4d06965)
Currently the P5en is calibrated from 2 to 128 nodes on EFA. This corresponds to 16 to 1024 ranks on AR 0x0, AG 0x0 and RS 0x0. But on AR 0x7, AG 0x7 and RS 0x7 the corresponding ranks are 2 to 128. This commit divides all the 0x7 polygon points by 8 to re-calibrate the tuner. Signed-off-by: Arun Karthik <akkart@amazon.com> (cherry picked from commit fbb2a45)
Currently there exists no unit test for the extend logic in the region_based_tuner. This commit adds some unit tests to check for points that the extrapolated by the extend_region function in the below cases : 1. When the two points are on a horizontal line 2. When the two points are on a vertical line 3. When the point extended reaches the upper bound for TUNER_MAX_SIZE 4. When the point extended reaches the upper bound for TUNER_MAX_RANKS Signed-off-by: Arun Karthik <akkart@amazon.com> (cherry picked from commit 3c98f59)
Currently the intersection point is of the two lines is calculated as: double a = (vcross(x0, dx) - vcross(y0, dx)) / d; This division is reducing the accuracy when calculating the intersection of long lines. Converting `a` to long double maintains the accuracy. Signed-off-by: Arun Karthik <akkart@amazon.com> (cherry picked from commit b51e3d5)
Added unit tests to check if a point is either inside the polygon, outside the polygon or on the edge of the polygon. Signed-off-by: Arun Karthik <akkart@amazon.com> (cherry picked from commit bcb2e96)
Add the recently added tuner region unit test binary to the list of files that git should ignore. Signed-off-by: Brian Barrett <bbarrett@amazon.com> (cherry picked from commit e66dc5e)
Creating a sane cache-key that works across branches and across pull requests is more difficult than it appears from a quick glance at the docs. We have been running this in a cron schedule, but not using the cached artifact, and my efforts to actually use the cached tarball have not been successful. Delete this job. Signed-off-by: Nicholas Sielicki <nslick@amazon.com> (cherry picked from commit 2ad886c)
On all pushes to tags, generate and upload a make dist archive. This can be potentially be extended in the future to create a github release on every tag, and to upload the tarball as a release artifact, in a future commit. Signed-off-by: Nicholas Sielicki <nslick@amazon.com> (cherry picked from commit 56a4d79)
Adding the tuner regions for All Gather 0x0 and Reduce Scatter 0x0 These regions were removed from the region based tuner because of the bug in the polygon extend logic. After the bug is fixed, adding back the regions for AG0x0 and RS0x0. Signed-off-by: Arun Karthik <akkart@amazon.com> (cherry picked from commit 8f86911)
Currently the Ring-LL region for AG0x0 and RS0x0 starts with the point {0, 16} and extends till {TUNER_MAX_SIZE, TUNER_MAX_RANKS}. But the polygon was not completely closed without a point on the y-axis. Added a points {0, TUNER_MAX_RANKS}. This ensures the polygon is closed correctly. Signed-off-by: Arun Karthik <akkart@amazon.com> (cherry picked from commit fbae0f9)
The LL128 protocol causes an interesting pattern where there's a long string of large-ish messages and then a "leftover", most notable on power of 2 message sizes like are used in benchmarks. If that leftover bit is eager sized, you end up with an eager message queued behind a bunch of writes, which doesn't gain you anything in latency hiding the receive request, but then costs you the fixup on the remote side. This patch implements an overly simplistic version of runting, which skips eager if there is already a write outstanding on the communicator. This is wrong for a couple reasons. First, we should probably use a BDP or similar to determine how much data should be in flight before we skip the eager protocol. Second, we should really decrement the counter on cqe processing time, but that is a locking mess. And finally, it should really be a domain counter instead of a communicator counter, but again, see the locking mess comment. The NCCL communication patterns are generally simple enough that none of these really seem to matter, so we go with simplicity for now. Signed-off-by: Brian Barrett <bbarrett@amazon.com> (cherry picked from commit 43b7bcd)
This reverts commit 224593f. Our shared development cluster seems to have issues with dmabuf when running NCCL tests, for a handful of niche situations, ie: two nodes, with MPI_Comm_split equal to the number of GPUs, at 16GB+. Other environments seem not to have issues with the same workload, but out of an abundance of caution and due to a lack of root cause, this is being reverted again. Signed-off-by: Nicholas Sielicki <nslick@amazon.com> (cherry picked from commit 1a46a67)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description of changes:
Add the below commit to v1.13.x-aws branch
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.