[scheduler] Various optimizations to task stealing #9594
base: master
Conversation
When a scheduler runqueue is empty, that scheduler attempts to steal work from another (hopefully non-empty) runqueue. Currently, the sequence of events looks like this:

- Notice that there is no work to do
- Set a flag saying "don't steal work from me, I'm busy stealing"
- Remove a process from some other runqueue
- Put it back on your own runqueue
- goto the beginning of the scheduler loop
- Notice that there is now work to do
- Take the process from your runqueue (that you just added it to)
- Unset the flag
- Execute that process

The new sequence looks like this:

- Notice that there is no work to do
- Remove a process from some other runqueue
- Execute that process

There is no more need for a flag to protect our runqueue from stealing, since the process we stole never spends time in it (and so is never at risk of being stolen back from us, which could lead to some kind of livelock).
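Roughly, the control-flow change looks like this (an illustrative C sketch; names like `steal_from_other_queue` are hypothetical stand-ins, and the real ERTS scheduler loop is considerably more involved):

```c
#include <stddef.h>

typedef struct process process_t;

typedef struct run_queue {
    process_t *head;               /* first runnable process, NULL if empty */
} run_queue_t;

/* Declarations standing in for the real ERTS machinery. */
process_t *dequeue(run_queue_t *rq);
process_t *steal_from_other_queue(run_queue_t *self);
void enqueue(run_queue_t *rq, process_t *p);
void execute(process_t *p);

/* Old flow: park the stolen process in our own queue, loop around,
 * and immediately dequeue it again (needs an "I'm stealing" flag). */
void schedule_old(run_queue_t *rq) {
    for (;;) {
        process_t *p = dequeue(rq);
        if (p == NULL) {
            process_t *stolen = steal_from_other_queue(rq);
            if (stolen != NULL)
                enqueue(rq, stolen);   /* dequeued again on the next pass */
            continue;
        }
        execute(p);
    }
}

/* New flow: execute the stolen process directly, bypassing our queue. */
void schedule_new(run_queue_t *rq) {
    for (;;) {
        process_t *p = dequeue(rq);
        if (p == NULL)
            p = steal_from_other_queue(rq);
        if (p != NULL)
            execute(p);
    }
}
```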
Currently, the scheduler uses work-stealing and steals a single task at a time. This means that as soon as it is done with that one task, it needs to steal another, and another, and so on. All of this task stealing needs to take locks and can become a bottleneck. As a first step towards removing this bottleneck, this patch changes the scheduler to steal half the tasks in a queue at once.
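A minimal sketch of the idea, assuming a simple singly-linked runqueue (these are illustrative types, not the actual ERTS structures): the victim's queue is walked once, taking every other task, so roughly half of it moves in a single locked operation.

```c
#include <stddef.h>

typedef struct task {
    struct task *next;
} task_t;

typedef struct queue {
    task_t *head;
    size_t len;
} queue_t;

/* Steal roughly half of `victim` by alternately taking and skipping
 * tasks. Assumes the caller holds the victim's lock. */
task_t *steal_half_alternating(queue_t *victim) {
    task_t *stolen = NULL;
    task_t **stolen_tail = &stolen;
    task_t **cur = &victim->head;
    int take = 1;
    while (*cur != NULL) {
        if (take) {
            task_t *t = *cur;
            *cur = t->next;            /* unlink from the victim */
            t->next = NULL;
            *stolen_tail = t;          /* append to the stolen list */
            stolen_tail = &t->next;
            victim->len--;
        } else {
            cur = &(*cur)->next;       /* leave this task behind */
        }
        take = !take;
    }
    return stolen;
}
```

A later commit in this PR replaces this alternating walk with a contiguous take; see the sketch further down.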
Instead of unlocking at the last moment and locking again right after, we unlock at the beginning of try_steal_task, and only re-lock at the end of try_steal_task. The only access to rq in that time is rq->ix, which is only written to once in erts_init_scheduling, so there should be no new race.
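In pthread terms, the restructuring looks something like this (a sketch only; the real code uses the ERTS runqueue lock API, and `do_steal` is a hypothetical stand-in for the stealing logic):

```c
#include <pthread.h>

typedef struct run_queue {
    pthread_mutex_t lock;
    int ix;                  /* written once in erts_init_scheduling,
                                so reading it without the lock is safe */
} run_queue_t;

int do_steal(run_queue_t *self);   /* acquires victims' locks internally */

/* Assumes the caller holds rq->lock on entry, as in the real scheduler. */
int try_steal_task(run_queue_t *rq) {
    int result;
    pthread_mutex_unlock(&rq->lock);   /* drop our lock for the whole steal */
    result = do_steal(rq);             /* only reads rq->ix in the meantime */
    pthread_mutex_lock(&rq->lock);     /* re-acquire once, at the end */
    return result;
}
```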
It is always NULL
We now steal up to 100 tasks at once. But we were updating the victim runqueue length by repeated decrements, and the beneficiary runqueue length by repeated increments. This was not just extra work: it also lengthened the critical sections, since these updates were done while holding the relevant locks. This change batches the increments/decrements, doing a single addition/subtraction.
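The bookkeeping change, sketched with C11 atomics (illustrative only; ERTS has its own atomics layer):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct queue {
    atomic_size_t len;
} queue_t;

/* Before: one decrement on the victim and one increment on ourselves
 * per stolen task, all inside the critical sections. After: a single
 * batched update per side. */
void account_steal(queue_t *victim, queue_t *self, size_t n_stolen) {
    atomic_fetch_sub(&victim->len, n_stolen);  /* while holding victim's lock */
    atomic_fetch_add(&self->len, n_stolen);    /* while holding our own lock */
}
```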
Currently we steal half the tasks by alternating between taking a process and leaving it behind. With this commit we instead compute how many tasks to steal, and then steal that many without skipping any non-bound processes. This should halve the number of pointer dereferences during task stealing, and thus also halve the number of cache misses, making that (highly contended) critical section faster.
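Compared to the alternating walk sketched earlier, the contiguous version chases each `next` pointer only once (same illustrative types; the real code also has to skip processes bound to a scheduler, which is elided here):

```c
#include <stddef.h>

typedef struct task {
    struct task *next;
} task_t;

typedef struct queue {
    task_t *head;
    size_t len;
} queue_t;

/* Take the first len/2 tasks as one contiguous chain.
 * Assumes the caller holds the victim's lock. */
task_t *steal_half_contiguous(queue_t *victim) {
    size_t n = victim->len / 2;
    if (n == 0)
        return NULL;
    task_t *first = victim->head;
    task_t *last = first;
    for (size_t i = 1; i < n; i++)      /* n-1 pointer chases, not ~2n */
        last = last->next;
    victim->head = last->next;          /* detach the stolen prefix */
    last->next = NULL;
    victim->len -= n;
    return first;
}
```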
Instead of using a stack to store the tasks being stolen and turning them back into a linked list at the end, we can store them as a linked list throughout. The main benefit is shrinking the critical section where we re-enqueue them.
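Keeping the stolen tasks chained means the re-enqueue becomes a constant-time splice (a sketch with the same illustrative types; `first` and `last` delimit the stolen chain):

```c
typedef struct task {
    struct task *next;
} task_t;

typedef struct queue {
    task_t *head;
    task_t *tail;
} queue_t;

/* Splice an already-linked chain of stolen tasks onto our own queue:
 * O(1) work inside the critical section, regardless of how many tasks
 * were stolen. Assumes the caller holds our lock. */
void splice_stolen(queue_t *self, task_t *first, task_t *last) {
    if (self->tail != NULL)
        self->tail->next = first;
    else
        self->head = first;
    self->tail = last;
}
```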
If we cannot immediately take the lock required to steal a task, we skip that runqueue for the moment and retry it only once we have looked at all other runqueues. This mitigates the performance hit we see where dozens of threads try to steal from the same runqueue at once, by spreading the load.
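A sketch of the two-pass victim scan using pthread trylock (hypothetical names; `steal_locked` stands in for the actual stealing step):

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct run_queue {
    pthread_mutex_t lock;
} run_queue_t;

bool steal_locked(run_queue_t *victim);   /* assumes victim's lock is held */

#define MAX_QUEUES 64                     /* sketch-only bound on n */

bool try_steal_from_all(run_queue_t **victims, int n) {
    bool skipped[MAX_QUEUES] = { false };

    /* First pass: never block; defer contended runqueues for later. */
    for (int i = 0; i < n; i++) {
        if (pthread_mutex_trylock(&victims[i]->lock) == 0) {
            bool ok = steal_locked(victims[i]);
            pthread_mutex_unlock(&victims[i]->lock);
            if (ok)
                return true;
        } else {
            skipped[i] = true;
        }
    }

    /* Second pass: only now are we willing to wait on a contended lock. */
    for (int i = 0; i < n; i++) {
        if (!skipped[i])
            continue;
        pthread_mutex_lock(&victims[i]->lock);
        bool ok = steal_locked(victims[i]);
        pthread_mutex_unlock(&victims[i]->lock);
        if (ok)
            return true;
    }
    return false;
}
```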
Currently we check it in every iteration of the fast loops (and this shows up in profiling), but not at all in the final loop, which is the only one that can be slow enough to deserve it. This fixes a small point I had forgotten in the last commit.
CT Test Results: 3 files, 141 suites, 50m 51s ⏱️. Results for commit 8ac8af9. ♻️ This comment has been updated with latest results. To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally. // Erlang/OTP Github Action Bot
The code was structured as follows:

```c
if (is_normal_sched) {
    ..
} else {
    ..
    qbit = ..
}
..
if (!is_normal_sched)
    clear_proc_dirty_queue_bit(p, rq, qbit);
```

and it resulted in an error deep inside clear_proc_dirty_queue_bit (probably because of some inlining):

"error: ‘qbit’ may be used uninitialized in this function [-Werror=maybe-uninitialized]"

This issue was previously hidden because the control flow leading to this part of the code was a bit different, but that changed with the commit to "directly execute stolen processes", revealing this limitation of the compiler warning. This commit fixes it, simply by pushing the initialization of qbit outside of the "if (is_normal_sched)" condition.
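The shape of the fix, sketched (illustrative; `compute_qbit` is a hypothetical stand-in for whatever the dirty-scheduler branch does to produce qbit):

```c
#include <stdint.h>

typedef struct process process_t;
typedef struct run_queue run_queue_t;

void clear_proc_dirty_queue_bit(process_t *p, run_queue_t *rq, uint32_t qbit);
uint32_t compute_qbit(process_t *p);     /* hypothetical helper */

void schedule_fragment(process_t *p, run_queue_t *rq, int is_normal_sched) {
    uint32_t qbit = 0;                   /* now initialized on every path */
    if (!is_normal_sched)
        qbit = compute_qbit(p);
    /* ... rest of the scheduler loop ... */
    if (!is_normal_sched)
        clear_proc_dirty_queue_bit(p, rq, qbit);
}
```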
I could not reproduce the failure of one test in monitor_SUITE locally. Is there some way to access detailed logs of what is going wrong? And is this test known to be flaky?
We have a couple of failures of that testcase in our lab, so yes, it can be a bit flaky. I re-triggered the tests so that we can see if it fails again.
I restarted the CI jobs again in order for this PR to run the tests in all applications, not just emulator.
You can have a look here to see which tests currently fail on a clean master build: https://erlang.org/github-pr/master/ct_logs/
Apparently that did not work; it should work the next time you push a fix to the PR, though.
The tests are all green this time, so I think this PR is ready for review.
Profiling some services on many-core machines showed very high lock contention in the scheduler, specifically in the task-stealing mechanism.
This PR has a series of commits that both shrink the related critical sections and reduce contention on these locks.
Testing: