[scheduler] Various optimizations to task stealing #9594

Open · wants to merge 11 commits into master

Conversation

RobinMorisset (Contributor)

Profiling some services on many-core machines showed very high lock contention in the scheduler, specifically in the task-stealing mechanism.
This PR contains a series of commits that both shrink the related critical sections and reduce contention on these locks.

Testing:

  • make emulator_test ARGS="-suite process_SUITE"
  • tested in production on a many-core server: it showed very large improvements at low to medium load (on the order of 30% vs 40% CPU utilization), and tiny but measurable improvements at high load (about 1-2% CPU utilization). Lock counting confirmed a drop in contention on the run-queue locks.

When a scheduler runqueue is empty, that scheduler attempts to steal
work from another (hopefully non-empty) runqueue. Currently, the
sequence of events looks like this:
- Notice that there is no work to do
- Set a flag saying "don't steal work from me, I'm busy stealing"
- Remove a process from some other runqueue
- Put it back on your own runqueue
- Go back to the beginning of the scheduler loop
- Notice that there is now work to do
- Take the process from your runqueue (that you just added it to)
- Unset the flag
- Execute that process
The new sequence looks like this:
- Notice that there is no work to do
- Remove a process from some other runqueue
- Execute that process

There is no longer any need for a flag to protect our runqueue from stealing,
since the process we stole never spends time in it (and so is never at
risk of being stolen back from us, which could lead to some kind of
livelock).
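To make the restructuring concrete, here is a minimal, hypothetical sketch of the new flow (not the actual erl_process.c code; the types and helpers are illustrative and the per-queue locking is elided):

```
/* Hypothetical sketch only: simplified types and helpers, locking elided.
 * The point is the shape of the new control flow. */
#include <stdio.h>
#include <stddef.h>

typedef struct proc { int id; struct proc *next; } proc_t;
typedef struct { proc_t *first; } run_queue;

static proc_t *dequeue(run_queue *rq) {
    proc_t *p = rq->first;
    if (p)
        rq->first = p->next;
    return p;
}

static void execute(proc_t *p) { printf("running process %d\n", p->id); }

/* One pass of a simplified scheduler loop. */
static void schedule_once(run_queue *own, run_queue *victim)
{
    proc_t *p = dequeue(own);
    if (!p) {
        /* Old flow: set a "don't steal from me" flag, move the stolen
         * process onto `own`, restart the loop, and dequeue it again.
         * New flow: run it directly, so it never sits in `own`, can never
         * be stolen back, and the flag is no longer needed. */
        p = dequeue(victim);
    }
    if (p)
        execute(p);
}

int main(void) {
    proc_t a = { 42, NULL };
    run_queue own = { NULL }, victim = { &a };
    schedule_once(&own, &victim);   /* own queue is empty: steal and run 42 */
    return 0;
}
```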
Currently, the scheduler uses work stealing and steals a single task at
a time. This means that as soon as it is done with that one task, it
needs to steal another one, and another, and so on.
All of this task stealing needs to take locks and can be a bottleneck.
As a first step towards removing this bottleneck, this patch changes the
scheduler to steal half the tasks in a queue at once.
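As a rough illustration of the policy (the helper name is made up; the cap of 100 comes from a later commit in this series):

```
/* How many processes to take from the victim while its lock is held:
 * half of its queue, bounded so the critical section stays short. */
static unsigned steal_count(unsigned victim_len)
{
    unsigned n = victim_len / 2;
    return n > 100 ? 100 : n;
}
```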
Instead of unlocking at the last moment and locking again right after,
we now unlock at the beginning of try_steal_task and only re-lock at the
end of try_steal_task.

The only access to rq during that window is rq->ix, which is written only
once, in erts_init_scheduling, so there should be no new race.
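A hedged sketch of the resulting locking pattern, using pthreads in place of the emulator's own lock API and made-up helper names:

```
#include <pthread.h>

struct proc;

typedef struct run_queue {
    pthread_mutex_t lock;
    int ix;                 /* queue index: written once at init, then read-only */
    struct proc *first;
} run_queue;

struct proc *steal_from_others(int my_ix);    /* walks the other run queues */

struct proc *try_steal_task(run_queue *rq)
{
    struct proc *stolen;
    pthread_mutex_unlock(&rq->lock);     /* drop our own lock up front ...         */
    stolen = steal_from_others(rq->ix);  /* ... only rq->ix is read while unlocked */
    pthread_mutex_lock(&rq->lock);       /* ... and re-acquire it at the end       */
    return stolen;
}
```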
We now steal up to 100 tasks at once, but we were updating the victim
runqueue length by repeated decrements, and updating the beneficiary
runqueue length by repeated increments. This was not just extra work: it
also increased the length of the critical sections (since these updates
were done while holding the relevant locks).

This change batches the increments/decrements, doing a single
addition/subtraction.
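A hypothetical before/after sketch of the counter update (not the emulator's actual accounting; each update is done under the corresponding queue's lock):

```
/* Before: the victim's queue length was updated once per stolen process,
 * inside the critical section. */
static void dec_len_per_process(unsigned *victim_len, unsigned stolen)
{
    for (unsigned i = 0; i < stolen; i++)
        (*victim_len)--;
}

/* After: a single subtraction for the whole batch (and symmetrically a
 * single addition for the beneficiary queue). */
static void dec_len_batched(unsigned *victim_len, unsigned stolen)
{
    *victim_len -= stolen;
}
```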
Currently we steal half the tasks by alternating between taking a
process and leaving it behind. With this commit we instead compute how
many tasks to steal, and then steal that many without skipping any
non-bound processes.
This should halve the number of pointer dereferences during task
stealing, and thus also halve the number of cache misses, making that
(highly contended) critical section faster.
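A sketch of the idea under the same simplifying assumptions as above (illustrative types, locking elided): take the first n stealable processes in a row instead of every other one, so the walk touches roughly half as many list nodes.

```
#include <stdbool.h>
#include <stddef.h>

typedef struct proc { bool bound; struct proc *next; } proc_t;
typedef struct { proc_t *first; } run_queue;

/* Unlink up to n non-bound processes from `victim`, returning them as a
 * NULL-terminated singly linked list in their original order. */
static proc_t *steal_n(run_queue *victim, unsigned n)
{
    proc_t *stolen_head = NULL, **stolen_tail = &stolen_head;
    proc_t **cur = &victim->first;

    while (n > 0 && *cur != NULL) {
        proc_t *p = *cur;
        if (p->bound) {          /* bound processes must stay on their queue */
            cur = &p->next;
            continue;
        }
        *cur = p->next;          /* unlink from the victim */
        p->next = NULL;
        *stolen_tail = p;        /* append to the stolen list */
        stolen_tail = &p->next;
        n--;
    }
    return stolen_head;
}
```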
Instead of using a stack to store the tasks being stolen and turning them
back into a linked list at the end, we can keep them as a linked list
throughout. The main benefit is shrinking the critical section where we
re-enqueue them.
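Continuing the sketch above: because the stolen processes already form a linked list, re-enqueueing them is a couple of pointer writes, which is what keeps that critical section small.

```
typedef struct proc { struct proc *next; } proc_t;
typedef struct { proc_t *first; proc_t *last; } run_queue;

/* Splice an already-linked batch [head .. tail] onto the end of our queue. */
static void enqueue_batch(run_queue *own, proc_t *head, proc_t *tail)
{
    if (own->last)
        own->last->next = head;
    else
        own->first = head;
    own->last = tail;
}
```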
If we cannot immediately take the lock required to steal a task, we skip
that runqueue for the moment and retry it only once we have looked at
all other runqueues. This mitigates the performance hit
we see where dozens of threads try to steal from the same runqueue at
once, by spreading the load.
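A hedged sketch of the retry scheme (pthreads stand in for the emulator's lock API, and the names are made up; whether the eventual retry blocks or merely tries again is a detail the commit text does not specify):

```
#include <pthread.h>
#include <stdbool.h>

#define NQUEUES 8

typedef struct { pthread_mutex_t lock; /* ... */ } run_queue;

bool steal_locked(run_queue *victim);   /* assumes victim->lock is already held */

bool try_steal_round(run_queue queues[NQUEUES], int my_ix)
{
    bool deferred[NQUEUES] = { false };

    /* First pass: only trylock, so a contended victim is skipped for now. */
    for (int i = 0; i < NQUEUES; i++) {
        if (i == my_ix)
            continue;
        if (pthread_mutex_trylock(&queues[i].lock) != 0) {
            deferred[i] = true;          /* busy: come back to it later */
            continue;
        }
        bool ok = steal_locked(&queues[i]);
        pthread_mutex_unlock(&queues[i].lock);
        if (ok)
            return true;
    }

    /* Second pass: only the queues we skipped, now waiting for the lock. */
    for (int i = 0; i < NQUEUES; i++) {
        if (!deferred[i])
            continue;
        pthread_mutex_lock(&queues[i].lock);
        bool ok = steal_locked(&queues[i]);
        pthread_mutex_unlock(&queues[i].lock);
        if (ok)
            return true;
    }
    return false;
}
```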
Currently we check it on every iteration of the fast loops (and this
shows up in profiling), but not at all in the final loop (which is
the only one slow enough to deserve it). This just fixes a small point
I had forgotten in the previous commit.
github-actions bot commented Mar 17, 2025

CT Test Results

3 files, 141 suites, 50m 51s ⏱️
1 614 tests: 1 564 ✅ passed, 50 💤 skipped, 0 ❌ failed
2 334 runs: 2 264 ✅ passed, 70 💤 skipped, 0 ❌ failed

Results for commit 8ac8af9.

♻️ This comment has been updated with the latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

The code was structured as follows:
```
if (is_normal_sched) {
  ..
} else {
  ..
  qbit = ..
}
..
if (!is_normal_sched)
  clear_proc_dirty_queue_bit(p, rq, qbit);
```
and it resulted in an error deep in clear_proc_dirty_queue_bit (probably
because of some inlining): "error: ‘qbit’ may be used uninitialized in this
function [-Werror=maybe-uninitialized]"

This issue was previously hidden because of the control flow leading to
this part of the code being a bit different, but that changed with the
commit to "directly execute stolen processes", revealing this limitation
of the compiler warning.

This commit fixes it, simply by pushing the initialization of qbit
outside of the "if (is_normal_sched)" condition.
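A sketch of the fix, mirroring the structure quoted above (the placeholder value is illustrative; only the position of the initialization matters):
```
qbit = 0;      /* initialized unconditionally, before the branch */
if (is_normal_sched) {
  ..
} else {
  ..
  qbit = ..    /* real value still computed on the dirty-scheduler path */
}
..
if (!is_normal_sched)
  clear_proc_dirty_queue_bit(p, rq, qbit);
```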
@RobinMorisset (Contributor, Author)

I could not reproduce the failure of one test in monitor_SUITE locally. Is there some way to access detailed logs of what is going wrong? And is this test known to be flaky?

@garazdawi (Contributor)

> I could not reproduce the failure of one test in monitor_SUITE locally. And is this test known to be flaky?

We have a couple of failures of that testcase in our lab, so yes it can be a bit flaky. I re-triggered the tests so that we can see if it fails again.

@garazdawi added the team:VM (Assigned to OTP team VM) and full-build-and-check (Run a more thorough testing of this PR) labels on Mar 20, 2025
@garazdawi (Contributor)

I restarted the CI jobs again in order for this PR to run the tests in all applications, not just emulator.

@garazdawi (Contributor)

You can have a look here to see which tests currently fail on a clean master build: https://erlang.org/github-pr/master/ct_logs/

@garazdawi (Contributor)

> I restarted the CI jobs again in order for this PR to run the tests in all applications, not just emulator.

Apparently that did not work; it should work the next time you push a fix to the PR, though.

@RobinMorisset (Contributor, Author)

The tests are all green this time, so I think this PR is ready for review.
Thanks for restarting these tests, and for confirming that the failing test was known to be flaky.
