-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Affinity in DDFacet not compatible with slurm system #36
Comments
Looks like there's a bug in AsyncProcessPool.py line 276:
|
Could you put your working change in a pull request? |
Sure, done. |
Hi, I came across this issue as I was experiencing the similar concerns about DDF on slurm environment. I have created a custom script that can read correct Cores allocated for a slurm Job and update the DDF config file before running DDF. The script assumes that in the parallel section of the default DDF config "MainProcessAffinity = 0". Running this will then overide the value of the import os
import subprocess
import psutil
logical_cpus = psutil.cpu_count(logical=True)
physical_cpus = psutil.cpu_count(logical=False)
print(f"Logical CPUs: {logical_cpus}")
print(f"Physical CPUs: {physical_cpus}")
print(f"checking affinity length: {len(psutil.Process().cpu_affinity())}")
process = psutil.Process() # Current process
print(f"The list of CPU's a process can run on: {process.cpu_affinity()}")
cpu_affinity_list = process.cpu_affinity()
# Define the file name
filename = "xxx.parset"
# Define the target line content to search for and the replacement content
target_line = "MainProcessAffinity = 0"
replacement_line = f"MainProcessAffinity = {cpu_affinity_list[0]}"
# Read the file and replace the line if it matches the target
with open(filename, "r") as file:
lines = file.readlines()
# Modify the line containing the target
with open(filename, "w") as file:
for line in lines:
# Check if the line contains the target and replace it
if line.strip() == target_line:
file.write(replacement_line + "\n")
else:
file.write(line) My parallel section of the DDF config when testing this was [Parallel]
NCPU = 32
Affinity = 1
MainProcessAffinity = 0 |
Yes, this is the workround if there's no fix within DDFacet itself. We discussed this extensively over in https://github.com/cyriltasse/DDFacet/issues/657 some years ago but it seems that nobody ever implemented a sane default behaviour. |
OK, I think this needs a better fix. I have been looking at this for Tim and the workround above, though it looks like it works, actually doesn't on my test system -- it forces everything to run on one core, the one specified by Parallel-MainProcessAffinity. Moreover, disabling affinity with --Parallel-Affinity=disabled doesn't work at all if the cores available are not in the range(0,ncpu), because the code still attempts to set worker affinity to a number in that range. I've implemented some logic that (a) works correctly if Parallel-Affinity=disabled (or 0) and (b) if Parallel-Affinity=1 starts from psutil.Process().cpuaffinity() to get the available CPU list and then works things out accordingly if it finds that there are fewer CPUs available than psutil.cpu_count(). The big issue is that this still doesn't work with something that attempts to set the parent process affinity. We have at the moment in master in
which does nothing if parent_affinity==0 (the default). As soon as this is set to a number other than zero, then it appears that all children inherit that affinity, even if they are explicitly set to have some other affinity (which they are). I don't understand this entirely, but I don't think the parent affinity code has ever worked as intended. And if the line to set parent affinity is commented out, everything else appears to work, though it's difficult to tell if the worker processes are indeed bound to the cores they're supposed to be bound to. I am putting in a PR for my working code, please don't merge it yet but it shows what is needed to get this to work. |
Hi,
We alluded to this issue briefly in issue #35 but I wanted to create a new issue for it as it wan't addressed and has become a showstopper for me in terms of being able to run DDFacet on the cluster I'm using (which operates using slurm). Essentially, DDFacet cannot tell if you are only using a subset of cores -- it counts all cores that exist on the node (and leads to the error message attached as a screenshot). This is a problem if you are unable to use all cores on a node (the case for me). In the example below, I requested 32 cores (the max I'm allowed to request) and there are 36 cores total on the node:
I tried to circumvent this whole issue by simply turning affinity off using the option --Parallel-Affinity=disable. This however does not work for me either and, given the error message (see log file attached), it seems as though the code still thinks affinity is on. Any ideas on how I can get the software to run without running into these affinity problems? Or more simply, how I can definitively turn affinity off?
slurm-26674639.txt
Thanks!
Kelly
The text was updated successfully, but these errors were encountered: