Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Affinity in DDFacet not compatible with slurm system #36

Open
kgourdji opened this issue Apr 7, 2022 · 6 comments
Open

Affinity in DDFacet not compatible with slurm system #36

kgourdji opened this issue Apr 7, 2022 · 6 comments

Comments

@kgourdji
Copy link

kgourdji commented Apr 7, 2022

Hi,

We alluded to this issue briefly in issue #35 but I wanted to create a new issue for it as it wan't addressed and has become a showstopper for me in terms of being able to run DDFacet on the cluster I'm using (which operates using slurm). Essentially, DDFacet cannot tell if you are only using a subset of cores -- it counts all cores that exist on the node (and leads to the error message attached as a screenshot). This is a problem if you are unable to use all cores on a node (the case for me). In the example below, I requested 32 cores (the max I'm allowed to request) and there are 36 cores total on the node:

error1_with_affinity_on

I tried to circumvent this whole issue by simply turning affinity off using the option --Parallel-Affinity=disable. This however does not work for me either and, given the error message (see log file attached), it seems as though the code still thinks affinity is on. Any ideas on how I can get the software to run without running into these affinity problems? Or more simply, how I can definitively turn affinity off?

slurm-26674639.txt

Thanks!
Kelly

@kgourdji
Copy link
Author

Looks like there's a bug in AsyncProcessPool.py line 276:

psutil.Process().cpu_affinity(range(self.ncpu) if not self.parent_affinity else [self.parent_affinity])
this should be within the preceding else statement but currently is not. I changed that and also commented out lines 827 and 828 for good measure and I got DDFacet to work with slurm. Note that I'm not sure whether the latter bit is necessary as I didn't test those two changes individually.

@mhardcastle
Copy link
Collaborator

Could you put your working change in a pull request?

@kgourdji
Copy link
Author

Sure, done.
Note that my solution does not address the problem of DDFacet not being able to ascertain the correct number of cores in a slurm enviornment. I wanted to turn affinity off to circumvent the issue in my screenshot above, and ran into this unrelated bug in the process. For now it seems the only way to use DDFacet in a slurm HPC system is if 1) all cores on a node are requested or 2) affinity is turned off.

@pfunzowalter
Copy link

Hi, I came across this issue as I was experiencing the similar concerns about DDF on slurm environment. I have created a custom script that can read correct Cores allocated for a slurm Job and update the DDF config file before running DDF. The script assumes that in the parallel section of the default DDF config "MainProcessAffinity = 0". Running this will then overide the value of the MainProcessAffinity with one of the core allocated for the slurm Job which will then provide an opportunity to use a subset of cores on a node.

import os
import subprocess
import psutil

logical_cpus = psutil.cpu_count(logical=True)
physical_cpus = psutil.cpu_count(logical=False)

print(f"Logical CPUs: {logical_cpus}")
print(f"Physical CPUs: {physical_cpus}")

print(f"checking affinity length: {len(psutil.Process().cpu_affinity())}")

process = psutil.Process()  # Current process
print(f"The list of CPU's a process can run on: {process.cpu_affinity()}")

cpu_affinity_list = process.cpu_affinity()

# Define the file name
filename = "xxx.parset"

# Define the target line content to search for and the replacement content
target_line = "MainProcessAffinity = 0"
replacement_line = f"MainProcessAffinity = {cpu_affinity_list[0]}"

# Read the file and replace the line if it matches the target
with open(filename, "r") as file:
    lines = file.readlines()

# Modify the line containing the target
with open(filename, "w") as file:
    for line in lines:
        # Check if the line contains the target and replace it
        if line.strip() == target_line:
            file.write(replacement_line + "\n")
        else:
            file.write(line)

My parallel section of the DDF config when testing this was

[Parallel]
NCPU = 32 
Affinity = 1 
MainProcessAffinity = 0 

@mhardcastle
Copy link
Collaborator

Yes, this is the workround if there's no fix within DDFacet itself.

We discussed this extensively over in https://github.com/cyriltasse/DDFacet/issues/657 some years ago but it seems that nobody ever implemented a sane default behaviour.

@mhardcastle mhardcastle reopened this Nov 24, 2024
@mhardcastle
Copy link
Collaborator

OK, I think this needs a better fix. I have been looking at this for Tim and the workround above, though it looks like it works, actually doesn't on my test system -- it forces everything to run on one core, the one specified by Parallel-MainProcessAffinity. Moreover, disabling affinity with --Parallel-Affinity=disabled doesn't work at all if the cores available are not in the range(0,ncpu), because the code still attempts to set worker affinity to a number in that range.

I've implemented some logic that (a) works correctly if Parallel-Affinity=disabled (or 0) and (b) if Parallel-Affinity=1 starts from psutil.Process().cpuaffinity() to get the available CPU list and then works things out accordingly if it finds that there are fewer CPUs available than psutil.cpu_count().

The big issue is that this still doesn't work with something that attempts to set the parent process affinity. We have at the moment in master in AsyncProcessPool.py

psutil.Process().cpu_affinity(range(self.ncpu) if not self.parent_affinity else [self.parent_affinity])

which does nothing if parent_affinity==0 (the default). As soon as this is set to a number other than zero, then it appears that all children inherit that affinity, even if they are explicitly set to have some other affinity (which they are). I don't understand this entirely, but I don't think the parent affinity code has ever worked as intended. And if the line to set parent affinity is commented out, everything else appears to work, though it's difficult to tell if the worker processes are indeed bound to the cores they're supposed to be bound to.

I am putting in a PR for my working code, please don't merge it yet but it shows what is needed to get this to work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants