Releases: PanDAWMS/pilot3
3.6.6.22
- Improved termination and cleanup of prmon processes, which should eliminate any lingering zombie/defunct processes
3.6.5.32
- Measures against problems with lingering defunct processes
- Added internal timeout of 300s to curl call (on top of existing connection and max time timeouts)
- Added zombie reaper as part of job monitoring, executed at the time of the looping job check, i.e. every ten minutes
- Problems seen at least at MWT2 with a large number of lingering prmon processes and work directories
- Prmon has a suspected problem with being killed by SIGUSR1, which often leaves it in a defunct state. The pilot waits for the process to end normally, but this mostly fails. This is being discussed with the prmon developers
- Problem appears to have started around July 29 (unclear why), but lingering prmon processes seem to be everywhere. These defunct processes normally go away when the top bash process ends, but can remain if parent processes are hard killed. If there are too many defunct processes, the batch system may kill the parent without warning, which leaves lingering work directories that in turn require external cleanup
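A zombie reaper of the kind mentioned above can be sketched as follows. This is a minimal illustration and not the pilot's actual code; the function name reap_zombies is hypothetical. It collects the exit status of any finished child process without blocking.

```python
import os


def reap_zombies():
    """Collect exit statuses of finished child processes without blocking.

    Returns the list of PIDs that were reaped.
    """
    reaped = []
    while True:
        try:
            # -1: any child; WNOHANG: return immediately if none has exited
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none is defunct right now
        reaped.append(pid)
    return reaped
```

Note that waiting on any child can also collect an exit status that other code (e.g. a subprocess.Popen object) expected to collect itself, so a reaper like this is best called at a quiet point such as a periodic monitoring check.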
- Python 3.11 tests
- ALRB setup now supports Python 3.11 (A. De Silva)
- Tested successfully manually/interactively on Lxplus9/Alma9
- This included Rucio stage-in (Rucio stage-out fails, as expected, since I don’t have permission to write to the SE while running interactively, i.e. with my own proxy)
- Logstash 2.3.0 and 2.5.0 tested successfully for real-time logging
- Also tested new logstash version 2.5.0 on CentOS7, works fine
- Added Python 3.11 to flake8 and unit tests
- Irrelevant ‘warning’ from lsetup cpu_flags ignored by the pilot (would otherwise lead to failure by pilot to interpret cpu_arch output)
- Reduced number of ps command calls
- Pilot uses ps to get info about processes in various situations which can be heavy on the system when there are several pilots running simultaneously
- Cached ps output when collecting child pids
- Removed several ps calls
- Note: there are still quite a few ps calls during a long-running job, since the output is needed for the CPU consumption reporting. This will soon be addressed as well, as A. De Silva has made the psutil module available (the related pilot development is pending a wrapper update)
- Requested by J. Templon
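The idea of caching ps output when collecting child pids can be sketched like this: ps is run once, the pid/ppid table is kept, and all descendants are then found from the cached table. The function names are hypothetical and the ps options assume a POSIX ps; this is an illustration, not the pilot's implementation.

```python
import subprocess


def parse_ps_table(output):
    """Turn 'pid ppid' lines into a {pid: ppid} dictionary."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
            table[int(fields[0])] = int(fields[1])
    return table


def read_ps_table():
    """Run ps once (with a timeout) and return the parsed table."""
    output = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid="],
        capture_output=True, text=True, timeout=60, check=True,
    ).stdout
    return parse_ps_table(output)


def find_descendants(root_pid, table):
    """Return all descendants of root_pid using only the cached table."""
    children = {}
    for pid, ppid in table.items():
        children.setdefault(ppid, []).append(pid)
    found, stack = [], [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found
```

A caller would use table = read_ps_table() once and then call find_descendants(pid, table) as often as needed without further ps calls.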
- Moved multiprocessing module import to where it is used (instead of the top of the module)
- To prevent it from causing possible deadlocks in importing esp. Google Cloud Logging modules, which have known problems with multiprocessing
- This change is most relevant for Rubin since they have seen locking behavior when importing gcloud modules
- Change done to job control module (there is also usage in timer module)
- No output file verification for Raythena jobs since the final job report will not be known by the pilot (Raythena will handle it)
- Requested by J. Esseiva
- https://its.cern.ch/jira/browse/ATLASAMI-316
- Added size based time-out to log file creation
- Based on the size of the work directory (min timeout set to 90s, max 3h)
- New error code 1376, "Log file creation timed out"
- Requested by X. Zhao (sPHENIX) but change is relevant also for ATLAS
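A size-based time-out with the limits stated above (min 90 s, max 3 h) could look like the sketch below. The scaling between directory size and time is not given in these notes; the throughput constant ASSUMED_RATE is purely an assumption, as are the function names.

```python
import os

MIN_TIMEOUT = 90            # seconds (from the release note)
MAX_TIMEOUT = 3 * 3600      # seconds (from the release note)
ASSUMED_RATE = 5 * 1024 ** 2  # bytes per second; hypothetical value


def directory_size(path):
    """Sum the sizes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # file vanished while walking
    return total


def log_creation_timeout(size_bytes, rate=ASSUMED_RATE):
    """Estimate a time-out from the size and clamp it to [90 s, 3 h]."""
    estimate = size_bytes / rate
    return int(min(max(estimate, MIN_TIMEOUT), MAX_TIMEOUT))
```

Usage: timeout = log_creation_timeout(directory_size(workdir)).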
- Main command execute function updates
- Now thread safe
- An strace from MWT2 provided by F. Luehring indicated a thread lock in the execute function
- Always use a timeout on command calls
- A ridiculously long timeout is better than nothing since it will force subprocess python code to flush the stdout buffer which otherwise can be a problem on nodes with a huge number of cores
- Congested stdout buffers can lead to hanging
- Requested by W. Guan
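As an illustration of an execute function that always uses a timeout, the sketch below runs a command in its own process group, applies a (very long) default timeout, and kills the whole group if the timeout is hit. The lock, the default value and the return convention are assumptions made for this example, not the pilot's actual interface.

```python
import os
import signal
import subprocess
import threading

_execute_lock = threading.Lock()  # hypothetical: serialises process creation


def execute(cmd, timeout=7 * 24 * 3600):
    """Run a shell command, always with a timeout.

    Returns (exit_code, stdout, stderr); exit_code is None on timeout.
    """
    with _execute_lock:
        proc = subprocess.Popen(
            cmd, shell=True, text=True,
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            start_new_session=True,  # own process group, can be killed as a whole
        )
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
        return proc.returncode, stdout, stderr
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group is already gone; do not raise a second exception
        stdout, stderr = proc.communicate()  # collect whatever was written
        return None, stdout, stderr
```

Usage: exit_code, stdout, stderr = execute("ls -l", timeout=60).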
- Truncating WARNING field in job report if too large
- Report includes the first 25 warnings
- Original report is backed up and kept in the log
- Requested by R. Walker
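The truncation described above can be sketched as follows, assuming for illustration that the report is a dictionary with a top-level WARNING list (the real job report layout may differ). The original is returned as well so that it can be backed up.

```python
import copy

MAX_WARNINGS = 25  # from the release note


def truncate_warnings(report, key="WARNING", limit=MAX_WARNINGS):
    """Return (report_to_send, backup); backup is None if nothing was cut."""
    warnings = report.get(key)
    if not isinstance(warnings, list) or len(warnings) <= limit:
        return report, None
    backup = copy.deepcopy(report)      # untouched original, to keep in the log
    trimmed = dict(report)
    trimmed[key] = warnings[:limit]     # keep only the first 'limit' warnings
    return trimmed, backup
```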
- Real-time logging update
- rtlogging field is now experiment specific (used to define RT server)
- Requested by X. Zhao (sPHENIX), change is transparent for other experiments
- Updated encoded HTCondor env var
- Requested by X. Zhao (sPHENIX)
- Object store updates (Rubin)
- Added env var support for selecting a different AWS_PROFILE: for Rubin, multiple object stores can be used (the pilot uses one and the Rubin payload uses another). AWS_PROFILE can be used to select different credentials for authentication
- Added copy_out_extend function: by default, the pilot puts all log files in a tar file and copies out this tar file. For Rubin, different log files need to be copied out separately without putting them into a tar file. An environment variable selects whether to use the copy_out_extend function
- Fixed upload_files so that it can use a different endpoint and bucket name
- PanDA/Dask integration related changes
- Pilot now keeps job in running state until lease time is up
- Interactive job can now be aborted by user
Contributions from W. Guan, P. Nilsson
3.6.4.7
- Urgent patch for problem with job monitor loop leading to lost heartbeats seen in RC prod jobs (not seen in shorter running dev jobs)
- Other changes in this version are related to unfinished Dask tests
3.6.3.8
- Updates to location detection for replicas
- Misleading error message seen on ANALY_FZK_VP after a requests.post time-out that triggered a secondary exception due to a misused class decorator
- Increased time-out from 1s to 10s when contacting location.cern.workers.dev service (median response time is 1.6 ms)
- No failure is set if location detection fails, warning only written to log
3.6.2.9
- Added longitude and latitude info in replica lookup for VP sites
- Requested by I. Vukotic
- Removed outdated option “--alg gcc” from cpu_arch command
- The external script was updated in the meantime which broke the usage in the pilot
- Requested by J. Elmsheuser
- Remote file open time-outs
- Added missing init of signal variable, which led to trouble on CA-VICTORIA-K8S-T2
- Reference RC dev job on similar queue LRZ-LMU_K8S: https://bigpanda.cern.ch/job?pandaid=5900134141
- Requested by R. Taylor
- Housekeeping
- Added .pylintrc to distribution, in preparation for Pylint GitHub Action
- Pylint related updates on five modules
- activemq (9.5/10)
- auxiliary (9.8/10)
- common (10/10)
- config (10/10)
- pilot (9.9/10)
3.6.1.31
- Prevented CPU architecture script from being executed when not wanted (no change for ATLAS)
- Seems to occasionally cause hanging on Rubin resources
- Time-outs and remote file open verification
- Enforcing a stdout buffer flush in remote file open script (always) as well as when receiving a time-out exception in script execution. It might help in some cases, but not if it’s the container setup that is hanging
- Requested by R. Walker
- Writing all messages from remote file open script to new text file, “remotefileslog-instant.txt” - as opposed to only creating this file using stdout after the container has finished
- Any time-out info will be written to “remote_open.std*” files
- Container setup and/or time-out exception will be written to “remotefileslog.txt” as before
- (A later pilot version can extract the last file open message from this file and add to the error diagnostics)
- Fix for recursive kills after time-out (leading to many kill attempts)
- Requested by R. Taylor
- Added support for output file with regular expression
- Pilot looks for matching files when it finds ‘regex|..’ expression in LFN and updates output file list
- Requested by T. Maeno, J. Webb and X. Zhao for sPHENIX
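A possible way to expand such LFNs is sketched below. It assumes that the marker is the prefix regex| and that the rest of the LFN is a pattern that must match a whole file name in the work directory; both the exact syntax and the function name are assumptions for illustration.

```python
import os
import re

PREFIX = "regex|"  # assumed position and spelling of the marker


def expand_output_lfns(lfns, workdir):
    """Replace 'regex|<pattern>' entries with the matching file names."""
    expanded = []
    for lfn in lfns:
        if lfn.startswith(PREFIX):
            pattern = re.compile(lfn[len(PREFIX):])
            matches = sorted(
                name for name in os.listdir(workdir) if pattern.fullmatch(name)
            )
            expanded.extend(matches)
        else:
            expanded.append(lfn)  # ordinary LFN, kept as-is
    return expanded
```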
- Added time-out to ps execution (for CPU activity monitoring) since Rubin reported that this operation can hang on problematic nodes
- Only show internal memory usage in debug mode
- To prevent excessive calls to ps command
- Requested by R. Walker
- Pilot changes related to PanDA/Dask integration (interactive mode)
- Updated and improved logic for stage-in when pilot is running in a pod in stager mode
- Pilot stages in input files then quits
- Currently this leads to job finishing even though the user will still be using jupyter on the resource (a later pilot version can keep the job in running state until end of lease time)
- Checksum type can now be selected in pilot config
- container_type=md5 or adler32
- Requested by J. Webb (BNL) for sPHENIX
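For reference, both checksum types can be computed with the Python standard library. The sketch below is only an illustration of a selectable algorithm; the function name and the algorithm parameter are hypothetical and do not reflect the pilot's config handling.

```python
import hashlib
import zlib


def calculate_checksum(path, algorithm="adler32", blocksize=1024 * 1024):
    """Return the hex checksum of a file, read block by block."""
    if algorithm == "adler32":
        value = 1  # defined start value of adler32
        with open(path, "rb") as handle:
            for block in iter(lambda: handle.read(blocksize), b""):
                value = zlib.adler32(block, value)
        return "%08x" % (value & 0xFFFFFFFF)
    if algorithm == "md5":
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for block in iter(lambda: handle.read(blocksize), b""):
                digest.update(block)
        return digest.hexdigest()
    raise ValueError("unsupported checksum type: %s" % algorithm)
```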
- Preserving file attributes (timestamps, mode, ownership) while copying pilot source into container (A.A.)
- VP jobs now using ignore_availability=False when looking up replicas (I.V.)
- To bypass replica sorting issue seen in VP jobs where algorithm picked replica from site that was in downtime
- Housekeeping
- Processed multiple files with pylint and implemented solutions (typical scores: 7-9+ / 10)
Code contributions from A. Anisenkov, I. Vukotic, P. Nilsson.
3.6.0.108
- Pilot now executes script cpu_arch.py (option --alg gcc) that reports the CPU architecture
- Requested by A. Serhan Mete
- For details, see https://its.cern.ch/jira/browse/ATLINFR-4844
- For non-ATLAS experiments, pilot is using internal cpu_arch script, and lsetup cpu_flags for ATLAS
- Resilience against slow networks
- Problem seen with a Rubin job during severe network issues (a jobUpdate was very slow to finish, and by the time it did, the actual job had already finished, which led to problems with the secondary job)
- Now making sure that job workdir actually still exists when pilot receives a ‘tobekilled’ instruction - in order to prevent total abort; ‘tobekilled’ will also no longer lead to pilot ending
- Zipping all oversized files (typically payload stdout or other log files created by the payload)
- Previously, pilot deleted these files
- Size of archive also checked, deleted if too big
- Requested by R. Walker
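The zip-then-check behaviour described above can be sketched as follows. The size limits are passed in as parameters since the actual limits are not stated here; the function name and return values are hypothetical.

```python
import os
import zipfile


def archive_oversized_file(path, max_file_size, max_archive_size):
    """Zip a file that is too large; drop the archive if it is still too big.

    Returns 'kept', 'zipped' or 'deleted'.
    """
    if os.path.getsize(path) <= max_file_size:
        return "kept"
    archive = path + ".zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(path, arcname=os.path.basename(path))
    os.remove(path)  # the original is replaced by the archive
    if os.path.getsize(archive) > max_archive_size:
        os.remove(archive)  # even the compressed file is too large
        return "deleted"
    return "zipped"
```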
- Immediate server update after batch kill
- Requested by R. Walker
- Use job.maxwalltime if available instead of PQ.maxwalltime (push queues only)
- Requested by R. Walker
- Redirecting stdout/stderr from remote file open command to files to avoid lost output in case of time-out exception
- Requested by R. Walker
- Pilot now sleeps two minutes (configurable) between PanDA server updates in case of trouble
- Requested by W. Guan
- HTCondor environmental variable
- Now setting new env var HTCondor_JOB_ID for debugging purposes with the following format
- <PanDA ID>:<processing type>:<cluster ID>.<process ID>_<schedd name code>
- Due to a max allowed length of 31 chars, the cluster ID and process ID are converted to hex
- Pilot enforces the max length
- Lustre has the ability to tag the JobID for monitoring purposes. The new variable is defined before any Lustre activity starts
- Requested by D. Benjamin for sPHENIX, but could be useful on all HTCondor systems
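The construction of the HTCondor_JOB_ID value can be illustrated as below. The sketch assumes the format contains no spaces and that the length limit is enforced by simple truncation; how the pilot actually enforces it is not stated here, and the example values are made up.

```python
MAX_LENGTH = 31  # max allowed length (from the release note)


def htcondor_job_id(panda_id, processing_type, cluster_id, process_id, schedd_code):
    """Build '<PanDA ID>:<processing type>:<cluster ID>.<process ID>_<schedd code>'.

    The cluster and process IDs are written in hex to save characters.
    """
    value = "%s:%s:%x.%x_%s" % (
        panda_id, processing_type, cluster_id, process_id, schedd_code
    )
    return value[:MAX_LENGTH]  # assumption: enforce the limit by truncation


# Example with made-up values
print(htcondor_job_id(5900134141, "reco", 1234567, 42, "s1"))
```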
- Allowing mv copytool to move files to final destination
- Activated via PQ.catchall=..,mv_final_destination
- Requested by D. Benjamin
- Dask updates
- Pilot has been tested running in a pod for dask purposes, both in interactive mode (pilot communicates with server and stages in files if necessary) and non-interactive mode (pilot runs on resource like a normal grid job)
- AlmaLinux9 related update
- Dumping /etc/os-release to log instead of trying to execute lsb_release command, which is not available on AlmaLinux9
- Requested by J. Van Eldik
- Added memory monitoring for sPHENIX
- Based on prmon and same setup as ATLAS is using (but with hardcoded path instead of using asetup)
- Requested by X. Zhao (BNL)
- Raythena updates
- Renamed internal resource from Cori to Nersc since Cori is reaching end of life and transitioning to Perlmutter
- Updated FRONTIER_SERVER to use Nersc proxy
- Fixed an issue when trying to append --preExec to the executable
- Correctly configured event service job with CA
- GitHub PR: #79
- Internal thread handling and job monitoring improved to catch rare failures
- It was reported by Rubin that the pilot could get stuck in some rare cases and not be able to finish
- Pilot is waiting for all internal threads (except main thread) to finish after graceful_stop has been set, but has a five minute time-out in case some thread is stuck
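Waiting for threads with an overall time-out can be sketched like this. The five minute default comes from the note above; the function name and the choice to return the threads that are still alive are assumptions for illustration.

```python
import threading
import time


def wait_for_threads(threads, timeout=300.0):
    """Join the given threads, sharing one overall time-out.

    Returns the threads that are still alive when the time is up.
    """
    deadline = time.monotonic() + timeout
    for thread in threads:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # out of time, stop waiting
        thread.join(timeout=remaining)
    return [thread for thread in threads if thread.is_alive()]
```

A caller would set its stop event first, then call wait_for_threads on all non-main threads and log any that are returned as stuck before exiting.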
- Pilot checks now optional
- Currently the following config options are now optional: last_heartbeat,machinefeatures,jobfeatures,cpu_usage,threads
- If not present in config.Pilot.checks, pilot will not run the corresponding check
- If config file is outdated / Pilot.config is not listed, all checks will run as before
- More checks to follow, including Payload.checks
- Requested by X. Zhao (BNL)
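The selection logic described above can be sketched as follows, assuming the configured checks arrive as a comma-separated string, or as None when the option is missing from an outdated config file. The function name is hypothetical.

```python
ALL_CHECKS = ("last_heartbeat", "machinefeatures", "jobfeatures", "cpu_usage", "threads")


def enabled_checks(configured):
    """Return the set of checks to run.

    configured: comma-separated string from the config, or None if the
    option does not exist (then all checks run as before).
    """
    if configured is None:
        return set(ALL_CHECKS)
    requested = {name.strip() for name in configured.split(",") if name.strip()}
    return requested & set(ALL_CHECKS)  # unknown names are ignored
```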
- Real-time logging
- ssl_enable and ssl_verify are now configurable (ssl_enable=True, the default, selects https transport; ssl_enable=False selects http transport)
- Requested for sPHENIX by X. Zhao (BNL)
- Bug fixes
- Fixed import problem in gs copy tool
- Requested by W. Guan
- Improved process and process group killing after command execution timeout
- Previously, it could happen that a process lookup after a timeout could trigger a second exception after the initial timeout exception
Contributions from J. Esseiva, P. Nilsson.
3.5.1.17
- Now using shutdowntime from machine features when set (in addition to existing PQ.maxtime method)
- Pilot brings down the job if approaching the shutdowntime (grace time set to ten minutes)
- Requested by R. Walker
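The shutdowntime check can be sketched as below. It assumes that the machine features directory contains a file named shutdowntime holding a UNIX timestamp; the function names are hypothetical and the ten minute grace time is taken from the note above.

```python
import os
import time

GRACE_TIME = 10 * 60  # seconds (from the release note)


def read_shutdowntime(features_dir):
    """Return the shutdowntime as a UNIX timestamp, or None if not set."""
    try:
        with open(os.path.join(features_dir, "shutdowntime")) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None  # not published or not a number


def should_stop_job(shutdowntime, now=None, grace=GRACE_TIME):
    """True if the current time is within the grace period of the shutdown."""
    if shutdowntime is None:
        return False
    now = time.time() if now is None else now
    return now >= shutdowntime - grace
```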
- Optimized gs copy tool
- Previously, the storage client was initialized for each file transfer. Now it is initialized only once
- Requested by Zhaoyu Yang (Rubin)
- Bug fixes:
- Adjusted remote file open time-out so that it only contains direct i/o files
- Previously non-direct i/o files were also taken into consideration
- Making sure that hs06 is only reported if total_cpu can be read
- Fixed case where number of cores in cpuinfo string was reported as %d-Core. Also, now using lscpu command to extract number of cores instead of summing up /proc/cpuinfo
3.5.0.31
- Improved remote file open time-outs
- Now using os.kill() function to terminate script instead of relying on python subprocess time-outs, which did not work reliably in this case (it fails to abort a running script)
- Requested by R. Walker
- Added number of cores string to cpuconsumptionunit in case it is missing
- Requested by Jyoti Prakash Biswal
- Removed escape character from Google TURLs previously thought to be necessary
- Requested by J. Elmsheuser
- Now supporting machine and job features if published by a site
- Currently pilot is (only) reporting hs06 (scaled with total CPU and core count) with job metrics
- Requested by R. Walker
- Explicitly using DN ‘atlpilo2’ for VOMS role ‘atlas’
- Until now, a unified pilot has been submitted with ‘atlpilo1’ and the production role. To improve security, the plan is to instead use the Atlas Pilot2 robot proxy, with the production role removed
- Requested by R. Walker
- Dumping arcproxy -I subject to log after proxy download
- Requested by R. Walker
- Dumping more file information (file names and their mod times of recently updated files) before killing looping job
- Requested by Rubin people [also for ATLAS jobs]
- Added job.RequestID to real-time logging message
- Requested by Zhaoyu Yang (Rubin), https://its.cern.ch/jira/browse/ATLASPANDA-785 [also for ATLAS jobs]
- Updated GitHub Actions which had a deprecation warning after a GitHub pull request
- Workflows now use node.js 16 dependent actions
- https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/
Contributions from B. Simmons, P. Nilsson.
3.4.8.5
- Removed bad character in curl command, causing trace reporting to fail
- Missed since curl still returned zero exit code. Pilot now looks for exceptions in curl stdout as well
- Reported by Fred Luehring
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-697
- Now supporting new env vars OIDC_AUTH_* when OIDC tokens are used
- PANDA_AUTH_* kept for backward compatibility
- Discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-775
- Function added to be used for reporting CPU flags in a later pilot version