Releases: PanDAWMS/pilot3
3.3.1.9
- Clean fail of job approaching batch limit
- Grace time set to ten minutes. The pilot aborts itself once it has reached PQ.maxtime - 10 minutes
- Note: pilot option -l/--lifetime {value} supersedes PQ.maxtime
- Requested by R. Walker
- Out of memory error identification improvement: Catch and report oom_kills by dmesg parsing
- For failed payloads, pilot now scans dmesg files for the corresponding payload subprocesses for out of memory errors
- Reported as (already existing) error 1212, “Payload ran out of memory”
- Requested by R. Walker
- Now sending job label 'unified' with getJob request on unified queues
- Currently using existing proxy
- Requested by R. Walker
- Fix for a problem identifying an almost expired proxy
- Previously, the pilot would report an expired proxy instead of downloading a new one
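The batch-limit check described for this release can be sketched as follows. This is a minimal illustration under stated assumptions, not the pilot's actual code: the function name `should_abort` and its parameters are hypothetical, and all times are assumed to be in seconds.

```python
GRACE_TIME = 10 * 60  # ten minutes, as stated in the notes above


def should_abort(elapsed, pq_maxtime, lifetime=None, grace=GRACE_TIME):
    """Return True once the pilot has run for (limit - grace) seconds.

    `lifetime` models the -l/--lifetime option, which supersedes
    PQ.maxtime when it is given. All names here are hypothetical.
    """
    limit = lifetime if lifetime is not None else pq_maxtime
    return elapsed >= limit - grace


# With PQ.maxtime = 3600 s, the pilot would stop after 3000 s
print(should_abort(3000, 3600))
# An explicit lifetime of 7200 s takes precedence over PQ.maxtime
print(should_abort(3000, 3600, lifetime=7200))
```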
3.3.0.39
- Reported number of cores used by payload now calculated as (stime+utime)/walltime
- Requested by R. Walker
- Keeping the payload working directory for maxRSS errors
- Requested by R. Walker
- Added a proxy renewal mechanism
- When 20 minutes or less remain of the proxy lifetime, the pilot will attempt to download a new proxy from the server
- Requested by R. Walker
- Using LAN domain for stage-out file existence verification by Rucio
- Previously, WAN domain was used by default which could lead to problems
- Updated code for extracting the condor job id
- Attempts to bypass an EPoll problem that, e.g., leaves an error message in what appears to be the class ad, which then ends up in the batch id (it sneaks in at least via the subprocess.get*output() functions)
- The origin of the problem was an XrootD issue described here and resolved here - in the meantime, while waiting for the new XrootD release, A. De Silva has reverted to a previous version
- Real-time logging
- Pilot now finds the proper log file for user jobs (tmp.stdout*) for the tails
- Removed davs protocol from list of allowed protocols for direct access
- Due to problems with different root versions on different sites
- Requested by R. Walker
- Changed internal lost heartbeat period from 6 to 3 h to match the server definition
- This is used to decide when it’s safe to kill all processes after the last successful server update
- Added protection against corrupted memory monitor info that could bring down the pilot
- Failure seen at SiGNET leading to strange aCT error info (pilot, 9000: E;r;r;o;r; ;r;e;a;d;i;n;g; ;u;s;e;r; ;g;e;n;e;r;a;t;e;d; ;o;u;t;p;u;t; ;f;i;l;e; ;l;i;s;t)
- Reported by G. Callea et al
- Bug fixes
- Fixed problem with reported expired proxy (pilot detected it but did not report it)
- Reported by R. Walker
- Pilot tried to contact server for getting the randomized server URL even though server updates were switched off
- Reported by M. Saito
- Fixed problem with parsing bad output from arcproxy, leading to an exception
- Reported by A. Filipcic
- Fix for a problem with lingering INDS env var from previous job
- Details here
- Reported by R. Walker
Code contributions from P. Vokac, A. Anisenkov, P. Nilsson
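The core-count formula introduced in this release, (stime+utime)/walltime, can be illustrated with a short sketch. The function name `cores_used` and the handling of a zero wall time are assumptions made for the example; the notes do not say how the pilot treats that case.

```python
def cores_used(stime, utime, walltime):
    """Estimate the number of cores used as CPU time divided by wall time.

    stime and utime are the system and user CPU times of the payload,
    and walltime is its elapsed time, all in the same unit.
    """
    if walltime <= 0:
        # Assumption: report zero when no wall time has been measured
        return 0.0
    return (stime + utime) / walltime


# 100 s system time + 700 s user time over 200 s of wall time
print(cores_used(100, 700, 200))
```

A payload that keeps four cores busy for the whole run accumulates roughly four times its wall time in CPU time, which is why the ratio approximates the number of cores in use.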
3.2.4.16
- Real-time logging
- Updated catchall usage with pilot config variable
- Logging server info is now defined in pilot config
- PQ.catchall should still contain logging type; logging=logstash (possibly to be replaced by a new PQ field)
- Fix 1: Previously, RT logging thread died too soon if not immediately used, i.e. impossible to activate mechanism after job start. This fix incidentally also corrects a problem leading to too many log messages, which could lead to very large log files (reported by F. Barreiro)
- Fix 2: Resetting RT logger after job finishes, to avoid a lingering logger in multi-job mode and to allow RT logging for later jobs in multi-job mode
- Added new pilot options
- --storage-url: If set, pilot uses this URL for downloading storage info
- --rucio-host: Used to set up the rucio client
- Requested by sPHENIX
- Note: these tests exposed a problem with the rucio image used for containerized stage-in/out (it is outdated and will most likely be updated by Attila Krasznahorkay, in the absence of Thomas Beermann)
- Clean-up of obsolete code in mv copy tool
Code contributions from D. Cameron, P. Nilsson
3.2.3.27
- Pilot now randomizes PanDA server URL for get job operation
- Previously, only the update job used random URL (better to fail right away in case of DNS problem)
- Requested by R. Walker
- Real-time logging
- Debug mode can now be turned on during running
- 'tail name_of_known_logfile'-instructions can now be sent to the pilot using the sendCommandToJob.py server script (in combination with setDebugMode.py). Otherwise default payload.stdout is shown
- Bug fixes
- Protection against missing ‘requests’ Python module
- Problem seen on MareNostrum
- Also fixed locally by installing said module
- Reported by A. Pages
- Protection against missing logfileReport in job report
- Problem seen with NTUPMerge jobs
- Reported by R. Walker
3.2.2.22
- Real-time logging
- Added new HTTP transport class that supports user certificates, accepted by the logstash server (for possible future use)
- Using password from job definition (hidden from view) to communicate with logstash
- Real-time logging is available for user jobs with the --debugMode prun option
- Also enabled for production tasks via script setDebugMode
- Moved log tarball away from work directory, as it overwrote logs with the same name in the case of merge jobs
- Requested by R. Walker
- Pilot can now extract special Frontier errors from job report
- Requested by M. Vogel
- Discussed in JIRA ticket: https://its.cern.ch/jira/browse/ATLASJT-425
- Added new pilot error code 1369: Frontier error (pilot adds extracted error message or last normal line to error diagnostics)
- Again using old rucio option --rse if rucio version is too old (otherwise --rses for newer version)
- Used for objectstore transfers in event service mode
- Stager workflow added
- Minimal set of threads activated
- Pilot stages in data then quits
- For pod usage only
- Bug fixes
- Fixed an issue with the pandaserver url parsing in job.py
- When both the http protocol in --url and a port were given as CLI arguments (./pilot.py --url http://pandaserverurl -p 25080 ...), the port was ignored and updateJob requests were sent to http://pandaserverurl. The reason was that the http:// prefix was not removed, so the conditional at Line 458 found a ':' and ignored the port
- Note that this only affected updateJob requests; getJob, UpdateEventRanges and other requests used the correct URL
- When using --use-https=False, an exception occurred in get_curl_command, since it was still called and used the _ctx object although https_setup() had never been called to initialize the module
- Fixed problem with not reporting remote file open failures in traces
- Reported by R. Walker
Code contributions from A. Alekseev, J. Esseiva, P. Nilsson
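The URL parsing problem fixed in this release (a port ignored because the http:// prefix contains a ':') can be avoided by letting a URL parser separate scheme, host and port. The sketch below only illustrates that idea with the standard library; it is not the code in job.py, and the function name `build_server_url`, the default scheme and the rule that a port inside the URL wins over a separate port argument are all assumptions. Any path in the URL is dropped for simplicity.

```python
from urllib.parse import urlsplit


def build_server_url(url, port=None, default_scheme="https"):
    """Combine a --url value (with or without scheme) and a port into one URL."""
    # Add a scheme if it is missing, so that urlsplit places the host in netloc
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    parts = urlsplit(url)
    # Assumption: a port embedded in the URL wins over the separate argument
    effective_port = parts.port if parts.port is not None else port
    if effective_port is None:
        return f"{parts.scheme}://{parts.hostname}"
    return f"{parts.scheme}://{parts.hostname}:{effective_port}"


# The case from the notes above: protocol in --url plus a separate port
print(build_server_url("http://pandaserverurl", 25080))
```

Searching the raw string for ':' cannot tell the scheme separator from the port separator, whereas `urlsplit` exposes them as distinct fields.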
3.2.1.1
- Patch for excessive log messages in looping job algorithm
- Reported by L. Bryant
3.2.0.28
- Removed curl config (used in updateJob operations) from list of pilot log files to avoid when locating latest updated log file (to tail in debug mode)
- Requested by R. Walker
- Added protection/handling for failed remaining disk space command
- Requested by R. Walker
- Looping job updates
- Only run looping job killer if enough time has passed since start
- I.e. if t_elapsed > t_looping_verification_time
- Looping job algorithm now only runs while payload is running
- Requested by R. Walker
- Fixed problem with looping jobs not finding any files on site using env var HOME beginning with /pilot (/pilotdir)
- Pilot ignores source files, located in directory ‘pilot’
- Reported by R. Taylor
- AES updates
- Added a stage-out failure counter for AES jobs
- Pilot now aborts event service process if there are 20 stage-out failures
- Now supporting rucio client option --rses (instead of deprecated --rse) for rucio download
- Fixed problem with unbound error variable (could happen in case of stage-out failures)
- Reported by F. Barreiro
- Now extracting critical failures from payload stdout and reporting this as ‘unknown transform failure’
- The reported error diagnostics will include the first line from the stdout containing ‘CRITICAL’
- Requested by R. Walker
- Prevented abortion of payload execution thread after receiving a server kill instruction
- This would lead to problems with executing a secondary job in multi-job mode (a required thread was missing, eventually leading to time-outs and lost heartbeat)
- Reported by F. Luehring, R. Walker
- Appending PQ.environ key pairs as exports in payload execution command
- E.g. PQ.environ='KEY1=VALUE1 KEY2=VALUE2 ..' adds 'export KEY1=VALUE1; export KEY2=VALUE2;' (etc) before payload command
- Requested by R. Walker
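The PQ.environ handling described in this release can be illustrated with a small sketch. It assumes that the key pairs are separated by whitespace and that values contain no spaces; the function name `environ_to_exports` is hypothetical and this is not the pilot's own implementation.

```python
def environ_to_exports(environ):
    """Turn 'KEY1=VALUE1 KEY2=VALUE2' into 'export KEY1=VALUE1; export KEY2=VALUE2;'."""
    # Keep only well-formed KEY=VALUE items; anything else is skipped
    pairs = [item for item in environ.split() if "=" in item]
    return " ".join(f"export {item};" for item in pairs)


exports = environ_to_exports("KEY1=VALUE1 KEY2=VALUE2")
# Prepend the exports to a (hypothetical) payload command
command = f"{exports} ./run_payload.sh".strip()
print(command)
```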
3.1.1.10
- Skipping sourcing of atlasLocalSetup (as well as executing lsetup emi) when setting up arcproxy
- Command should already be available in the environment since wrapper sets it up
- Pilot wrapper has also been updated since atlasLocalSetup was set up twice (now only sourced in local setup)
- Added iterative function to determine a good fitting range for payload memory usage fittings
- Suggested by M. Maeno
- Removed workDir from list of items to check in looping test
- Requested by R. Walker
- Bug fixes
- Improved identification of zero size output files (could previously have been reported as file size too large!)
- Fixed rare exceptions with reading /proc/%d/stat. Affected a handful of jobs
3.1.0.63
- Adler32 update
- Algorithm updated to be more robust and efficient with memory usage and memory IO
- Added new error code 1366: "Failure during checksum calculation" (from new exception handling)
- Time-out added to remote file open script
- Must now complete within allowed time (currently 120 * number of files + 120 s)
- Removed previous 60 s time-out in queue handling which was not thread safe
- Already using a root native 120 s time-out for each TFile.Open() operation, but it seems this does not always work
- Now possible to switch off workdir cleanup with server command 'nocleanup'
- Requested by R. Walker
- Near real-time logging
- Logstash testing in development
- Worker status
- Pilot is now sending info to the server about the worker status at the beginning and end of the pilot
- Requested by F. Barreiro
- Fixed an issue with lingering error info from a previous job in multi-job mode, leading to failure with reporting job updates
- Requested by R. Walker, F. Luehring
- Pilot now keeps track of elapsed time since last successful heartbeat
- In the case this time exceeds six hours, it means the pilot has not been able to contact the server which will have declared the job lost. Pilot then aborts abruptly
- Requested by F. Luehring
- Reported work dir size update
- Now ignoring input files that reside on non-scratch disk (i.e. on NDGF and storm sites) when calculating work dir size
- This value gets reported by scout jobs and can of course cause mischief on other sites if not calculated correctly (job running out of local space since the job should not have been sent there in the first place)
- Reported by H. Severini, R. Walker
- All pilot options now have accompanied descriptive alternatives (e.g. -d, --debug)
- Always add env vars (Frontier ID etc) to container setup
- Previously not added for user containers
- Requested by N. Ozturk, A. De Silva
- Internal improvements (incl. with pylint) towards native Python 3 compatibility (a lot of Python 2 specific code has been removed/replaced)
Contributions from M. Lassnig, P. Nilsson
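As an illustration of a memory-friendly Adler32 calculation of the kind mentioned in this release, the sketch below reads a file in fixed-size blocks and updates a running checksum with `zlib.adler32`. It is a generic example, not the pilot's updated algorithm; the function name `adler32_of_file`, the block size and the 8-digit hex output format are assumptions.

```python
import zlib


def adler32_of_file(path, blocksize=1024 * 1024):
    """Compute the Adler-32 checksum of a file without loading it all into memory."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(blocksize)
            if not chunk:
                break
            # Feed the running value back in so that chunks are chained
            value = zlib.adler32(chunk, value)
    # Format as an 8-digit, zero-padded lowercase hex string
    return f"{value & 0xFFFFFFFF:08x}"
```

Because only one block is held in memory at a time, memory use stays bounded regardless of file size. A caller could wrap the call in a try/except and map any exception to an error such as 1366, "Failure during checksum calculation".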