Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For #152 - remove signal-hook dependency #163

Merged
merged 3 commits into from
Apr 5, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 0 additions & 20 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 1 addition & 2 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,5 @@ edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
subprocess = "0.2"
libc = "0.2"
signal-hook = { version = "0.3", default-features = false, features = [] }
subprocess = "0.2"
49 changes: 49 additions & 0 deletions src/interrupt.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
use std::sync::atomic::{AtomicBool, Ordering};

// Signal handling logic.
//
// Assuming no bugs, the interesting interrupt signals are SIGHUP, SIGTERM, SIGINT, and SIGQUIT. Of
// these, only SIGHUP and SIGTERM are really interesting because they are sent by the OS or by job
// control (and will often be followed by SIGKILL if not honored within some reasonable time);
// INT/QUIT are sent by a user in response to keyboard action and more typical during
// development/debugging.
//
// Call handle_interruptions() to establish handlers, then is_interrupted() to check whether signals
// have been received.

static INTERRUPTED: AtomicBool = AtomicBool::new(false);

extern "C" fn sonar_signal_handler(_: libc::c_int) {
INTERRUPTED.store(true, Ordering::Relaxed);
}

pub fn handle_interruptions() {
unsafe {
let nomask : libc::sigset_t = std::mem::zeroed();
let mut action = libc::sigaction {
sa_sigaction: sonar_signal_handler as usize,
sa_mask: nomask,
sa_flags: 0,
sa_restorer: None,
};
libc::sigaction(libc::SIGTERM, &mut action, std::ptr::null_mut());
libc::sigaction(libc::SIGHUP, &mut action, std::ptr::null_mut());
}
}

#[cfg(debug_assertions)]
pub fn is_interrupted() -> bool {
if std::env::var("SONARTEST_WAIT_INTERRUPT").is_ok() {
std::thread::sleep(std::time::Duration::new(10, 0));
}
let flag = INTERRUPTED.load(Ordering::Relaxed);
if flag {
println!("Interrupt flag was set!")
}
flag
}

#[cfg(not(debug_assertions))]
pub fn is_interrupted() -> bool {
INTERRUPTED.load(Ordering::Relaxed)
}
1 change: 1 addition & 0 deletions src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ mod batchless;
mod command;
mod gpu;
mod hostname;
mod interrupt;
mod jobs;
mod log;
mod nvidia;
Expand Down
49 changes: 18 additions & 31 deletions src/ps.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@

use crate::amd;
use crate::hostname;
use crate::interrupt;
use crate::jobs;
use crate::log;
use crate::nvidia;
Expand All @@ -11,13 +12,8 @@ use crate::procfsapi;
use crate::util::{csv_quote,three_places};

use std::collections::{HashMap, HashSet};
use std::env;
use std::io::{self, Result, Write};
use std::path::PathBuf;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time;

// The GpuSet has three states:
//
Expand Down Expand Up @@ -177,26 +173,17 @@ pub fn create_snapshot(jobs: &mut dyn jobs::JobManager, opts: &PsOptions, timest
// perform the operation.
//
// However if a signal arrives in the middle of the operation and terminates the program the
// lock file may be left on disk. Assuming no bugs, the interesting signals are SIGHUP,
// SIGTERM, SIGINT, and SIGQUIT. Of these, only SIGHUP and SIGTERM are really interesting
// because they are sent by the OS or by job control (and will often be followed by SIGKILL if
// not honored within some reasonable time); INT/QUIT are sent by a user in response to keyboard
// action and more typical during development/debugging.
// lock file may be left on disk. Therefore some lightweight signal handling is desirable to
// trap signals and clean up orderly.
//
// So THE MAIN REASON TO HANDLE SIGNALS is to make sure we're not killed while we hold the lock
// file, during normal operation. To do that, we just need to ignore the signal.
// Additionally, if a signal is detected, we do not wish to start new operations, we can just
// skip them. Code therefore calls is_interrupted() at strategic points to check whether a
// signal was detected.
//
// But if a handled signal is received, it would be sensible to exit as quickly as possible and
// with minimal risk of hanging. And to do that, we record the signal and then check the flag
// often, and we avoid starting things if the flag is set, and we produce no output (if
// possible) once the flag becomes set, yet produce complete output if any output at all.
//
// There's no reason to limit the signal handler to the case when we have a lock file, the same
// logic can apply to both paths.
// Finally, there's no reason to limit the signal handler to the case when we have a lock file,
// the same logic can apply to both paths.

let interrupted = Arc::new(AtomicBool::new(false));
signal_hook::flag::register(signal_hook::consts::SIGTERM, Arc::clone(&interrupted)).unwrap();
signal_hook::flag::register(signal_hook::consts::SIGHUP, Arc::clone(&interrupted)).unwrap();
interrupt::handle_interruptions();

if let Some(ref dirname) = opts.lockdir {
let mut created = false;
Expand All @@ -208,7 +195,7 @@ pub fn create_snapshot(jobs: &mut dyn jobs::JobManager, opts: &PsOptions, timest
p.push(dirname);
p.push("sonar-lock.".to_string() + &hostname);

if interrupted.load(Ordering::Relaxed) {
if interrupt::is_interrupted() {
return;
}

Expand Down Expand Up @@ -237,13 +224,14 @@ pub fn create_snapshot(jobs: &mut dyn jobs::JobManager, opts: &PsOptions, timest
}

if !failed && !skip {
do_create_snapshot(jobs, opts, timestamp, &interrupted);
do_create_snapshot(jobs, opts, timestamp);

// Testing code: If we got the lockfile and produced a report, wait 10s after producing
// it while holding onto the lockfile. It is then possible to run sonar in that window
// while the lockfile is being held, to ensure the second process exits immediately.
#[cfg(debug_assertions)]
if std::env::var("SONARTEST_WAIT_LOCKFILE").is_ok() {
thread::sleep(time::Duration::new(10, 0));
std::thread::sleep(std::time::Duration::new(10, 0));
}
}

Expand All @@ -263,20 +251,19 @@ pub fn create_snapshot(jobs: &mut dyn jobs::JobManager, opts: &PsOptions, timest
log::error("Unable to properly manage or delete lockfile");
}
} else {
do_create_snapshot(jobs, opts, timestamp, &interrupted);
do_create_snapshot(jobs, opts, timestamp);
}
}

fn do_create_snapshot(
jobs: &mut dyn jobs::JobManager,
opts: &PsOptions,
timestamp: &str,
interrupted: &Arc<AtomicBool>,
) {
let no_gpus = empty_gpuset();
let mut proc_by_pid = ProcTable::new();

if interrupted.load(Ordering::Relaxed) {
if interrupt::is_interrupted() {
return;
}

Expand Down Expand Up @@ -331,7 +318,7 @@ fn do_create_snapshot(
); // gpu_mem_size_kib
}

if interrupted.load(Ordering::Relaxed) {
if interrupt::is_interrupted() {
return;
}

Expand Down Expand Up @@ -371,7 +358,7 @@ fn do_create_snapshot(
}
}

if interrupted.load(Ordering::Relaxed) {
if interrupt::is_interrupted() {
return;
}

Expand Down Expand Up @@ -416,7 +403,7 @@ fn do_create_snapshot(
}
}

if interrupted.load(Ordering::Relaxed) {
if interrupt::is_interrupted() {
return;
}

Expand Down
Loading