We aren't monitoring for ansible-runner failures #224

gandelman-a · 2017-04-20T19:17:13Z

Our deployment Ansible playbooks are run asynchronously via cron, and we are currently not monitoring them in any way. If one is failing, it usually goes unnoticed until one of us bothers to look. We should have ansible-runner tasks output something that indicates failure, and pick up on that with a Datadog monitor. This could be as simple as dropping a file for the failing environment somewhere on the filesystem, with a Datadog monitor that triggers when any such file exists.

This updates ansible-runner to drop empty files in a known directory for environments that are failing to complete their playbooks. The files are named after the failing environment. After a successful Ansible run, the environment's file is cleaned up if it exists. This will allow us to set a Datadog check that fails when any files exist in that directory. Our current Datadog check is broken: it relies on log scraping and doesn't really work with multi-env bastions.

Related-Issue: BonnyCI/projman#224
Signed-off-by: Adam Gandelman <adamg@ubuntu.com>
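The flag-file behaviour described in this commit message can be sketched roughly as follows. This is an illustration only, not the code from the PR: the directory path, the function name `record_result`, and its parameters are all assumptions, since the message says only that the files live in "a known directory".

```python
import os

# Hypothetical location; the commit message only says "a known directory".
FLAG_DIR = "/var/lib/ansible-runner/failed"


def record_result(env, succeeded, flag_dir=FLAG_DIR):
    """Create or remove the flag file for one environment.

    A failed run leaves an empty file named after the environment.
    A successful run removes that file if it is present.
    """
    os.makedirs(flag_dir, exist_ok=True)
    path = os.path.join(flag_dir, env)
    if succeeded:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # nothing to clean up; the previous run did not fail
    else:
        # Opening in append mode creates the file without truncating it.
        open(path, "a").close()
    return path
```

A wrapper around the playbook run would call `record_result(env, succeeded=(exit_code == 0))` once per environment. Because the flag is just the existence of a file, a monitor needs no log parsing and handles several environments on one bastion naturally.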

This adds a simple Datadog check that fails if it finds any flag files for failing ansible-runner environments. It depends on PR BonnyCI#363, but the check should still pass if this lands before that PR merges.

Related-Issue: BonnyCI/projman#224
Signed-off-by: Adam Gandelman <adamg@ubuntu.com>
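The core logic of such a check might look like the sketch below. It is not the check from the PR and deliberately leaves out the Datadog agent wiring. The function name, the status constants, and the choice to treat a missing directory as passing are assumptions; the last one is one reading of "should pass OK if it lands before that merges".

```python
import os

# Nagios-style status codes, used here only for illustration.
OK, CRITICAL = 0, 2


def check_flag_files(flag_dir):
    """Return (status, failing_envs) for a flag-file directory.

    Each file name in the directory is taken to be a failing environment.
    A missing directory is treated as OK, on the assumption that
    ansible-runner has not been updated to create it yet.
    """
    try:
        failing = sorted(os.listdir(flag_dir))
    except FileNotFoundError:
        return OK, []
    if failing:
        return CRITICAL, failing
    return OK, []
```

A real agent check would translate the returned status into a service check and could include the list of failing environments in the alert message.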

gandelman-a added help wanted Monitoring Operations labels Apr 20, 2017

gandelman-a self-assigned this Apr 25, 2017

gandelman-a removed the help wanted label Apr 25, 2017

gandelman-a mentioned this issue Apr 25, 2017

ansible-runner: Drop flag files for failing environments BonnyCI/hoist#363

Merged

gandelman-a mentioned this issue Apr 25, 2017

Adds dd-ansible-runner check and role BonnyCI/hoist#364

Open
