Add notification channel support in Slack #51

Merged
merged 6 commits into from
Mar 24, 2025

jonathanio
Contributor

Add support in Alertmanager for a new notifications channel alongside the existing infra and applications alert channels. Selected Alerts are now redirected there, such as Pods exceeding their memory or CPU requests, or other unusual or informational alerts that should probably be actioned but do not directly affect normal operational stability.

Checklist

Please check and confirm the following items have been performed, where
possible, for this Pull Request:

  • I have performed a self-review of my code and run any tests locally to check.
  • I have added tests that prove my changes are effective and work correctly.
  • I have made corresponding changes to the documentation as needed.
  • Each commit in this pull request, and the pull request itself, has a meaningful subject & body for context.
  • I have added type/..., changes/..., and release/... labels, as needed.

Refactor the templates for PagerDuty and Slack to combine common
elements into a single file, and update the Slack and PagerDuty
receivers to use less local templating and use more defined templates.

Add improved link generation for Alertmanager, including:

- Improve the filter generation for the Alertmanager link to ensure that
  silenced Alerts are included in the search;
- Create a new "all" silence link for Alertmanager which just uses the
  labels used for grouping as the filter, allowing easy access to
  silencing all related alerts without information about the metrics
  endpoint, Prometheus, and the alert configuration being included; and
- Update the normal silence link to focus on only adding specific
  labels, if they exist, to the filter, such as information about the
  Pod or container, Node, and severity, so as to make it less likely
  that the silence override will fail if there are changes to the
  metrics or Prometheus.
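
The link changes listed above could be sketched roughly as follows. This is a minimal illustration, not the templates from this pull request: the ConfigMap name, the file name, and every template name prefixed with `example.` are hypothetical. It assumes the Alertmanager web UI accepts the `silenced`, `inhibited`, `active`, and `filter` query parameters on `/#/alerts` and a `filter` parameter on `/#/silences/new`, and that the manifest is applied as-is rather than rendered through Helm, which would try to interpret the `{{ ... }}` actions itself.

```yaml
# Hypothetical ConfigMap holding an Alertmanager template file. All names are
# illustrative and are not taken from this pull request.
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-example-templates
data:
  links.tmpl: |
    {{/* URL-encoded {name="value",...} built from the grouping labels only. */}}
    {{ define "example.filter.group" -}}
    %7B
    {{- range $i, $pair := .GroupLabels.SortedPairs -}}
      {{- if $i }}%2C{{ end -}}
      {{ $pair.Name }}%3D%22{{ $pair.Value | urlquery }}%22
    {{- end -}}
    %7D
    {{- end }}

    {{/* Alert list link that also shows silenced alerts. */}}
    {{ define "example.link.alerts" -}}
    {{ .ExternalURL }}/#/alerts?silenced=true&inhibited=false&active=true&filter={{ template "example.filter.group" . }}
    {{- end }}

    {{/* "All" silence link: the grouping labels alone form the filter. */}}
    {{ define "example.link.silence_all" -}}
    {{ .ExternalURL }}/#/silences/new?filter={{ template "example.filter.group" . }}
    {{- end }}

    {{/* Normal silence link: add the pod label only when it exists. */}}
    {{ define "example.link.silence" -}}
    {{ .ExternalURL }}/#/silences/new?filter=%7Balertname%3D%22{{ .CommonLabels.alertname | urlquery }}%22
    {{- with .CommonLabels.pod }}%2Cpod%3D%22{{ . | urlquery }}%22{{ end -}}
    %7D
    {{- end }}
```

Keeping the filter to a small set of stable labels means a silence created from the link should still match after unrelated labels, such as the scrape endpoint or the Prometheus instance, change.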

Additionally, expand the possible links in the Slack message, using some
of these new links and improve the naming and messaging around them too.

Add support for a new notifications channel in the Slack provider
configuration for Alertmanager, which allows some Alerts to be sent
there where they do not need to trigger an action directly, but may mean
something needs adjusting.

This can include information about Pods exceeding their memory or CPU
requests: as they are not exceeding their limits, they won't be
throttled or terminated. However, typical usage is now above the
guaranteed amount, and above the amount considered by the scheduler if
the Pod is moved to another host, creating the potential for unstable
operation.
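
A routing setup along these lines might look like the fragment below. It is only a sketch under stated assumptions: the receiver names, the Slack channel names, the secret path, and the `channel` alert label used for matching are all hypothetical, and the repository may select alerts differently.

```yaml
# Sketch of an Alertmanager configuration fragment. Receiver names, channel
# names, the secret path, and the "channel" label are hypothetical.
global:
  slack_api_url_file: /etc/alertmanager/secrets/slack-api-url

route:
  # Anything not matched below falls through to the infra channel.
  receiver: slack-infra
  group_by: [alertname, namespace]
  routes:
    # Informational alerts that need no immediate action.
    - receiver: slack-notifications
      matchers:
        - 'channel = "notifications"'
    - receiver: slack-applications
      matchers:
        - 'channel = "applications"'

receivers:
  - name: slack-infra
    slack_configs:
      - channel: "#alerts-infra"
        send_resolved: true
  - name: slack-applications
    slack_configs:
      - channel: "#alerts-applications"
        send_resolved: true
  - name: slack-notifications
    slack_configs:
      - channel: "#alerts-notifications"
        send_resolved: true
```

With this shape, an alert rule opts in to the notifications channel by setting the assumed `channel: notifications` label, and no receiver-side change is needed per alert.
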

Fix the initial sum_over_time() time-range for the
PrometheusTSDBSeriesIncreaseNotice Alert, as this was set to three hours
when the alert should be monitoring over a one-hour window for three
hours.
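
The distinction between the two durations can be illustrated with a hypothetical rule. The alert name, expression, and threshold below are invented for illustration and are not the rule from this repository; only the `prometheus_tsdb_head_series` metric and the PromQL functions are standard.

```yaml
# Hypothetical Prometheus rule showing a one-hour evaluation window combined
# with a three-hour "for" duration. Not the rule changed in this pull request.
groups:
  - name: example-prometheus-tsdb
    rules:
      - alert: ExampleSeriesIncreaseNotice
        # The [1h:5m] subquery is the window the expression looks back over:
        # it sums the five-minute changes in head series across one hour.
        # The threshold of 10000 is made up.
        expr: sum_over_time(delta(prometheus_tsdb_head_series[5m])[1h:5m]) > 10000
        # "for" is separate: the expression must stay true for three hours
        # of consecutive evaluations before the alert fires.
        for: 3h
        labels:
          severity: info
```

Setting the range selector itself to three hours would measure growth over a much longer window on every evaluation, which is a different condition from "one hour of growth, sustained for three hours".
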

Update some Prometheus and ElasticSearch alerts to use the new
notifications channel in Slack to report on things which are not
directly an Alert, but are useful information or signs that something
may need to be fixed soon.

Fix the accidental setting of duplicate keys for cainjector in the
cert-manager values file.
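
For context on why duplicate keys matter, here is a small sketch of a values fragment with a single `cainjector` block. The specific settings are hypothetical and not taken from this repository.

```yaml
# Illustrative cert-manager Helm values; the settings shown are hypothetical.
# A YAML mapping should define each key once. If "cainjector" appears twice at
# the same level, many parsers keep only the last block and silently drop the
# first, while stricter parsers reject the file, so all settings belong in
# one block like this.
cainjector:
  replicaCount: 2
  resources:
    requests:
      cpu: 10m
      memory: 64Mi
```
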

Fix the PodNotReady check to exclude the Succeeded phase for the Pods as
this is a valid and healthy state and does not need to be alerted on.
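
One way such an exclusion could be expressed is sketched below. The expression, duration, and severity are assumptions rather than the rule from this repository; it relies on the kube-state-metrics series `kube_pod_status_ready` and `kube_pod_status_phase`.

```yaml
# Hypothetical PodNotReady rule; the expression is a sketch, not the one
# changed in this pull request.
groups:
  - name: example-pods
    rules:
      - alert: PodNotReady
        # Pods reporting Ready=false, minus any Pod in the Succeeded phase.
        # Completed Pods are never Ready, so without the "unless" clause
        # they would alert indefinitely.
        expr: |
          kube_pod_status_ready{condition="false"} == 1
          unless on (namespace, pod)
          kube_pod_status_phase{phase="Succeeded"} == 1
        for: 15m
        labels:
          severity: warning
```
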
@jonathanio added labels on Mar 24, 2025:

  • priority/normal: This is a normal-priority issue or pull request
  • release/feature: A new feature is added with this pull request
  • type/refactoring: A refactoring of existing code
  • update/flux: Update with improvements to the Flux Kustomizations

@jonathanio self-assigned this on Mar 24, 2025
@jonathanio merged commit 8ff90fa into main on Mar 24, 2025
5 checks passed
@jonathanio deleted the add-notification-channel-support branch on March 24, 2025 20:21