switch overhead alert panels to filtered metric
Tracking of our Grafana alert panels uncovered a form of "unfair inflation" of our overhead percentages, given that
- the Tekton timestamps are rounded to seconds
- certain user-defined pipelines can succeed very quickly, in a matter of seconds, which is much faster than RHTAP pipelines

So we built a single metric for each of our two overheads. Each combines the algorithms from the multiple metrics we were using before, but
also ignores pipelines that end very quickly.
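
A minimal sketch of the inflation and of the filtering idea, not the actual metric code. The function names, the 30 second cutoff, and treating the "percentage" as a plain fraction are all assumptions made for illustration; the commit does not state the real cutoff.

```python
# Hypothetical cutoff: the commit does not say how short a run must be to be ignored.
MIN_DURATION_SECONDS = 30


def to_whole_seconds(milliseconds: int) -> int:
    """Round half up to whole seconds, mimicking second-granularity timestamps."""
    return (milliseconds + 500) // 1000


def overhead_ratio(overhead_ms: int, duration_ms: int) -> float:
    """Overhead as a fraction of run duration, computed from rounded values."""
    duration_s = to_whole_seconds(duration_ms)
    if duration_s == 0:
        return 0.0
    return to_whole_seconds(overhead_ms) / duration_s


def filtered_mean_overhead(runs, min_duration_s=MIN_DURATION_SECONDS):
    """Mean per-run overhead ratio, skipping runs shorter than the cutoff."""
    kept = [
        overhead_ratio(overhead_ms, duration_ms)
        for overhead_ms, duration_ms in runs
        if to_whole_seconds(duration_ms) >= min_duration_s
    ]
    return sum(kept) / len(kept) if kept else 0.0


# (overhead_ms, duration_ms): one very fast run and one longer run.
runs = [(600, 2_400), (3_000, 300_000)]

# The fast run really spent 25% of its time in overhead, but rounding
# 600 ms up to 1 s and 2400 ms down to 2 s reports it as 0.5.
print(overhead_ratio(600, 2_400))
# With the fast run ignored, only the longer run contributes.
print(filtered_mean_overhead(runs))
```

Here the rounded overhead of the fast run is double its true share of the run, which is the kind of sample the filtered metric leaves out.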

After monitoring for a few weeks, the results appear favorable, with these specifics:
- if the super fast pipelines had 0 overhead (i.e. 0 to 499 milliseconds), then when our overhead over time was small with the original metrics that
included every pipelinerun, our filtered metric was slightly higher because it did not benefit from the averages being lowered by samples of 0
- if the super fast pipelines had any overhead (i.e. 500 milliseconds or greater), then when we saw larger-than-typical overheads with the original metrics
that included every pipelinerun, our filtered metrics produced lower overhead results
- and generally speaking, the improvements with the new metric were proportionally larger than any degradations with the new metric
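
The first two observations can be reproduced with a toy comparison. This is a sketch under stated assumptions: the original panels are modeled as a ratio of means over every run (as in the removed expressions), the filtered metric as a mean of per-run ratios over the runs that pass a cutoff, and the function names, sample values, and 30 second cutoff are hypothetical.

```python
from statistics import mean

# Hypothetical cutoff: the commit does not state the real value.
MIN_DURATION_SECONDS = 30


def original_overhead(runs):
    """Mean overhead divided by mean duration, over every run."""
    return mean([o for o, _ in runs]) / mean([d for _, d in runs])


def filtered_overhead(runs, min_duration_s=MIN_DURATION_SECONDS):
    """Mean of per-run overhead ratios, ignoring runs below the cutoff."""
    kept = [o / d for o, d in runs if d >= min_duration_s]
    return mean(kept) if kept else 0.0


# (overhead_seconds, duration_seconds), already rounded to whole seconds.
slow_run = (2, 200)
fast_run_zero_overhead = (0, 2)   # sub-500 ms overhead rounded down to 0
fast_run_some_overhead = (1, 2)   # 500 ms or more rounded up to 1 s

quiet = [slow_run, fast_run_zero_overhead]
noisy = [slow_run, fast_run_some_overhead]

# Fast runs with 0 overhead pull the original value down, so the
# filtered value comes out slightly higher.
print(original_overhead(quiet), filtered_overhead(quiet))
# Fast runs with rounded-up overhead push the original value up, so the
# filtered value comes out lower.
print(original_overhead(noisy), filtered_overhead(noisy))
```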
gabemontero authored and Roming22 committed Nov 29, 2023
1 parent c43c6e7 commit 4ea3e61
Showing 1 changed file with 2 additions and 2 deletions.
@@ -146,7 +146,7 @@
{
"editorMode": "code",
"exemplar": true,
-"expr": "(sum(increase(pipelinerun_duration_scheduled_seconds_sum{status='succeded'}[30m])) / sum(increase(pipelinerun_duration_scheduled_seconds_count{status='succeded'}[30m]))) / (sum(increase(tekton_pipelines_controller_pipelinerun_duration_seconds_sum{status='success'}[30m])) / sum(increase(tekton_pipelines_controller_pipelinerun_duration_seconds_count{status='success'}[30m])))",
+"expr": "sum(increase(pipeline_service_schedule_overhead_percentage_sum{status='succeded'}[30m])) / sum(increase(pipeline_service_schedule_overhead_percentage_count{status='succeded'}[30m]))",
"format": "table",
"hide": false,
"instant": false,
@@ -212,7 +212,7 @@
"targets": [
{
"editorMode": "code",
-"expr": "(sum(increase(pipelinerun_gap_between_taskruns_milliseconds_sum{status='succeded'}[30m])/1000) / sum(increase(pipelinerun_gap_between_taskruns_milliseconds_count{status='succeded'}[30m]))) / (sum(increase(tekton_pipelines_controller_pipelinerun_duration_seconds_sum{status='success'}[30m])) / sum(increase(tekton_pipelines_controller_pipelinerun_duration_seconds_count{status='success'}[30m])))",
+"expr": "sum(increase(pipeline_service_execution_overhead_percentage_sum{status='succeded'}[30m])) / sum(increase(pipeline_service_execution_overhead_percentage_count{status='succeded'}[30m]))",
"legendFormat": "__auto",
"range": true,
"refId": "A"