fixes for policystatemetrics #2285
policystatemetrics needs a reference to the sensor manager so that it can collect metrics. Currently, this reference is obtained by calling observer.GetSensorManager() once, at initialization time. The observer tests do not restart the metrics (see [1]), so when a test creates a new observer, the metrics still reference the old sensor manager. Fix this by having policystatemetrics call observer.GetSensorManager() itself, so that it always gets the latest sensor manager.

[1] https://github.com/cilium/tetragon/blob/22eb995b19207ac0ced2dd83950ec8e8aedd122d/pkg/observer/observertesthelper/observer_test_helper.go#L272-L276

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
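The stale-reference problem described above can be illustrated with a minimal sketch. All names here (`manager`, `setManager`, `getManager`, `staleCollector`, `freshCollector`) are hypothetical stand-ins and are not Tetragon's actual types; the sketch only contrasts capturing a global once with looking it up on every collection.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for the sensor manager.
type manager struct{ policies int }

var (
	mu      sync.Mutex
	current *manager
)

// setManager replaces the global manager, as a test creating a new observer might.
func setManager(m *manager) {
	mu.Lock()
	defer mu.Unlock()
	current = m
}

// getManager returns whichever manager is currently installed.
func getManager() *manager {
	mu.Lock()
	defer mu.Unlock()
	return current
}

// staleCollector captures the manager once, at construction time.
type staleCollector struct{ m *manager }

func newStaleCollector() *staleCollector { return &staleCollector{m: getManager()} }

func (c *staleCollector) collect() int { return c.m.policies }

// freshCollector looks the manager up on every collection.
type freshCollector struct{}

func (freshCollector) collect() int {
	m := getManager()
	if m == nil {
		return 0 // no manager installed yet
	}
	return m.policies
}

func main() {
	setManager(&manager{policies: 1})
	stale := newStaleCollector()
	fresh := freshCollector{}

	// A new manager replaces the old one; only the fresh collector notices.
	setManager(&manager{policies: 3})
	fmt.Println(stale.collect(), fresh.collect()) // prints: 1 3
}
```

The lookup-per-collection variant costs one extra call each time, but it cannot outlive the object it reports on.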
We should also do the same in the other operations, but we leave that as a followup. Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
This patch adds a timeout for ListTracingPolicies. It can be the case that the sensor manager is stuck or misbehaving. This patch (combined with the previous one) ensures that metrics will continue after a timeout.

Tested manually using:

```diff
diff --git a/pkg/metrics/policystatemetrics/policystatemetrics_test.go b/pkg/metrics/policystatemetrics/policystatemetrics_test.go
index 227306b65..fd581392b 100644
--- a/pkg/metrics/policystatemetrics/policystatemetrics_test.go
+++ b/pkg/metrics/policystatemetrics/policystatemetrics_test.go
@@ -9,6 +9,7 @@ import (
 	"io"
 	"strings"
 	"testing"
+	"time"
 
 	"github.com/cilium/tetragon/pkg/observer"
 	tus "github.com/cilium/tetragon/pkg/testutils/sensors"
@@ -57,3 +58,22 @@ tetragon_tracingpolicy_loaded{state="load_error"} %d
 	err = testutil.CollectAndCompare(collector, expectedMetrics(1, 0, 0, 0))
 	assert.NoError(t, err)
 }
+
+func TestTimeout(t *testing.T) {
+	reg := prometheus.NewRegistry()
+
+	manager := tus.GetTestSensorManager(context.TODO(), t).Manager
+	observer.SetSensorManager(manager)
+	t.Cleanup(observer.ResetSensorManager)
+
+	collector := newPolicyStateCollector()
+	reg.Register(collector)
+
+	go func() {
+		err := manager.SleepForTesting(context.TODO(), t, 1*time.Second)
+		assert.NoError(t, err)
+	}()
+
+	err := testutil.CollectAndCompare(collector, strings.NewReader(""))
+	assert.NoError(t, err)
+}
diff --git a/pkg/sensors/manager.go b/pkg/sensors/manager.go
index eaf908340..291a58c8f 100644
--- a/pkg/sensors/manager.go
+++ b/pkg/sensors/manager.go
@@ -8,6 +8,8 @@ import (
 	"errors"
 	"fmt"
 	"strings"
+	"testing"
+	"time"
 
 	"github.com/cilium/tetragon/api/v1/tetragon"
 	"github.com/cilium/tetragon/pkg/k8s/apis/cilium.io/v1alpha1"
@@ -96,6 +98,13 @@ func startSensorManager(
 				logger.GetLogger().Debugf("stopping sensor controller...")
 				done = true
 				err = nil
+
+			// NB(kkourt): for testing
+			case *sensorManagerSleep:
+				time.Sleep(op.d)
+				err = nil
+
 			default:
 				err = fmt.Errorf("unknown sensorOp: %v", op)
 			}
@@ -421,6 +430,13 @@ type sensorCtlStop struct {
 	retChan chan error
 }
 
+// sensorManagerSleep just sleeps. Intended only for testing.
+type sensorManagerSleep struct {
+	ctx     context.Context
+	retChan chan error
+	d       time.Duration
+}
+
 type LoadArg struct{}
 type UnloadArg = LoadArg
 
@@ -436,5 +452,18 @@ func (s *sensorEnable) sensorOpDone(e error) { s.retChan <- e }
 func (s *sensorDisable) sensorOpDone(e error) { s.retChan <- e }
 func (s *sensorList) sensorOpDone(e error) { s.retChan <- e }
 func (s *sensorCtlStop) sensorOpDone(e error) { s.retChan <- e }
+func (s *sensorManagerSleep) sensorOpDone(e error) { s.retChan <- e }
 
 type sensorCtlHandle = chan<- sensorOp
+
+func (h *Manager) SleepForTesting(ctx context.Context, t *testing.T, d time.Duration) error {
+	retc := make(chan error)
+	op := &sensorManagerSleep{
+		ctx:     ctx,
+		retChan: retc,
+		d:       d,
+	}
+
+	h.sensorCtl <- op
+	return <-retc
+}
```

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
5dc1065 to 5ce0f71
I have #2284 that reliably hangs thanks to this test failing on arm64: https://github.com/cilium/tetragon/actions/runs/8523996161/job/23347576591?pr=2284. I could try to rebase on top of this and see if it fixes my issue! Let's see if this works for my issue #2286.
That's amazing, it actually fixes #2210. Let's merge this!