e2e/features/tracing: reduce flakiness of entire suite, capture technical debt #10409
Visit the preview URL for this PR (updated for commit 620efd2): https://gloo-edge--pr10409-sh-span-name-flake-0wyedn1a.web.app (expires Wed, 04 Dec 2024 01:29:01 GMT), via Firebase Hosting GitHub Action. Sign: 77c2b86e287749579b7ff9cadb81e099042ef677
cc @danehans @ryanrolds: the tcproute tests (https://github.com/solo-io/gloo/blob/main/test/kubernetes/e2e/features/services/tcproute/suite.go) are very flaky as well, and they are configured in a similar way to the tracing tests. I wonder whether they are flaky for the same reason that I address in this PR.
Thanks for having good comments on the test yaml!
Description
This PR makes a few changes to our tracing tests; each is outlined below.
Context
Incorrect Status Bug
We uncovered a bug in the Control Plane that leads to a degraded experience for users (ref: kgateway-dev#10293). The bug has always existed; it is not a regression from the recent work. However, by adding the e2e tests (which is good), we have introduced that degraded experience into our CI pipeline.
Ordinarily I would advocate for fixing the underlying bug instead of massaging the tests to hide it. In this case, the fix will take more than a day, so we are prioritizing that work separately and hiding the flakiness in the test in the meantime.
DNS Cache Bug
I hit another issue when this test ran. Logs: https://github.com/solo-io/gloo/actions/runs/12039520748/job/33567498654
I do not know the root cause for certain. I made a best guess about what is happening, worked around it as well as I could, and left a large comment in the code explaining why.
README update
As part of this work, I wanted to test my changes locally using a previously released version. Our debugging guide did not demonstrate how to do this, so I updated it.
Interesting decisions
Testing steps
Follow the e2e debugging guide that I updated. Here are the steps I took:
Invoke the span tests once to show they pass
Results:
Invoke the span tests multiple times to show they don't flake
Notes for reviewers
Checklist: