Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e2e/features/tracing: reduce flakiness of entire suite, capture technical debt #10409

Merged
merged 9 commits into from
Nov 27, 2024

Conversation

sam-heilbron
Copy link

@sam-heilbron sam-heilbron commented Nov 26, 2024

Description

There are a few changes to our tracing tests, each are outlined below.

Context

Incorrect Status Bug

We uncovered a bug in the Control Plane that leads to a degraded experience for users (ref: kgateway-dev#10293). This has always existed and was not a regression due to the recent work. However, by adding the e2e tests (good) we have introduced that degraded experience into our CI pipeline.
Ordinarily I would advocate for just fixing the underlying bug instead of massaging the tests to hide the bug. In this case, the fix for the bug is >1 day, and so we are prioritizing that work, and hiding the flakiness in the test.

DNS Cache Bug

I hit another issue when this test ran. Logs: https://github.com/solo-io/gloo/actions/runs/12039520748/job/33567498654

stdout:

stderr:*   Trying 10.96.181.139:18080...
* Connection timed out after 3000 milliseconds
* Closing connection 0
* Hostname gateway-proxy-tracing.gloo-gateway-edge-test.svc.cluster.local was found in DNS cache
*   Trying 10.96.181.139:18080...
* Connection timed out after 3001 milliseconds
* Closing connection 1
* Hostname gateway-proxy-tracing.gloo-gateway-edge-test.svc.cluster.local was found in DNS cache
*   Trying 10.96.181.139:18080...
* Connection timed out after 3001 milliseconds
* Closing connection 2
command terminated with exit code 28

I made an assumption/guess about what is happening here, and made a best effort at working around it, and leaving a big comment explaining why.

README update

As part of this work, I wanted to test my changes locally using a previously released version. Our debugging guide did not demonstrate how to do this, so I updated it.

Interesting decisions

Testing steps

Follow along the e2e debugging guide that I updated. Here are the steps I took

kind create cluster
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml

Invoke the span tests once to show they pass

RELEASED_VERSION=1.18.0-rc2 CLUSTER_NAME=kind go test -v -timeout 600s ./test/kubernetes/e2e/tests -run ^TestGlooGatewayEdgeGateway$/^Tracing$

results:

 Gloo was successfully uninstalled.
--- PASS: TestGlooGatewayEdgeGateway (180.42s)
    --- PASS: TestGlooGatewayEdgeGateway/Tracing (100.39s)
        --- PASS: TestGlooGatewayEdgeGateway/Tracing/TestSpanNameTransformationsWithRouteDecorator (17.39s)
        --- PASS: TestGlooGatewayEdgeGateway/Tracing/TestSpanNameTransformationsWithoutRouteDecorator (31.09s)

invoke the span tests multiple times to show they don't flake

 RELEASED_VERSION=1.18.0-rc2 CLUSTER_NAME=kind go test -v -count=10 -timeout 600s ./test/kubernetes/e2e/tests -run ^TestGlooGatewayEdgeGateway$/^Tracing$

Notes for reviewers

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

@solo-changelog-bot
Copy link

Issues linked to changelog:
#10365
kgateway-dev#10327
kgateway-dev#10293

Copy link

github-actions bot commented Nov 26, 2024

Visit the preview URL for this PR (updated for commit 620efd2):

https://gloo-edge--pr10409-sh-span-name-flake-0wyedn1a.web.app

(expires Wed, 04 Dec 2024 01:29:01 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 77c2b86e287749579b7ff9cadb81e099042ef677

@sam-heilbron
Copy link
Author

cc @danehans @ryanrolds the tcproute tests (https://github.com/solo-io/gloo/blob/main/test/kubernetes/e2e/features/services/tcproute/suite.go) are very flakey as well, and they are configured in a similar way to the tracing tests. I wonder if they're flakey for the same reason that I address in this PR

Copy link
Collaborator

@nfuden nfuden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for having good comments on the test yaml!

@soloio-bulldozer soloio-bulldozer bot merged commit 5b41985 into main Nov 27, 2024
19 checks passed
@soloio-bulldozer soloio-bulldozer bot deleted the sh/span-name-flake branch November 27, 2024 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants