Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestGlooGatewayEdgeGateway/Tracing/TestSpanNameTransformationsWithoutRouteDecorator #10327

Closed
ashishb-solo opened this issue Nov 12, 2024 · 6 comments
Assignees
Labels
Prioritized Indicating issue prioritized to be worked on in RFE stream release/1.18 Type: Bug Something isn't working Type: CI Test Flake A non-deterministic test that slows down development

Comments

@ashishb-solo
Copy link
Contributor

ashishb-solo commented Nov 12, 2024

Which tests failed?

Which tests failed?

Recently added test flaked on its first few nightly runs

https://github.com/solo-io/gloo/actions/runs/11773028593/job/32789301648

The test seems to flake on TestSpanNameTransformationsWithoutRouteDecorator only. When we call AssertEventuallyConsistentCurlResponse, everything runs as expected until we reach this assertion block - then the tests seem to timeout. My suspicion is that the connection timeout is too aggressive (3 seconds).

I also added this block to pause the test when the assertion failed:

diff --git a/test/kubernetes/testutils/assertions/curl.go b/test/kubernetes/testutils/assertions/curl.go
index ffb3f0362c..98aba0ad82 100644
--- a/test/kubernetes/testutils/assertions/curl.go
+++ b/test/kubernetes/testutils/assertions/curl.go
@@ -137,7 +137,11 @@ func (p *Provider) AssertEventuallyConsistentCurlResponse(
                WithTimeout(pollTimeout).
                WithPolling(pollInterval).
                WithContext(ctx).
-               Should(Succeed())
+               Should(Succeed(), func() string {
+                       fmt.Printf("test failed, do your diagnostics\n")
+                       time.Sleep(5 * time.Hour)
+                       return ""
+       })
 }
 
 // AssertEventualCurlError asserts that the response from a curl command is an error such as `Failed to connect`

And ran my own curl requests against the server:

kubectl exec --container=curl curl -n curl -- curl --connect-timeout 1 --max-time 5 -v -s --connect-timeout 3 --max-time 3 -H Host:\ test-really-cool-hostname.com http://gateway-proxy-tracing.gloo-gateway-edge-test.svc.cluster.local:18080/path/without/route/descriptor

And those requests completed fine, so I'm pretty sure that the cluster is in a good state, and that the timeout is transient.

Initial Investigation

2024-11-11T06:29:22.5102809Z                                  Dload  Upload   Total   Spent    Left  Speed
2024-11-11T06:29:22.5103101Z 
2024-11-11T06:29:22.5103476Z   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.96.15.122:18080...
2024-11-11T06:29:22.5104360Z * Connected to gateway-proxy-tracing.gloo-gateway-edge-test.svc.cluster.local (10.96.15.122) port 18080 (#0)
2024-11-11T06:29:22.5105063Z > GET /path/without/route/descriptor HTTP/1.1
2024-11-11T06:29:22.5105478Z > Host: test-really-cool-hostname.com
2024-11-11T06:29:22.5105856Z > User-Agent: curl/7.83.1-DEV
2024-11-11T06:29:22.5106130Z > Accept: */*
2024-11-11T06:29:22.5106339Z > 
2024-11-11T06:29:22.5106577Z * Mark bundle as not supporting multiuse
2024-11-11T06:29:22.5106905Z < HTTP/1.1 200 OK
2024-11-11T06:29:22.5107176Z < content-type: text/plain
2024-11-11T06:29:22.5107481Z < date: Mon, 11 Nov 2024 06:29:22 GMT
2024-11-11T06:29:22.5107821Z < content-length: 369
2024-11-11T06:29:22.5108138Z < x-envoy-upstream-service-time: 0
2024-11-11T06:29:22.5108445Z < server: envoy
2024-11-11T06:29:22.5108661Z < 
2024-11-11T06:29:22.5108860Z { [369 bytes data]
2024-11-11T06:29:22.5109005Z 
2024-11-11T06:29:22.5109290Z 100   369  100   369    0     0   111k      0 --:--:-- --:--:-- --:--:--  120k
2024-11-11T06:29:22.5110091Z * Connection #0 to host gateway-proxy-tracing.gloo-gateway-edge-test.svc.cluster.local left intact
2024-11-11T06:29:22.5110612Z 
2024-11-11T06:29:22.5110617Z 
2024-11-11T06:29:22.5110964Z success: &{StdErr:  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2024-11-11T06:29:22.5111928Z                                  Dload  Upload   Total   Spent    Left  Speed
2024-11-11T06:29:22.5112228Z 
2024-11-11T06:29:22.5112583Z   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.96.15.122:18080...
2024-11-11T06:29:22.5113452Z * Connected to gateway-proxy-tracing.gloo-gateway-edge-test.svc.cluster.local (10.96.15.122) port 18080 (#0)
2024-11-11T06:29:22.5114148Z > GET /path/without/route/descriptor HTTP/1.1
2024-11-11T06:29:22.5114556Z > Host: test-really-cool-hostname.com
2024-11-11T06:29:22.5114916Z > User-Agent: curl/7.83.1-DEV
2024-11-11T06:29:22.5115192Z > Accept: */*
2024-11-11T06:29:22.5115393Z > 
2024-11-11T06:29:22.5115629Z * Mark bundle as not supporting multiuse
2024-11-11T06:29:22.5115957Z < HTTP/1.1 200 OK
2024-11-11T06:29:22.5116236Z < content-type: text/plain
2024-11-11T06:29:22.5116523Z < date: Mon, 11 Nov 2024 06:29:22 GMT
2024-11-11T06:29:22.5116859Z < content-length: 369
2024-11-11T06:29:22.5118003Z < x-envoy-upstream-service-time: 0
2024-11-11T06:29:22.5118469Z < server: envoy
2024-11-11T06:29:22.5118691Z < 
2024-11-11T06:29:22.5119119Z { [369 bytes data]
2024-11-11T06:29:22.5119284Z 
2024-11-11T06:29:22.5119596Z 100   369  100   369    0     0   111k      0 --:--:-- --:--:-- --:--:--  120k
2024-11-11T06:29:22.5120402Z * Connection #0 to host gateway-proxy-tracing.gloo-gateway-edge-test.svc.cluster.local left intact
2024-11-11T06:29:22.5121088Z  StdOut:Request served by http-echo
2024-11-11T06:29:22.5121313Z 
2024-11-11T06:29:22.5121452Z GET /path/without/route/descriptor HTTP/1.1
2024-11-11T06:29:22.5121699Z 
2024-11-11T06:29:22.5121862Z Host: test-really-cool-hostname.com
2024-11-11T06:29:22.5122173Z Accept: */*
2024-11-11T06:29:22.5122392Z Test_header: test_value
2024-11-11T06:29:22.5122857Z Traceparent: 00-b8896251419312733d177ad885bd5ed0-51b7a2aa21af12a0-01
2024-11-11T06:29:22.5123293Z Tracestate: 
2024-11-11T06:29:22.5123555Z User-Agent: curl/7.83.1-DEV
2024-11-11T06:29:22.5123900Z X-Envoy-Expected-Rq-Timeout-Ms: 15000
2024-11-11T06:29:22.5124426Z X-Forwarded-Proto: http
2024-11-11T06:29:22.5124818Z X-Request-Id: f6cdec11-f033-9c7f-a703-f2afb2a2adf6
2024-11-11T06:29:26.6147772Z }want:
2024-11-11T06:29:26.6148667Z HttpResponse{StatusCode: 200, Body: , Headers: map[], Custom: <nil>}
2024-11-11T06:29:26.6149610Z stdout:
2024-11-11T06:29:26.6149874Z 
2024-11-11T06:29:26.6150634Z stderr:  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2024-11-11T06:29:26.6152615Z                                  Dload  Upload   Total   Spent    Left  Speed
2024-11-11T06:29:26.6153304Z 
2024-11-11T06:29:26.6154331Z   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.96.217.196:18080...
2024-11-11T06:29:26.6154971Z 
2024-11-11T06:29:26.6155483Z   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
2024-11-11T06:29:26.6156511Z   0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
2024-11-11T06:29:26.6158089Z   0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0* Connection timed out after 3000 milliseconds
2024-11-11T06:29:26.6158815Z 
2024-11-11T06:29:26.6159293Z   0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--     0
2024-11-11T06:29:26.6159987Z * Closing connection 0
2024-11-11T06:29:26.6160572Z curl: (28) Connection timed out after 3000 milliseconds
2024-11-11T06:29:26.6161265Z command terminated with exit code 28
2024-11-11T06:29:26.6161660Z 
2024-11-11T06:29:26.6161668Z 
2024-11-11T06:29:26.6161896Z === NAME  TestGlooGatewayEdgeGateway
2024-11-11T06:29:26.6164129Z     curl.go:140: 
2024-11-11T06:29:26.6164715Z         Failed after 4.222s.
2024-11-11T06:29:26.6166434Z         The function passed to Consistently failed at /home/runner/work/gloo/gloo/test/kubernetes/testutils/assertions/curl.go:131 with:
2024-11-11T06:29:26.6167837Z         Unexpected error:
2024-11-11T06:29:26.6168528Z             <*errors.withStack | 0xc00385eba0>: 
2024-11-11T06:29:26.6169218Z             exit status 28
2024-11-11T06:29:26.6169683Z             {
2024-11-11T06:29:26.6170406Z                 error: <*exec.ExitError | 0xc0037d23c0>{
2024-11-11T06:29:26.6171220Z                     ProcessState: {
2024-11-11T06:29:26.6171938Z                         pid: 75958,
2024-11-11T06:29:26.6172672Z                         status: 7168,
2024-11-11T06:29:26.6173361Z                         rusage: {
2024-11-11T06:29:26.6174337Z                             Utime: {Sec: 0, Usec: 50494},
2024-11-11T06:29:26.6175402Z                             Stime: {Sec: 0, Usec: 22091},
2024-11-11T06:29:26.6176303Z                             Maxrss: 555648,
2024-11-11T06:29:26.6177060Z                             Ixrss: 0,
2024-11-11T06:29:26.6177788Z                             Idrss: 0,
2024-11-11T06:29:26.6178503Z                             Isrss: 0,
2024-11-11T06:29:26.6179665Z                             Minflt: 2228,
2024-11-11T06:29:26.6180414Z                             Majflt: 0,
2024-11-11T06:29:26.6181171Z                             Nswap: 0,
2024-11-11T06:29:26.6182015Z                             Inblock: 0,
2024-11-11T06:29:26.6183257Z                             Oublock: 0,
2024-11-11T06:29:26.6184130Z                             Msgsnd: 0,
2024-11-11T06:29:26.6184946Z                             Msgrcv: 0,
2024-11-11T06:29:26.6185733Z                             Nsignals: 0,
2024-11-11T06:29:26.6186636Z                             Nvcsw: 336,
2024-11-11T06:29:26.6222037Z                             Nivcsw: 38,
2024-11-11T06:29:26.6222535Z                         },
2024-11-11T06:29:26.6222916Z                     },
2024-11-11T06:29:26.6223259Z                     Stderr: nil,
2024-11-11T06:29:26.6223582Z                 },
2024-11-11T06:29:26.6224931Z                 stack: [0x36037ae, 0x481eb63, 0x481e9ca, 0x481e5bc, 0x6cce945, 0x243795d, 0x2436936, 0x3612dea, 0x3614b6c, 0x36102c8, 0x6cce74b, 0x7407499, 0x243795d, 0x2436936, 0x35e6c72, 0x24d515f, 0x23615c1],
2024-11-11T06:29:26.6226118Z             }
2024-11-11T06:29:26.6226358Z         occurred
2024-11-11T06:29:26.6698878Z configmap "otel-agent-conf" deleted
2024-11-11T06:29:26.6713905Z daemonset.apps "otel-agent" deleted
2024-11-11T06:29:26.6849679Z configmap "otel-collector-conf" deleted
2024-11-11T06:29:26.6913225Z pod "otel-collector" deleted
2024-11-11T06:29:26.7384478Z service "otel-collector" deleted
2024-11-11T06:29:27.7369029Z upstream.gloo.solo.io "echo-server" deleted
2024-11-11T06:29:27.7394875Z upstream.gloo.solo.io "opentelemetry-collector" deleted
2024-11-11T06:29:27.7454453Z virtualservice.gateway.solo.io "default" deleted
2024-11-11T06:29:27.7488090Z gateway.gateway.solo.io "gateway-proxy-tracing" deleted
2024-11-11T06:29:27.7644445Z service "gateway-proxy-tracing" deleted
2024-11-11T06:29:27.7752077Z === NAME  TestGlooGatewayEdgeGateway/Tracing/TestSpanNameTransformationsWithoutRouteDecorator

Additional Information

No response

@sheidkamp
Copy link
Contributor

@sheidkamp
Copy link
Contributor

@sam-heilbron
Copy link
Contributor

PR: solo-io#10406

@sam-heilbron
Copy link
Contributor

Duplicate issue: solo-io#10365

@sam-heilbron
Copy link
Contributor

With the merging of solo-io#10409, I expect this flake to not longer occur on the main (1.18) branch. There is still an underlying bug that is referenced in the code: #10293. I am marking this issue as closed as I expect the flake to not occur. If we see that this fails on the solo-io/gloo@main branch, then we should re-open this.

@shashankram
Copy link
Contributor

Seeing in 1.17.x. @ashishb-solo @sam-heilbron could you backport if not already done?

     --- FAIL: TestGlooGatewayEdgeGateway/Tracing (87.91s)
        --- PASS: TestGlooGatewayEdgeGateway/Tracing/TestSpanNameTransformationsWithRouteDecorator (17.80s)
        --- FAIL: TestGlooGatewayEdgeGateway/Tracing/TestSpanNameTransformationsWithoutRouteDecorator (18.04s)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Prioritized Indicating issue prioritized to be worked on in RFE stream release/1.18 Type: Bug Something isn't working Type: CI Test Flake A non-deterministic test that slows down development
Projects
None yet
Development

No branches or pull requests

5 participants