
Upstream not found when configuring opentelemetry collector should be a warning, not an error #10293

Closed
ashishb-solo opened this issue Nov 7, 2024 · 9 comments · Fixed by solo-io/gloo#10458
Labels: Prioritized, release/1.17, release/1.18, Type: Bug

Comments

ashishb-solo (Contributor) commented Nov 7, 2024

Gloo Edge Product

Open Source

Gloo Edge Version

1.17.16

Kubernetes Version

n/a

Describe the bug

If the upstream referenced in the tracing collector configuration is not found, it should produce a warning, not an error:

apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  labels:
    app: gloo
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  httpGateway:
    options:
      httpConnectionManagerSettings:
        tracing:
          openTelemetryConfig:
            collectorUpstreamRef: # CHANGE THIS TO A NON-EXISTING UPSTREAM
              namespace: gloo-system
              name: opentelemetry-collector

Once we fix this, we should remove the Eventually statement in this conversation

Expected Behavior

Produce a warning and not an error

Steps to reproduce the bug

  1. Apply otel-config.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-conf
  namespace: gloo-system
  labels:
    app: opentelemetry
    component: otel-agent-conf
data:
  otel-agent-config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    exporters:
      otlp:
        endpoint: "otel-collector.default:4317"
        tls:
          insecure: true
        sending_queue:
          num_consumers: 4
          queue_size: 100
        retry_on_failure:
          enabled: true
    processors:
      batch:
      memory_limiter:
        # 80% of maximum memory up to 2G
        limit_mib: 400
        # 25% of limit up to 2G
        spike_limit_mib: 100
        check_interval: 5s
    extensions:
      zpages: {}
      memory_ballast:
        # Memory Ballast size should be max 1/3 to 1/2 of memory.
        size_mib: 165
    service:
      extensions: [zpages, memory_ballast]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  namespace: gloo-system
  labels:
    app: opentelemetry
    component: otel-agent
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-agent
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-agent
    spec:
      containers:
      - command:
          - "/otelcol"
          - "--config=/conf/otel-agent-config.yaml"
        image: otel/opentelemetry-collector:0.68.0
        name: otel-agent
        resources:
          limits:
            cpu: 500m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 55679 # ZPages endpoint.
        - containerPort: 4317 # Default OpenTelemetry receiver port.
        - containerPort: 8888  # Metrics.
        volumeMounts:
        - name: otel-agent-config-vol
          mountPath: /conf
      volumes:
        - configMap:
            name: otel-agent-conf
            items:
              - key: otel-agent-config
                path: otel-agent-config.yaml
          name: otel-agent-config-vol
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  namespace: gloo-system
  labels:
    app: opentelemetry
    component: otel-collector-conf
data:
  otel-collector-config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
      memory_limiter:
        # 80% of maximum memory up to 2G
        limit_mib: 1500
        # 25% of limit up to 2G
        spike_limit_mib: 512
        check_interval: 5s
    extensions:
      zpages: {}
      memory_ballast:
        # Memory Ballast size should be max 1/3 to 1/2 of memory.
        size_mib: 683
    exporters:
      logging:
        loglevel: debug
      zipkin:
        endpoint: "http://zipkin:9411/api/v2/spans"
        tls:
          insecure: true
    service:
      extensions: [zpages, memory_ballast]
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, zipkin]
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: gloo-system
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  ports:
  - name: otlp-grpc # Default endpoint for OpenTelemetry gRPC receiver.
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: otlp-http # Default endpoint for OpenTelemetry HTTP receiver.
    port: 4318
    protocol: TCP
    targetPort: 4318
  - name: metrics # Default endpoint for querying metrics.
    port: 8888
  selector:
    component: otel-collector
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: gloo-system
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector
  minReadySeconds: 5
  progressDeadlineSeconds: 120
  replicas: 1 #TODO - adjust this to your own requirements
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-collector
    spec:
      containers:
      - command:
          - "/otelcol"
          - "--config=/conf/otel-collector-config.yaml"
        image: otel/opentelemetry-collector:0.68.0
        name: otel-collector
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 200m
            memory: 400Mi
        ports: # Comment out ports for platforms as needed.
        - containerPort: 55679 # Default endpoint for ZPages.
        - containerPort: 4317 # Default endpoint for OpenTelemetry receiver.
        - containerPort: 14250 # Default endpoint for Jaeger gRPC receiver.
        - containerPort: 14268 # Default endpoint for Jaeger HTTP receiver.
        - containerPort: 9411 # Default endpoint for Zipkin receiver.
        - containerPort: 8888  # Default endpoint for querying metrics.
        volumeMounts:
        - name: otel-collector-config-vol
          mountPath: /conf
#        - name: otel-collector-secrets
#          mountPath: /secrets
      volumes:
        - configMap:
            name: otel-collector-conf
            items:
              - key: otel-collector-config
                path: otel-collector-config.yaml
          name: otel-collector-config-vol
#        - secret:
#            name: otel-collector-secrets
#            items:
#              - key: cert.pem
#                path: cert.pem
#              - key: key.pem
#                path: key.pem
  2. Apply tracing.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: echo-server
  name: echo-server
  namespace: gloo-system
spec:
  selector:
    matchLabels:
      app: echo-server
  replicas: 1
  template:
    metadata:
      labels:
        app: echo-server
    spec:
      containers:
      # remember to change port to 80 and disable http2 if using httpbin
      # - image: kennethreitz/httpbin
      - image: jmalloc/echo-server
        name: echo-server
        env:
        - name: LOG_HTTP_HEADERS
          value: "true"
        - name: LOG_HTTP_BODY
          value: "true"
        ports:
        - containerPort: 8080
          name: grpc
---
apiVersion: v1
kind: Service
metadata:
  name: echo-server
  namespace: gloo-system
  labels:
    service: echo-server
spec:
  ports:
  - port: 8080
    # targetPort: 80
    protocol: TCP
  selector:
    app: echo-server
---
# gloo resources
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: echo-server
  namespace: gloo-system
spec:
  useHttp2: true
  static:
    hosts:
    - addr: echo-server
      port: 8080
---
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: "opentelemetry-collector"
  namespace: gloo-system
spec:
  # REQUIRED FOR OPENTELEMETRY COLLECTION
  useHttp2: true
  kube:
    # selector:
    serviceName: otel-collector
    serviceNamespace: gloo-system
    servicePort: 4317
  # static:
  #   hosts:
  #     - addr: "otel-collector"
  #       port: 4317
---
apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  labels:
    app: gloo
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  httpGateway:
    options:
      httpConnectionManagerSettings:
        tracing:
          openTelemetryConfig:
            collectorUpstreamRef: # CHANGE THIS TO A NON-EXISTING UPSTREAM
              namespace: gloo-system
              name: opentelemetry-collector
---
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
    - '*'
    options:
      stagedTransformations:
        regular:
          requestTransforms:
            - responseTransformation:
                transformationTemplate:
                  headers:
                    test_header:
                      text: '{{header("ABCD")}}'
                  spanTransformer:
                    name:
                      text: '{{header("Host")}}'
    routes:
    - matchers:
       - prefix: /route1
      routeAction:
        single:
          upstream:
            name: echo-server
            namespace: gloo-system
      options:
        autoHostRewrite: true
        # tracing:
        #   routeDescriptor: MYTESTROUTEDESCRIPTOR
    - matchers:
       - prefix: /route2
      routeAction:
        single:
          upstream:
            name: echo-server
            namespace: gloo-system
      options:
        autoHostRewrite: true
        # tracing:
        #   routeDescriptor: YETANOTHERROUTEDESCRIPTOR
    - matchers:
       - prefix: /
      routeAction:
        single:
          upstream:
            name: echo-server
            namespace: gloo-system
      options:
        autoHostRewrite: true
        # tracing:
        #   routeDescriptor: THISROUTEHASNOPREFIX
  3. Edit the marked line above (near collectorUpstreamRef) to reference an upstream that does not exist.

This should result in a warning, not an error.

Additional Environment Detail

No response

Additional Context

https://github.com/solo-io/gloo/pull/10123/files#r1803187220

Here is where the error is created: https://github.com/solo-io/gloo/blob/875b521197b4d4cbf79ed4008bf1c5cefd46b3db/projects/gloo/pkg/plugins/tracing/plugin.go#L311

It is uncertain which versions this affects; we should investigate all LTS branches to determine whether this behaviour is present and whether the fix should be backported.

Workarounds: see the test linked above, which wraps the kubectl apply statement in an Eventually block. If this is hit in production, it may be necessary to ensure that the upstream is applied and accepted before the openTelemetryConfig upstream reference is configured.

See also: https://github.com/solo-io/solo-projects/issues/6321


davidjumani (Contributor) commented Nov 14, 2024

Looking into this: in this case (as with all NetworkFilterPlugins), any error returned is treated as an HTTPListenerError, and it is never checked whether it is a warning: https://github.com/k8sgateway/k8sgateway/blob/0fe3e5a9e82dde867f2dc3361eb9cf78f700de2e/projects/controller/pkg/translator/network_filters.go#L173
Most other plugins (VirtualHostPlugins, RoutePlugins) check whether the returned error is a warning and process it accordingly, e.g.
https://github.com/k8sgateway/k8sgateway/blob/0fe3e5a9e82dde867f2dc3361eb9cf78f700de2e/projects/controller/pkg/translator/route_config.go#L170, which eventually bubbles down to a check of whether it is an error or a warning:
https://github.com/k8sgateway/k8sgateway/blob/0fe3e5a9e82dde867f2dc3361eb9cf78f700de2e/projects/controller/pkg/translator/reporting.go#L173
The network filter path can be updated to follow suit. There appears to be a TODO for this here

ashishb-solo (Contributor, Author) commented Nov 14, 2024

> Most other plugins have checks to see if the error returned is a warning and processed accordingly.

Yeah, I can at least (hopefully) help with the logic for detecting whether it's a warning or an error. The error is created here in configuration_error.go. My assumption is that you should be able to cast the error in the tracing plugin to the BaseConfigurationError type and call IsWarning() on it to check whether it's a warning.

In terms of how you process the warning, I'm not 100% sure what that would look like. I'm not sure the tracing plugin has the structure required to handle that just yet, unfortunately.

davidjumani (Contributor) commented Nov 14, 2024

The error is properly created as a warning, but when it is processed, the handler does not check whether it is a warning and treats it as an error. The whole listener error reporting (at least the HTTPListener reports) would need to be updated to handle warnings.

sam-heilbron (Contributor)

In solo-io#10409 I attempted to reduce the errors we see in CI due to this bug. As part of this work, we MUST remove those changes in CI, so that we are effectively testing this with retries (i.e. proving that the system is eventually consistent).

davidjumani (Contributor)

I have created a fix for this in solo-io#10458.

davidjumani (Contributor) commented Dec 17, 2024

This is fixed in OSS v1.18.1 and v1.17.17.

ashishb-solo (Contributor, Author)

@davidjumani I think we can close this then, no?

davidjumani (Contributor)

Yes, but I don't have permissions to.

jenshu (Contributor) commented Jan 2, 2025

Per the above comments, the fix has been merged.

jenshu closed this as completed Jan 2, 2025