
[DNM] k8s gateway cluster names #10403

Closed · wants to merge 28 commits
Conversation

@jenshu commented Nov 25, 2024

Description

When Kubernetes Gateway integration is enabled (i.e. Gloo Gateway is installed with kubeGateway.enabled=true), envoy cluster names corresponding to Kubernetes Services or kube-type Upstreams are now in a new, more parseable format that lets us extract info about the Service for metrics.

For a Service destination, the cluster name will be in the format:
kube-svc:upstreamName_upstreamNs_svcNs_svcName_svcPort
For a Kube Upstream destination, the cluster name will be in the format:
kube-upstream:upstreamName_upstreamNs_svcNs_svcName_svcPort
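To make the construction concrete, below is a minimal sketch (in Go, since Gloo is written in Go) of how such a name could be assembled. The function name and signature are illustrative assumptions, not Gloo's actual API:

package clusternames

import "fmt"

// serviceClusterName sketches how the new cluster name for a Service
// destination could be built; it is not the actual Gloo implementation.
func serviceClusterName(usName, usNs, svcNs, svcName string, svcPort uint32) string {
	// Underscores are safe field separators because Kubernetes resource
	// names and namespaces cannot contain them.
	return fmt.Sprintf("kube-svc:%s_%s_%s_%s_%d", usName, usNs, svcNs, svcName, svcPort)
}

For the petstore example used in the Testing section below, this yields kube-svc:default-petstore-8080_default_default_petstore_8080.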

Notes:

  • Upstream names (discovered, in-memory, or manually-created) are not affected; only the cluster names in envoy are affected
  • As long as Kubernetes Gateway integration is enabled, both Edge and Kubernetes Gateway proxies will use the new cluster name format, for kube-type upstreams/services only.
    • For other upstream types, the cluster format remains the same as before (upstreamName_upstreamNs).
    • If Kubernetes Gateway integration is not enabled, cluster names are not affected. This is to avoid causing existing edge users to run into unexpected changes during upgrade.

Background

Currently, the cluster names we use in envoy are constructed from upstreams. There are a few types of upstreams:

  • Manually-created. These can be one of several types (kube, static, etc.) and the upstream name can be arbitrary.
  • Discovered from k8s services. These upstreams are created by the discovery pod (when discovery is enabled) from k8s services, and the upstream names follow the format svcNs-svcName-svcPort.
  • In-memory upstreams created from k8s services. These upstreams are not persisted, but are used only in memory to represent a k8s service. The upstream names follow the format kube-svc:svcNs-svcName-svcPort.

In all of the above cases, the cluster name was previously constructed by joining the upstream name and namespace (e.g. in-memory upstreams would result in a cluster name kube-svc:svcNs-svcName-svcPort_upstreamNs). Since hyphens can appear in names and namespaces, it was difficult to parse the workload (service) name/namespace out of the cluster name alone (and it wouldn't work at all for manually-created upstreams, since those names can be anything).

To fix this, we now extract info from the Upstream (regardless of how it was created) to put in the cluster name, using _ as a separator; since underscores cannot appear in Kubernetes names or namespaces, the fields can be split unambiguously. Currently we only do this for "kube"-type upstreams, but we can later extend this to other types (static, aws, etc.). Also, since cluster name changes could potentially result in some downtime during an upgrade, we only make this change if k8s gateway integration is enabled.
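As a sketch of why this makes parsing reliable, the service fields can now be recovered by splitting on the separator. The helper below is illustrative, not an actual Gloo function:

package clusternames

import "strings"

// parseServiceClusterName sketches recovering service info from the new
// kube-svc cluster name format; it is not an actual Gloo helper.
func parseServiceClusterName(cluster string) (svcNs, svcName, svcPort string, ok bool) {
	rest, found := strings.CutPrefix(cluster, "kube-svc:")
	if !found {
		return "", "", "", false
	}
	// Fields: upstreamName, upstreamNs, svcNs, svcName, svcPort. The split
	// is unambiguous because "_" cannot occur inside Kubernetes names.
	parts := strings.Split(rest, "_")
	if len(parts) != 5 {
		return "", "", "", false
	}
	return parts[2], parts[3], parts[4], true
}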

Code changes

The main change is that the upstream-to-cluster conversion code now takes in an Upstream object, rather than just the upstream ref, so we can extract info from the spec and use it to construct the cluster name (a sketch of this shape follows the list below). As a result, all the places that used this code had some refactoring done. There were 3 categories of places that called UpstreamToClusterName:

  1. Some plugins used UpstreamToClusterName only to create a key to temporarily store upstream specs during translation (e.g. search for recordedUpstreams in the code). There was no need to use the new conversion code here, as all we need is a unique key, so I changed these places to just use the upstream ref's Key().
  2. Some places used UpstreamToClusterName to create a key for status reporting. I didn't want to make any user-facing changes here, so I kept the format the same as before (upstreamName_upstreamNs).
  3. Everywhere else that used UpstreamToClusterName to create an actual cluster name now needs to pass in the full upstream object.
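A minimal sketch of the shape of this change (the types and field names below are stand-ins, not the exact Gloo API):

package clusternames

import "fmt"

type ResourceRef struct{ Name, Namespace string }

type KubeSpec struct {
	ServiceName, ServiceNamespace string
	ServicePort                   uint32
}

type Upstream struct {
	Ref  ResourceRef
	Kube *KubeSpec // nil for non-kube upstream types
}

// Old shape (sketch): only the ref was available, so every cluster name was
// upstreamName_upstreamNs.
func clusterNameFromRef(ref ResourceRef) string {
	return fmt.Sprintf("%s_%s", ref.Name, ref.Namespace)
}

// New shape (sketch): with the full Upstream available, kube-type upstreams
// embed the service info; other types (and status-reporting keys) keep the
// old format.
func clusterNameFromUpstream(us Upstream, kubeGatewayEnabled bool) string {
	if kubeGatewayEnabled && us.Kube != nil {
		return fmt.Sprintf("kube-upstream:%s_%s_%s_%s_%d",
			us.Ref.Name, us.Ref.Namespace,
			us.Kube.ServiceNamespace, us.Kube.ServiceName, us.Kube.ServicePort)
	}
	return clusterNameFromRef(us.Ref)
}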

Interesting decisions

We went through a few iterations of how to do this. We first tried storing the service name/namespace as annotations on the Upstream, but that only worked for in-memory upstreams, since we didn't want to modify persisted upstreams with newly added annotations. We then tried using the new cluster name format only for k8s gateway proxies, but kept running into issues with not having the necessary info to plumb through everywhere it was needed (e.g. the proxy type wasn't always available when trying to convert upstreams to cluster names). We landed on the current design because it was easier to turn the behavior on/off globally based on whether the k8s gateway enabled flag was set, and because passing the whole upstream gave us all the info we needed from the spec.

Testing

Unit tests were added for converting between upstreams and cluster names.

End-to-end testing was manually done via the following steps:

# clean existing cluster and build/install local chart:
kind delete cluster --name kind
rm -rf _test _output
ci/kind/setup-kind.sh
helm upgrade -i -n gloo-system gloo ./_test/gloo-1.0.0-ci1.tgz --create-namespace --set kubeGateway.enabled=true

# create example service
k apply -f https://raw.githubusercontent.com/solo-io/gloo/v1.13.x/example/petstore/petstore.yaml

# create gw and route to service
k apply -f - <<EOF
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: gw1
spec:
  gatewayClassName: gloo-gateway
  listeners:
    - protocol: HTTP
      port: 8080
      name: http
      allowedRoutes:
        namespaces:
          from: All
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route1
spec:
  parentRefs:
    - name: gw1
  hostnames:
    - "example1.com"
  rules:
    - backendRefs:
        - name: petstore
          port: 8080
EOF

# send traffic (curl a few times)
k port-forward -n default deploy/gloo-proxy-gw1 8080
curl -v -H "host: example1.com" localhost:8080/api/pets

# check stats via the envoy admin interface
k port-forward -n default deploy/gloo-proxy-gw1 19000
curl -s http://localhost:19000/stats/prometheus | grep envoy_cluster_upstream_rq_total

# look for a line like the one below with the kube-svc cluster name (the ':' in the configured cluster name shows up as '_' in the prometheus stats output), and ensure the number of requests shown matches the number of requests sent
envoy_cluster_upstream_rq_total{envoy_cluster_name="kube-svc_default-petstore-8080_default_default_petstore_8080"}

# now try routing to upstream (instead of service):
k apply -f - <<EOF
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: petstore-upstream
spec:
  kube:
    serviceName: petstore
    serviceNamespace: default
    servicePort: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route1
spec:
  parentRefs:
    - name: gw1
  hostnames:
    - "example1.com"
  rules:
    - backendRefs:
        - name: petstore-upstream
          port: 80
          kind: Upstream
          group: gloo.solo.io
EOF

# send traffic again
curl -v -H "host: example1.com" localhost:8080/api/pets

# check stats, there should be a number of requests shown for this cluster name:
envoy_cluster_upstream_rq_total{envoy_cluster_name="kube-upstream_petstore-upstream_default_default_petstore_8080"}

Automated e2e testing, which sends traffic to a route and ensures the right cluster names show up in the envoy metrics, will be added in a follow-up PR.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

@@ -68,10 +66,6 @@ type translatorInstance struct {
	shouldEnforceNamespaceMatch bool
}

func NewDefaultTranslator(settings *v1.Settings, pluginRegistry plugins.PluginRegistry) *translatorInstance {
Author (@jenshu):

I guess this wasn't used anywhere.

@jenshu marked this pull request as ready for review November 25, 2024 19:30
@solo-changelog-bot commented:

Issues linked to changelog:
https://github.com/solo-io/solo-projects/issues/7105


}

// Don't use dots in the name as it messes up prometheus stats
return fmt.Sprintf("%s_%s", ref.GetName(), ref.GetNamespace())

Is there a limit on the max length? IIRC Grafana has something around 128 chars, which might be similar for Prometheus.

Member:

Not sure which Grafana limit you have in mind (it's just the visualization layer backed by a regular database), but in Prometheus one can specify limits on label names/values in bytes. These are all 0 (no limit) by default.

While no one is a big fan of long metric/label names, Envoy is doing a pretty terrible job on this front already and we aren't really making that worse.

A max length for the Prometheus metric key. Just want to be on the safe side in case there's a limit and the key gets truncated.

@@ -30,7 +54,7 @@ var _ = Describe("Plugin", func() {
},
}
out := &envoy_config_route_v3.Route{}
err := p.ProcessRoute(plugins.RouteParams{}, in, out)

Would we need to test an empty RouteParams?

Author (@jenshu):

Realistically the RouteParams will never be empty; we have several plugins that do a params.Snapshot.<resourceType>.Find without doing a nil check. This plugin now does a Find where it didn't before, so that's why the test had to change to pass in a non-empty RouteParams.

@sam-heilbron left a comment:

I think we need an upgrade test to prove that this is a zero-downtime upgrade OR clear documentation outlining how to safely perform an upgrade from 1.17 to 1.18.

description: >-
When using the Kubernetes Gateway API, envoy cluster names corresponding to Kubernetes Service Upstreams
are now in a new more parseable format that allows us to extract info about the Service for metrics:
`upstreamName_upstreamNs_svcNs_svcName_svcPort` (underscore-separated). Note that as long as Kubernetes
Member: Is this up-to-date: upstreamName_upstreamNs_svcNs_svcName_svcPort?

Author (@jenshu): yes!

@jenshu mentioned this pull request Dec 3, 2024
@jenshu (Author) commented Dec 5, 2024:

Closing in favor of #10439, which only modifies stats names.

@jenshu closed this Dec 5, 2024