Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make cluster stats names more parseable #10439

Merged
merged 14 commits into from
Dec 5, 2024
Merged

Conversation

jenshu
Copy link

@jenshu jenshu commented Dec 5, 2024

Description

When Kubernetes Gateway integration is enabled (i.e. Gloo Gateway is installed with kubeGateway.enabled=true), stats names for envoy clusters corresponding to Kubernetes Services or kube-type Upstreams are now in a new more parseable format that allows us to extract info about the Service for metrics.

For a Service destination, the cluster stats name will be in the format:
kube-svc_upstreamName_upstreamNs_svcNs_svcName_svcPort
For a Kube Upstream destination, the cluster stats name will be in the format:
kube-upstream_upstreamName_upstreamNs_svcNs_svcName_svcPort
Other destination types will fall back to the current stats format (which is the same as the cluster name):
upstreamName_upstreamNs

Notes:

  • Upstream names and cluster names are not affected; only the names of emitted stats are affected.
  • As long as Kubernetes Gateway integration is enabled , both Edge and Kubernetes Gateway proxies will use the new stats name format, for kube-type upstreams/services only.
  • For other upstream types, or if Kubernetes Gateway integration is not enabled, the stats names are the same as before (upstreamName_upstreamNs).

Background

See #10403 for previous attempt and background info.

Testing

Unit tests and a metrics e2e test suite were added.

This can also be tested manually via the following steps:

# clean existing cluster and build/install local chart:
kind delete cluster --name kind
rm -rf _test _output
ci/kind/setup-kind.sh
helm upgrade -i -n gloo-system gloo ./_test/gloo-1.0.0-ci1.tgz --create-namespace --set kubeGateway.enabled=true

# create example service
k apply -f https://raw.githubusercontent.com/solo-io/gloo/v1.13.x/example/petstore/petstore.yaml

# create gw and route to service
k apply -f - <<EOF
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: gw1
spec:
  gatewayClassName: gloo-gateway
  listeners:
    - protocol: HTTP
      port: 8080
      name: http
      allowedRoutes:
        namespaces:
          from: All
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route1
spec:
  parentRefs:
    - name: gw1
  hostnames:
    - "example1.com"
  rules:
    - backendRefs:
        - name: petstore
          port: 8080
EOF

# send traffic (curl a few times)
k port-forward -n default deploy/gloo-proxy-gw1 8080
curl -v -H "host: example1.com" localhost:8080/api/pets

# check stats
k port-forward -n default deploy/gloo-proxy-gw1 19000
http://localhost:19000/stats/prometheus

# look for a line like below with the kube-svc cluster name, and ensure number of requests shown matches the number of requests sent
envoy_cluster_upstream_rq_total{envoy_cluster_name="kube-svc_default-petstore-8080_default_default_petstore_8080"}

# now try routing to an upstream (instead of service):
k apply -f - <<EOF
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: petstore-upstream
spec:
  kube:
    serviceName: petstore
    serviceNamespace: default
    servicePort: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route1
spec:
  parentRefs:
    - name: gw1
  hostnames:
    - "example1.com"
  rules:
    - backendRefs:
        - name: petstore-upstream
          port: 80
          kind: Upstream
          group: gloo.solo.io
EOF

# send traffic again
curl -v -H "host: example1.com" localhost:8080/api/pets

# check stats, there should be a number of requests shown for this cluster name:
envoy_cluster_upstream_rq_total{envoy_cluster_name="kube-upstream_petstore-upstream_default_default_petstore_8080"}

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

@solo-changelog-bot
Copy link

Issues linked to changelog:
https://github.com/solo-io/solo-projects/issues/7105

Copy link

github-actions bot commented Dec 5, 2024

Visit the preview URL for this PR (updated for commit 0e92ee7):

https://gloo-edge--pr10439-cluster-stats-name-k99ozzri.web.app

(expires Thu, 12 Dec 2024 16:27:36 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 77c2b86e287749579b7ff9cadb81e099042ef677

@jenshu jenshu mentioned this pull request Dec 5, 2024
4 tasks
@jenshu jenshu changed the title Cluster stats name Make cluster stats names more parseable Dec 5, 2024
@@ -0,0 +1,10 @@
changelog:
- type: BREAKING_CHANGE
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure if this should be considered breaking or not, given that it only affects stats names

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, dashboards and alerts can break, so I'd say it is a breaking change.

@@ -115,7 +114,7 @@ func (u *Updater) UpstreamUpdated(upstream *v1.Upstream) {

func (u *Updater) UpstreamAdded(upstream *v1.Upstream) {
// upstream already tracked. ignore.
key := translator.UpstreamToClusterName(upstream.GetMetadata().Ref())
key := upstream.GetMetadata().Ref().Key()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was part of the cleanup i carried over from #10403. in some plugins (etc) we are just using the "cluster name" as key for a map used internally, and i changed those instances to use Ref.Key instead

@@ -156,6 +158,10 @@ func (t *translatorInstance) initializeCluster(
DnsRefreshRate: dnsRefreshRate,
PreconnectPolicy: preconnect,
}
// for kube gateway, use new stats name format
if envutils.IsEnvTruthy(constants.GlooGatewayEnableK8sGwControllerEnv) {
out.AltStatName = UpstreamToClusterStatsName(upstream)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the most important part

@@ -156,6 +158,10 @@ func (t *translatorInstance) initializeCluster(
DnsRefreshRate: dnsRefreshRate,
PreconnectPolicy: preconnect,
}
// for kube gateway, use new stats name format
if envutils.IsEnvTruthy(constants.GlooGatewayEnableK8sGwControllerEnv) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do things like this elsewhere but after looking at it I feel like we should be checking these once at the start of a translation rather than checking mid translation where hypothetically some clusters may have this set true and some set false.

Not a problem with this pr more of a thought for our general strategies

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we have this check and/or the proxy type check sprinkled throughout the codebase and it's a bit messy..but luckily we should be able to rip them out soon!

this is the only place that i saw that created a Cluster (besides some plugins that create Clusters for other purposes but with well-defined names that don't need this alt stat name)

Copy link

@davidjumani davidjumani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified following the steps mentioned

@soloio-bulldozer soloio-bulldozer bot merged commit 0fd03ce into main Dec 5, 2024
20 checks passed
@soloio-bulldozer soloio-bulldozer bot deleted the cluster-stats-name branch December 5, 2024 16:52
jenshu added a commit that referenced this pull request Dec 5, 2024
Co-authored-by: Nathan Fudenberg <nathan.fudenberg@solo.io>
kevin-shelaga pushed a commit that referenced this pull request Dec 13, 2024
Co-authored-by: Nathan Fudenberg <nathan.fudenberg@solo.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants