
Performance problem: EKS control plane overwhelmed when applying large number of workloads with MeshServices enabled #12723

Closed
bartsmykla opened this issue Jan 31, 2025 · 2 comments

bartsmykla (Contributor) commented on Jan 31, 2025

Warning

This issue is a work in progress and describes the current state of knowledge about the problem.

Kuma Version

any

Describe the bug

Summary

When applying a large number of services using MeshServices in performance tests, the control plane becomes overwhelmed at around 600-700 services. This leads to excessive logging, issues with syncing deployments, and eventually, etcd failures that cause Kubernetes control plane components to restart.

Observed issues

  1. Frequent deployment sync errors in kube-controller-manager

    "Error syncing deployment" deployment="kuma-test/fake-service-001" err="Operation cannot be fulfilled on deployments.apps \"fake-service-001\": the object has been modified; please apply your changes to the latest version and try again"
    

    These errors appear repeatedly for many deployments.

  2. Endpoint slice errors before etcd becomes unstable

    "Error syncing endpoint slices for service, retrying" key="kuma-test/fake-service-017" err="EndpointSlice informer cache is out of date"
    

    These errors happen across multiple services, causing delays and resource contention.

  3. Complete failure of etcd connections

    • The system eventually cannot handle the load, leading to failures in etcd.
    • Kubernetes control plane components restart as a result.

To Reproduce

  1. Enable exclusive MeshServices for the mesh.
  2. Deploy a large number of services using fake-service (e.g., 1000 services); a rough sketch of such a setup follows this list.
  3. At around 600-700 services, control plane logs start filling up with errors.
  4. Kubernetes components (kube-controller-manager, endpointslice_controller) log multiple retries and failures.
  5. Eventually, connections to etcd fail completely, leading to control plane restarts.
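
For reference, here is a minimal client-go sketch of steps 1-2 (an illustration, not the actual perf-test harness). It assumes exclusive MeshServices are already enabled on the default Mesh (e.g. `meshServices.mode: Exclusive` on the Mesh resource in recent Kuma versions), that the kuma-test namespace exists with sidecar injection enabled, and that the fake-service image/tag below is an assumption:

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func ptr[T any](v T) *T { return &v }

func main() {
	// Assumes: the default Mesh has exclusive MeshServices enabled and the
	// kuma-test namespace is labeled kuma.io/sidecar-injection: enabled.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	const ns = "kuma-test"
	for i := 1; i <= 1000; i++ {
		name := fmt.Sprintf("fake-service-%03d", i)
		labels := map[string]string{"app": name}

		deploy := &appsv1.Deployment{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: appsv1.DeploymentSpec{
				Replicas: ptr(int32(2)), // 2 replicas per service, as in the test runs below
				Selector: &metav1.LabelSelector{MatchLabels: labels},
				Template: corev1.PodTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{Labels: labels},
					Spec: corev1.PodSpec{Containers: []corev1.Container{{
						Name:  "fake-service",
						Image: "nicholasjackson/fake-service:v0.25.2", // assumed image/tag
						Ports: []corev1.ContainerPort{{ContainerPort: 9090}},
					}}},
				},
			},
		}
		svc := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: corev1.ServiceSpec{
				Selector: labels,
				Ports:    []corev1.ServicePort{{Port: 80, TargetPort: intstr.FromInt(9090)}},
			},
		}

		if _, err := cs.AppsV1().Deployments(ns).Create(context.TODO(), deploy, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
		if _, err := cs.CoreV1().Services(ns).Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```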

Expected behavior

No response

Additional context (optional)

logs-insights-results-perf-tests-debugging-250131.json

Issues started around 09:50 (four images attached).

@bartsmykla bartsmykla added area/kuma-cp area/performance kind/bug A bug triage/accepted The issue was reviewed and is complete enough to start working on it labels Jan 31, 2025
@bartsmykla bartsmykla added this to the 2.10.x milestone Jan 31, 2025
@bartsmykla bartsmykla self-assigned this Jan 31, 2025
bartsmykla (Contributor, Author) commented on Feb 14, 2025

Additional data (WIP):

The details below represent perf test runs for a given number of services (first number) * number of replicas per service (second number).

The pink areas on the graphs show when the `should distribute certs when mTLS is enabled` test ran, and the grey areas show the duration of the kuma-test namespace deletion.

  • 100 * 2: [graph]
  • 200 * 2: [graph]
  • 300 * 2: [graph]
  • 400 * 2: [graph]
  • 500 * 2: [graph]
  • 600 * 2: [graph]
  • 700 * 2: [graph]
  • 800 * 2: [graph]
  • 900 * 2: [graph]
  • 1000 * 2: [graph]
  • 1000 * 2 (run in the same cluster after the earlier 1000 * 2 test run): [graph]

bartsmykla (Contributor, Author) commented on Feb 21, 2025

Warning

This comment is a work in progress.

Summary of Investigation on Performance Issues

After switching our performance tests to use MeshServices, we observed consistent failures, primarily due to request timeouts in the AfterEach block when pushing metrics to Prometheus. An investigation was initiated based on the assumption that introducing MeshServices had significantly degraded performance.

Investigation Process & Findings

  1. Initial Assumption: Prometheus Overload

    • Suspected that Prometheus' node was overloaded due to test services running on the same node.
    • Isolated Prometheus to a dedicated node, but tests continued to fail.
  2. Control Plane Logs Indicated ETCD Issues

    • Initially interpreted logs as ETCD crashing, leading to full control plane restarts.
    • Assumed that our mutating/validating webhooks might be overloading ETCD.
    • This turned out to be incorrect; the actual issue was API server instances exceeding resource limits.
  3. API Server Scaling Issues

    • API server instances had a default limit of 1 CPU and 2GB RAM.
    • During tests, resource usage exceeded this limit, triggering EKS to scale the control plane.
    • API server scaling caused request timeouts when requests hit instances being terminated or were delayed due to overload.
  4. ETCD Calls Analysis

    • Event Emission: Initially suspected excessive event emissions related to MeshServices but confirmed events only occur on creation or update, so initial bursts are expected but not problematic.
    • Pod Calls vs. Dataplane Calls: Higher pod-related calls were observed; this was traced back to controllers watching for pod changes, especially during sidecar injection (see the sketch after this list).
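
To illustrate the pattern behind the pod-vs-dataplane observation, here is a minimal controller-runtime sketch (illustrative only; `PodToDataplaneReconciler` is a stand-in name, not Kuma's actual controller wiring). The point is that watching Pods turns every pod event during a rollout into a reconcile plus follow-up reads and a write of the derived resource:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PodToDataplaneReconciler is an illustrative stand-in for a controller that
// derives mesh resources from Pods. Every Pod event in the cluster enqueues a
// reconcile request.
type PodToDataplaneReconciler struct {
	client.Client
}

func (r *PodToDataplaneReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	pod := &corev1.Pod{}
	if err := r.Get(ctx, req.NamespacedName, pod); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Each reconcile typically needs more context (Services, EndpointSlices)
	// to build the derived resource, and ends with a write of that resource.
	// During a rollout of hundreds of Deployments, every newly injected Pod
	// goes through this path, which is where the extra pod-related load adds up.
	svcs := &corev1.ServiceList{}
	if err := r.List(ctx, svcs, client.InNamespace(pod.Namespace)); err != nil {
		return ctrl.Result{}, err
	}
	_ = svcs // ... derive and upsert the Dataplane-like resource here ...
	return ctrl.Result{}, nil
}

func (r *PodToDataplaneReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}). // every Pod create/update/delete enqueues work
		Complete(r)
}
```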

Observations & Open Issues

  • MeshServices Calls to ETCD Without Sidecar Injection

    • MeshServices controller still sends delete requests when labels change on namespaces, K8s Services, or EndpointSlices.
    • Needs further assessment for potential optimization.
  • Errors During Namespace Deletion

    • Observed recurring errors such as:
      ERROR discovery.k8s.pod-to-dataplane-converter could not get K8S Service for service tag {"error": "failed to get Service \"kuma-test/fake-service-799\": Service \"fake-service-799\" not found"}
      
      ERROR xds.sink failed to flush DataplaneInsight {"dataplaneid": {"Mesh":"default","Name":"fake-service-736-6b479fb987-zjsxr.kuma-test"}, "error": "failed to create k8s resource: dataplaneinsights.kuma.io \"fake-service-736-6b479fb987-zjsxr\" is forbidden: unable to create new content in namespace kuma-test because it is being terminated"}
      
    • Issue opened: Avoid unnecessary processing when namespace is terminating #12858 (a rough sketch of the idea follows this list)
      • The first error comes from the configmap controller and should likely be logged at debug level instead of as an error.
      • The second error requires further investigation.
  • High Resource Consumption by Webhook

    • mesh.defaulter webhook identified as the highest resource consumer.
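
Regarding the namespace-deletion errors and #12858, here is a rough sketch of the kind of guard that could avoid the second error (hedged; `namespaceTerminating` and `flushInsight` are hypothetical helpers, not Kuma code): check whether the target namespace is terminating before attempting the write, and skip it quietly in that case.

```go
package sink

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// namespaceTerminating reports whether the namespace is in the Terminating
// phase, in which case creating new objects in it is forbidden anyway.
func namespaceTerminating(ctx context.Context, c client.Client, name string) (bool, error) {
	ns := &corev1.Namespace{}
	if err := c.Get(ctx, types.NamespacedName{Name: name}, ns); err != nil {
		if apierrors.IsNotFound(err) {
			return true, nil // namespace is already gone; nothing to flush
		}
		return false, err
	}
	return ns.Status.Phase == corev1.NamespaceTerminating, nil
}

// flushInsight shows how a flush/convert path could use the guard: instead of
// surfacing a "namespace is being terminated" error, the write is skipped
// (or logged at debug level).
func flushInsight(ctx context.Context, c client.Client, namespace string, write func() error) error {
	terminating, err := namespaceTerminating(ctx, c, namespace)
	if err != nil {
		return err
	}
	if terminating {
		return nil // drop the insight quietly; the namespace is going away
	}
	return write()
}
```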

Bug Identified & Fix
