All instances are down #440

miran248 · 2025-01-20T14:31:38Z

miran248
Jan 20, 2025

Hey,

Finally tried coroot today and must say i love it!
Everything, apart from the instance status worked first try!

Applications section

Instances section (first app)

For whatever reason all instances are down, no metrics apparently.
Also, in case of redis, all stand-by instances are constantly restarting, yet there's nothing in k8s events.

Any ideas?
(apart from coroot, i haven't deployed anything else)

Config is more or less default

apiVersion: coroot.com/v1
kind: Coroot
metadata:
  name: coroot
  namespace: coroot
spec:
  metricsRefreshInterval: 15s # Specifies the metric resolution interval.
  cacheTTL: 720h # Duration for which Coroot retains the metric cache.
  authBootstrapAdminPassword: admin-password # Initial admin password for bootstrapping.

Env:

coroot-operator chart 0.2.1, image 1.1.1
coroot-cluster-agent image 1.1.3
coroot-node-agent image 1.23.4
k8s v1.30.8-gke.1051000

Thanks,
Miran

Answered by def

Jan 21, 2025

Today, we released a new version of the agent that includes several fixes. While I didn’t expect this update to resolve your issue, the agent restarted because our operator automatically checks for new versions every hour and updates the components as needed.

Let’s mark this as resolved for now, but feel free to reopen the issue or raise a new one if anything goes wrong. Thanks again for your report!

View full answer

def · 2025-01-20T14:35:32Z

def
Jan 20, 2025
Maintainer

Hi @miran248, thank you for the report! This appears to be a bug. Could you please share the OS and kernel version used on the worker nodes?

0 replies

miran248 · 2025-01-20T14:43:13Z

miran248
Jan 20, 2025
Author

Hey @def, thanks for the lightning reply!

Here's the node info from one of the nodes

  nodeInfo:
    architecture: amd64
    bootID: ...
    containerRuntimeVersion: containerd://1.7.24
    kernelVersion: 6.1.112+
    kubeProxyVersion: v1.30.8-gke.1051000
    kubeletVersion: v1.30.8-gke.1051000
    machineID: ...
    operatingSystem: linux
    osImage: Container-Optimized OS from Google
    systemUUID: ...

And a few annotations (prob not important)

beta.kubernetes.io/instance-type=c2-standard-4
cloud.google.com/gke-boot-disk=pd-ssd
cloud.google.com/gke-container-runtime=containerd
cloud.google.com/gke-cpu-scaling-level=4
cloud.google.com/gke-logging-variant=DEFAULT
cloud.google.com/gke-max-pods-per-node=110
cloud.google.com/gke-memory-gb-scaling-level=16
cloud.google.com/gke-netd-ready=true
cloud.google.com/gke-os-distribution=cos
cloud.google.com/gke-provisioning=standard
cloud.google.com/gke-stack-type=IPV4
cloud.google.com/machine-family=c2
cloud.google.com/private-node=false
iam.gke.io/gke-metadata-server-enabled=true
kubernetes.io/arch=amd64
kubernetes.io/os=linux

0 replies

def · 2025-01-20T14:46:11Z

def
Jan 20, 2025
Maintainer

Thanks! We have a few hypotheses and will validate them on the same infrastructure. We'll keep you posted.

0 replies

miran248 · 2025-01-20T14:49:36Z

miran248
Jan 20, 2025
Author

Thanks!

Btw, here's the argocd app for redis, just in case.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-hasura-redis
  namespace: argocd
spec:
  destination:
    namespace: production
    server: "https://kubernetes.default.svc"
  project: default
  source:
    chart: redis
    repoURL: https://charts.bitnami.com/bitnami
    targetRevision: 20.3.0
    helm:
      releaseName: production-hasura-redis
      values: |
        architecture: replication
        auth:
          enabled: false
        master:
          count: 1
          disableCommands: []
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
          nodeSelector:
            env: shared
          persistence:
            enabled: true
            size: 1Gi
        sentinel:
          enabled: true
          masterSet: production-hasura-redis
          masterService:
            enabled: true
          persistence:
            enabled: true
            size: 1Gi
        replica:
          automountServiceAccountToken: true
        rbac:
          create: true
        metrics:
          enabled: true
  syncPolicy:
    automated:
      prune: true
      # selfHeal: true

0 replies

def · 2025-01-21T11:26:36Z

def
Jan 21, 2025
Maintainer

I couldn't reproduce this issue on my GKE cluster.

Can you please provide the output of kubectl describe pod ... for one of the Redis pods?

0 replies

miran248 · 2025-01-21T13:29:34Z

miran248
Jan 21, 2025
Author

Hey, here's the output from KUBECONFIG=kube-config kubectl describe pod -n production production-hasura-redis-node-2 | pbcopy

Name:             production-hasura-redis-node-2
Namespace:        production
Priority:         0
Service Account:  production-hasura-redis
Node:             gke-***-420faeb3-spii/10.154.15.233
Start Time:       Mon, 20 Jan 2025 17:53:38 +0100
Labels:           app.kubernetes.io/component=node
                  app.kubernetes.io/instance=production-hasura-redis
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=redis
                  app.kubernetes.io/version=7.4.2
                  apps.kubernetes.io/pod-index=2
                  controller-revision-hash=production-hasura-redis-node-c9d4b98d
                  helm.sh/chart=redis-20.6.3
                  isMaster=true
                  statefulset.kubernetes.io/pod-name=production-hasura-redis-node-2
Annotations:      checksum/configmap: a70e7bd40189c2fa6ebc89b540d827e69bede945408cb8a65732f7074c5adfbf
                  checksum/health: 7c1bc273685f377b654b2d5d82e81c96f6c59d6963786575eccc63a56d63a6cd
                  checksum/scripts: aeff7634605d8df25975c509533bdf36aada5d284afc6c1d50572ece5034313c
                  checksum/secret: 44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a
                  prometheus.io/port: 9121
                  prometheus.io/scrape: true
Status:           Running
IP:               10.76.2.61
IPs:
  IP:           10.76.2.61
Controlled By:  StatefulSet/production-hasura-redis-node
Containers:
  redis:
    Container ID:    containerd://89bd14917a4f81157618ce2694df1f63a4248fd679ef891f39531b1fba431ab2
    Image:           docker.io/bitnami/redis:7.4.2-debian-12-r0
    Image ID:        docker.io/bitnami/redis@sha256:65f55fefc0acd7f1a1da44b39be3044bcfbc03f4a49c4689453097f929f07132
    Port:            6379/TCP
    Host Port:       0/TCP
    SeccompProfile:  RuntimeDefault
    Command:
      /bin/bash
    Args:
      -c
      /opt/bitnami/scripts/start-scripts/start-node.sh
    State:          Running
      Started:      Mon, 20 Jan 2025 17:53:43 +0100
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   256Mi
    Liveness:   exec [sh -c /health/ping_liveness_local.sh 5] delay=20s timeout=5s period=5s #success=1 #failure=5
    Readiness:  exec [sh -c /health/ping_readiness_local.sh 1] delay=20s timeout=1s period=5s #success=1 #failure=5
    Startup:    exec [sh -c /health/ping_liveness_local.sh 5] delay=10s timeout=5s period=10s #success=1 #failure=22
    Environment:
      BITNAMI_DEBUG:                                       false
      REDIS_MASTER_PORT_NUMBER:                            6379
      ALLOW_EMPTY_PASSWORD:                                yes
      REDIS_TLS_ENABLED:                                   no
      REDIS_PORT:                                          6379
      REDIS_SENTINEL_TLS_ENABLED:                          no
      REDIS_SENTINEL_PORT:                                 26379
      REDIS_DATA_DIR:                                      /data
      STAKATER_PRODUCTION_HASURA_REDIS_SCRIPTS_CONFIGMAP:  73e9131546af07c4fc943bc9d908c071d7a5aa1b
    Mounts:
      /data from redis-data (rw)
      /health from health (rw)
      /opt/bitnami/redis-sentinel/etc from sentinel-data (rw)
      /opt/bitnami/redis/etc from empty-dir (rw,path="app-conf-dir")
      /opt/bitnami/redis/mounted-etc from config (rw)
      /opt/bitnami/scripts/start-scripts from start-scripts (rw)
      /tmp from empty-dir (rw,path="tmp-dir")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lbjmm (ro)
  sentinel:
    Container ID:    containerd://d424309678bf768f473b736c0d63125a58c6918ad033fbe9d76d156c56b143ca
    Image:           docker.io/bitnami/redis-sentinel:7.4.2-debian-12-r0
    Image ID:        docker.io/bitnami/redis-sentinel@sha256:f2953d5e62b386bb2985043907f5c8af51b8466a9f9e1fc16fd0a500624bad46
    Port:            26379/TCP
    Host Port:       0/TCP
    SeccompProfile:  RuntimeDefault
    Command:
      /bin/bash
    Args:
      -c
      /opt/bitnami/scripts/start-scripts/start-sentinel.sh
    State:          Running
      Started:      Mon, 20 Jan 2025 17:53:46 +0100
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   256Mi
    Liveness:   exec [sh -c /health/ping_sentinel.sh 5] delay=20s timeout=5s period=10s #success=1 #failure=6
    Readiness:  exec [sh -c /health/ping_sentinel.sh 1] delay=20s timeout=1s period=5s #success=1 #failure=6
    Startup:    exec [sh -c /health/ping_sentinel.sh 5] delay=10s timeout=5s period=10s #success=1 #failure=22
    Environment:
      BITNAMI_DEBUG:               false
      ALLOW_EMPTY_PASSWORD:        yes
      REDIS_SENTINEL_TLS_ENABLED:  no
      REDIS_SENTINEL_PORT:         26379
    Mounts:
      /data from redis-data (rw)
      /etc/shared from kubectl-shared (rw)
      /health from health (rw)
      /opt/bitnami/redis-sentinel/etc from sentinel-data (rw)
      /opt/bitnami/redis-sentinel/mounted-etc from config (rw)
      /opt/bitnami/scripts/start-scripts from start-scripts (rw)
      /tmp from empty-dir (rw,path="tmp-dir")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lbjmm (ro)
  metrics:
    Container ID:    containerd://f8fc5a00cd2b97193213325bf2407f23622997247bd00720833362ddd822e405
    Image:           docker.io/bitnami/redis-exporter:1.67.0-debian-12-r0
    Image ID:        docker.io/bitnami/redis-exporter@sha256:5bc3229b94f62b593600ee74d0cd16c7a74df31852eb576bdc0f5e663c8e1337
    Port:            9121/TCP
    Host Port:       0/TCP
    SeccompProfile:  RuntimeDefault
    Command:
      /bin/bash
      -c
      if [[ -f '/secrets/redis-password' ]]; then
          export REDIS_PASSWORD=$(cat /secrets/redis-password)
      fi
      redis_exporter
      
    State:          Running
      Started:      Mon, 20 Jan 2025 17:53:50 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                150m
      ephemeral-storage:  2Gi
      memory:             192Mi
    Requests:
      cpu:                100m
      ephemeral-storage:  50Mi
      memory:             128Mi
    Liveness:             tcp-socket :metrics delay=10s timeout=5s period=10s #success=1 #failure=5
    Readiness:            http-get http://:metrics/ delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      REDIS_ALIAS:                        production-hasura-redis
      REDIS_EXPORTER_WEB_LISTEN_ADDRESS:  :9121
    Mounts:
      /tmp from empty-dir (rw,path="tmp-dir")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lbjmm (ro)
  kubectl-shared:
    Container ID:    containerd://10eb028dffcac550ff580c2cc7edee4468b1edb11a3e80615759295e317b276e
    Image:           docker.io/bitnami/kubectl:1.32.0-debian-12-r0
    Image ID:        docker.io/bitnami/kubectl@sha256:493d1b871556d48d6b25d471f192c2427571cd6f78523eebcaf4d263353c7487
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Command:
      /opt/bitnami/scripts/kubectl-scripts/update-master-label.sh
    State:          Running
      Started:      Mon, 20 Jan 2025 17:53:56 +0100
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/shared from kubectl-shared (rw)
      /opt/bitnami/scripts/kubectl-scripts from kubectl-scripts (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lbjmm (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  redis-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  redis-data-production-hasura-redis-node-2
    ReadOnly:   false
  sentinel-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  sentinel-data-production-hasura-redis-node-2
    ReadOnly:   false
  start-scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      production-hasura-redis-scripts
    Optional:  false
  health:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      production-hasura-redis-health
    Optional:  false
  kubectl-shared:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kubectl-scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      production-hasura-redis-kubectl-scripts
    Optional:  false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      production-hasura-redis-configuration
    Optional:  false
  empty-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-lbjmm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

(i upgraded redis to oci://registry-1.docker.io/bitnamicharts:20.6.3 later yesterday)

Initially i had issues, where coroot couldn't access metrics on some services, that was because i had metrics endpoints disabled - that was then fixed and those errors went away. Status however hasn't changed, all instances (including redis) are down.

0 replies

def · 2025-01-21T13:50:38Z

def
Jan 21, 2025
Maintainer

Could you also share a screenshot of the Memory tab for the redis app?

0 replies

miran248 · 2025-01-21T14:13:01Z

miran248
Jan 21, 2025
Author

Memory tab

I removed memory limits yesterday (they were set too low and because of that k8s would kill the instances once every few hours).

Not sure what happened, coroot is now showing all instance statuses! Everything's up now - it's been 24 hours since i deployed the app, maybe that's why?

coroot-node-agent got redeployed on its own (not restarted) at 12pm, maybe that's why it started working.
There were some errors shortly before that failed to read build info: could not read Go build info from /proc/3574391/exe: EOF

I think we can mark it as solved.

0 replies

miran248 · 2025-01-21T14:22:03Z

miran248
Jan 21, 2025
Author

Btw, now that everything's up, restarts have fixed themselves as well, they're all now at 0. This is last 60 mins.

0 replies

def · 2025-01-21T15:09:30Z

def
Jan 21, 2025
Maintainer

Today, we released a new version of the agent that includes several fixes. While I didn’t expect this update to resolve your issue, the agent restarted because our operator automatically checks for new versions every hour and updates the components as needed.

Let’s mark this as resolved for now, but feel free to reopen the issue or raise a new one if anything goes wrong. Thanks again for your report!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Coroot

All instances are down #440

{{title}}

Replies: 10 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Coroot

All instances are down #440

miran248 Jan 20, 2025

Replies: 10 comments

def Jan 20, 2025 Maintainer

miran248 Jan 20, 2025 Author

def Jan 20, 2025 Maintainer

miran248 Jan 20, 2025 Author

def Jan 21, 2025 Maintainer

miran248 Jan 21, 2025 Author

def Jan 21, 2025 Maintainer

miran248 Jan 21, 2025 Author

miran248 Jan 21, 2025 Author

def Jan 21, 2025 Maintainer

miran248
Jan 20, 2025

def
Jan 20, 2025
Maintainer

miran248
Jan 20, 2025
Author

def
Jan 20, 2025
Maintainer

miran248
Jan 20, 2025
Author

def
Jan 21, 2025
Maintainer

miran248
Jan 21, 2025
Author

def
Jan 21, 2025
Maintainer

miran248
Jan 21, 2025
Author

miran248
Jan 21, 2025
Author

def
Jan 21, 2025
Maintainer