Single-Node Talos Upgrade Fails with Rook Ceph (Node not Draining) #10169

muhlba91 · 2025-01-20T10:20:02Z

Hi,

I cannot upgrade Talos (currently on 1.8.2) because the cluster is not shutting down all pods properly, and Talos is reverting the upgrade.

The upgrade command is `talosctl upgrade --nodes 10.0.50.1 --image factory.talos.dev/installer/01afe9cdcc0d4f3c7de8b551795019845eed0eafcf87aa2dd264af999aabc9a0:v1.9.2 --preserve --timeout=2h0m0s`.

After I issue the command, Talos tries to drain the node, and the drain gets stuck on the Ceph provisioners and on pods with PVs, which throw errors about being unable to reach the Ceph cluster. This makes sense: the Ceph cluster itself drains pretty fast, and that seems to prevent the other pods from terminating.

At the end of the drain, pods that need PVs are stuck in Running or Terminating, and the Ceph provisioners and the NVIDIA drivers are still running. I can manually force-terminate the pods that need PVs; the other pods, however, belong to DaemonSets, are unaffected by the drain, and keep running.

This, in turn, leads to the cluster not draining entirely and Talos reverting the upgrade.

Has anyone else run into such issues, or does anyone know how to get this working?
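
For readers hitting the same situation, the manual inspection and force-termination described above might look roughly like the sketch below. The node name, namespace, and pod name are placeholders, not values from this thread, and the `run` helper is a hypothetical wrapper that only prints each command unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch only: NODE_NAME, my-namespace and my-pod are hypothetical placeholders.
set -eu

NODE_NAME="${NODE_NAME:-my-node}"

# Print the command by default; execute it only when RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# List every pod still scheduled on the node, across all namespaces.
list_node_pods() {
  run kubectl get pods --all-namespaces --field-selector "spec.nodeName=${NODE_NAME}" -o wide
}

# Force-delete one pod without waiting for a graceful shutdown.
# Arguments: namespace, pod name.
force_delete_pod() {
  run kubectl delete pod "$2" --namespace "$1" --grace-period=0 --force
}

list_node_pods
force_delete_pod my-namespace my-pod
```

Force-deleting only removes the pod object from the API server without waiting for the kubelet to confirm shutdown, and DaemonSet pods are recreated by their controller, which matches the behaviour described in the post.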

smira (Maintainer) · 2025-01-20T10:39:40Z

You can try `--force`, which would prevent draining, IIRC.
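
As a sketch, this would be the upgrade command from the question with `--force` appended. The `run` helper is a hypothetical wrapper that only prints the command unless `RUN=1` is set. It is worth checking `talosctl upgrade --help` for your version first to see exactly what the flag skips, since a forced upgrade may bypass safety checks.

```shell
#!/bin/sh
# Sketch only: same node and installer image as in the question, plus --force.
set -eu

NODE_IP="10.0.50.1"
INSTALLER_IMAGE="factory.talos.dev/installer/01afe9cdcc0d4f3c7de8b551795019845eed0eafcf87aa2dd264af999aabc9a0:v1.9.2"

# Print the command by default; execute it only when RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Original upgrade invocation with --force added at the end.
upgrade_forced() {
  run talosctl upgrade --nodes "$NODE_IP" --image "$INSTALLER_IMAGE" \
    --preserve --timeout=2h0m0s --force
}

upgrade_forced
```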

But the best way is to submit an issue with full logs and a way to reproduce.
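
One possible way to gather logs for such an issue is sketched below. Which logs the maintainers actually need is an assumption here; the sketch simply collects the kernel log, the kubelet log, and recent Kubernetes events. The `run` helper is a hypothetical wrapper that only prints each command unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch only: the choice of logs is an assumption, not a maintainer requirement.
set -eu

NODE_IP="10.0.50.1"

# Print the command by default; execute it only when RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

collect_logs() {
  # Kernel log of the Talos node.
  run talosctl --nodes "$NODE_IP" dmesg
  # Log of the kubelet service on the node.
  run talosctl --nodes "$NODE_IP" logs kubelet
  # Cluster events in chronological order.
  run kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
}

collect_logs
```

Saved as, say, `collect.sh` (a hypothetical file name), it could be run as `RUN=1 sh collect.sh > upgrade-logs.txt 2>&1` right after a failed upgrade attempt, with the resulting file attached to the issue.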
