Machine gets stuck in provisioning state #291

Open
khushalmer03 opened this issue Sep 23, 2024 · 9 comments
Labels
kind/bug Something isn't working

Comments

@khushalmer03

What steps did you take and what happened:

  • I'm trying to provision a Kubernetes cluster on Proxmox VMs using the Talos bootstrap and control plane providers.
  • I deployed the management cluster with the cluster-api-operator Helm chart (in the capi-operator-system namespace) because I want to manage it via GitOps with FluxCD. My management cluster configuration is below:
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: capi-operator
  namespace: capi-operator-system
spec:
  releaseName: capi-operator
  interval: 15m
  timeout: 90s
  chart:
    spec:
      chart: "cluster-api-operator"
      version: 0.13.0
      sourceRef:
        kind: HelmRepository
        name: capi-operator
  values:
    core: "cluster-api:v1.7.5"
    infrastructure: "proxmox:v0.5.1"
    controlPlane: "talos:v0.5.6;kubeadm:v1.4.2"
    bootstrap: "talos:v0.6.5;kubeadm:v1.4.2"
    addon: "helm"
    manager:
      featureGates:
        core:
          ClusterTopology: true
          MachinePool: true
        proxmox:
          ClusterTopology: true
---
apiVersion: operator.cluster.x-k8s.io/v1alpha2
kind: IPAMProvider
metadata:
  name: proxmox-ipam
  namespace: proxmox-infrastructure-system
spec:
  version: v0.1.0-alpha.3
  fetchConfig:
    url: https://github.com/kubernetes-sigs/cluster-api-ipam-provider-in-cluster/releases/download/v0.1.0-alpha.3/ipam-components.yaml

  • I'm trying to deploy a single control plane cluster from a Proxmox VM template that has Talos initialized in it. This is the cluster manifest I'm using to provision the Talos cluster on Proxmox VMs:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: talos-test
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 10.0.6.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: talos-test
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: ProxmoxCluster
    name: pride
  controlPlaneEndpoint:
    host: "10.0.1.164"
    port: 6443
  
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: ProxmoxCluster
metadata:
  name: pride
spec:
  controlPlaneEndpoint:
    host: "10.0.1.164"
    port: 6443
  ipv4Config:
    addresses: [10.0.1.174-10.0.1.175]
    prefix: 20
    gateway: 10.0.1.1
  dnsServers: [10.0.1.1]
  allowedNodes: [px1]
  credentialsRef:
    name: "pride-proxmox-credentials"

---
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: talos-test
spec:
  version: v1.30.1
  replicas: 1
  infrastructureTemplate:
    kind: ProxmoxMachineTemplate
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    name: talos-cp
  controlPlaneConfig:
    init:
      generateType: init
    controlplane:
      generateType: controlplane
      talosVersion: v1.7.4
        
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: ProxmoxMachineTemplate
metadata:
  name: "talos-cp"
spec:
  template:
    spec:
      sourceNode: "px1"
      templateID: 110
      format: "qcow2"
      full: true
      numSockets: 1
      numCores: 2
      memoryMiB: 2048
      disks:
        bootVolume:
          disk: scsi0
          sizeGb: 8
      network:
        default:
          bridge: vmbr0
          model: virtio

The machine is created successfully in Proxmox and an IP is assigned to it, but the Machine phase in the management cluster is stuck in Provisioning. As a result, Talos never starts bootstrapping, because it keeps waiting for the infrastructure to be ready. When I checked the logs of capmox-controller-manager, this is what I found:

E0923 08:06:57.505071       1 controller.go:329] "Reconciler error" err="failed to reconcile VM: error waiting for agent: the operation has timed out" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="talos-cp2-n9h6b" name="talos-cp2-n9h6b" reconcileID="f55e43c1-ba7d-43f5-bbee-8c1fd90b3202"

Note that I have already enabled the QEMU guest agent in the VM template as well.

Status of machine in management cluster:

image

What did you expect to happen:
Machine is provisioned successfully and control plane is initialized.

Environment:

  • Cluster-api-provider-proxmox version: v0.5.1
  • Kubernetes version (use kubectl version): v1.30.1
  • OS (e.g. from /etc/os-release): Talos v1.7.4
@khushalmer03 khushalmer03 added the kind/bug Something isn't working label Sep 23, 2024
@mcbenjemaa
Member

Please share the cluster manifest you're using so we can debug.

@rouke-broersma

rouke-broersma commented Nov 22, 2024

I have the same issue I think. Relevant capmox logs:

E1122 21:34:46.708828       1 proxmoxmachine_controller.go:209] "error reconciling VM" err="unable to get cloud-init status: no pid returned from agent exec command" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="embla-cluster/embla-proxmox-control-plane-machine-template-8tt4r" namespace="embla-cluster" name="embla-proxmox-control-plane-machine-template-8tt4r" reconcileID="b447c5f3-11b6-41bc-a800-b7a75182246e" machine="embla-cluster/embla-talos-control-plane-qfhts" cluster="embla-cluster/embla-cluster"
E1122 21:34:46.709446       1 controller.go:329] "Reconciler error" err="failed to reconcile VM: unable to get cloud-init status: no pid returned from agent exec command" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="embla-cluster/embla-proxmox-control-plane-machine-template-8tt4r" namespace="embla-cluster" name="embla-proxmox-control-plane-machine-template-8tt4r" reconcileID="b447c5f3-11b6-41bc-a800-b7a75182246e"

Templates: https://github.com/broersma-forslund/homelab/tree/90723ca8735a855e537ac792ce102a8ca5260f8f/apps/infrastructure/embla-cluster/templates

Possibly related to #290? It specifically mentions Talos, but it seems this CRD version has not been released yet. Is there any chance of a release with these fixes soon, so we can deploy this with Talos?

Edit:

I switched to a cloud-init compatible Talos image; however, it seems the cloud-init config is crashing Talos:

image

Seems to be an issue in talos: siderolabs/talos#9352

@rouke-broersma

Can confirm that the issue I mentioned has been solved in Talos 1.9 alpha.3. The only remaining issue I see is that capmox is not updating the node IPs in the Machine CR, which causes Talos to wait with bootstrapping. This can be solved with skipQemuCheck in the ProxmoxMachine CR, but that has not been released yet; it's on main only.
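
For readers following along, here is a minimal sketch of where these checks sit in a ProxmoxMachineTemplate. The field names (skipQemuGuestAgent, skipCloudInitStatus) are copied from the manifest posted later in this thread, not from a released API reference, so treat them as an assumption that depends on the build you run. The template name is a placeholder, and all other spec fields are omitted.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: ProxmoxMachineTemplate
metadata:
  name: example-control-plane-template   # hypothetical name, not from this thread
spec:
  template:
    spec:
      # sourceNode, templateID, disks, network, etc. omitted for brevity
      checks:
        skipQemuGuestAgent: true    # do not wait for the QEMU guest agent
        skipCloudInitStatus: true   # do not query cloud-init status through the agent
```

Both flags are shown because the second log in this thread ("unable to get cloud-init status") comes from the cloud-init check, which also runs through the agent.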

@khushalmer03
Author

@rouke-broersma Can you share working manifests?

@rouke-broersma

@khushalmer03
Author

@rouke-broersma Thanks! I used one of the forked releases that supports skipQemuCheck in the ProxmoxMachine, yet the node IP is still not being updated, even though the IP from the IPAM provider is shown as a label for the Proxmox VM. Could the issue be the template used for machine creation? I'm using a template created from a running Talos instance, because the Proxmox image builder has no option for building a Talos image.

image

@rouke-broersma

rouke-broersma commented Dec 4, 2024

I also don't use image builder; I'm pretty sure that's only for kubeadm. Did you use a nocloud type Talos image? Did you also disable the cloud-init check? Disabling only the QEMU check is not enough.

You should check the controller logs (Proxmox and Talos) to see which controller is waiting on which status.

@khushalmer03
Author

@rouke-broersma

  • I used Talos v1.9.0-alpha.3.
  • I disabled both the cloud-init check and the QEMU check.
  • The Proxmox controller logs say the ProxmoxMachine is ready, and the machines are ready as well.
    But the TalosControlPlane reports a "no route to host" error, because it is trying to reach the cluster on the IP that the IPAM provider should assign from the InClusterIPPool.
  • However, the nodes are getting their IPs from my DHCP server and not from IPAM.

Full configuration:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: capi-cluster
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 10.244.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: capi-talos-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: ProxmoxCluster
    name: capi-proxmox-cluster

---

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: ProxmoxCluster
metadata:
  name: capi-proxmox-cluster
spec:
  credentialsRef:
    name: pride-proxmox-credentials
  allowedNodes: [px1]
  controlPlaneEndpoint:
    host: 10.0.15.241
    port: 6443
  dnsServers: [10.0.1.1]
  ipv4Config:
    gateway: 10.0.1.1
    prefix: 20
    addresses:
    - 10.0.15.242-10.0.15.250
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: ProxmoxMachineTemplate
metadata:
  name: capi-proxmox-control-plane-machine-template
spec:
  template:
    spec:
      full: true
      sourceNode: "px1"
      templateID: 103
      format: qcow2
      numSockets: 1
      numCores: 2
      memoryMiB: 2048
      disks:
        bootVolume:
          disk: scsi0
          sizeGb: 8
      network:
        default:
          bridge: vmbr0
          model: virtio
          # vlan: 254
      checks:
        skipQemuGuestAgent: true
        skipCloudInitStatus: true
---
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: capi-talos-control-plane
spec:
  replicas: 2
  version: v1.30.3
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: ProxmoxMachineTemplate
    name: capi-proxmox-control-plane-machine-template
  controlPlaneConfig:
    controlplane:
      talosVersion: v1.9.0
      generateType: controlplane
      hostname:
        source: MachineName
      configPatches:
      - op: add
        path: /machine/network
        value:
          interfaces:
          - dhcp: true
            interface: eth0
            vip:
              ip: 10.0.15.240
      - op: add
        path: /machine/install
        value:
          extraKernelArgs:
          - net.ifnames=0

Machines:

image

Taloscontrolplanes status:

image

Note that I set the control plane endpoint to 10.0.15.241, but the TalosControlPlane is trying to reach 10.0.15.242, which is an IP from the ipv4Config address range.

@rouke-broersma

The Proxmox provider does not support DHCP yet, so you need to disable DHCP. Also, Talos is trying to configure the node, so of course it is trying to reach the node IP and not the control plane IP. Your node needs to be able to ARP its assigned IP, and it should then be routable from your Cluster API provider.
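
A minimal sketch of the DHCP change, based on the configPatches from the TalosControlPlane posted earlier in this thread. Only the dhcp value is changed; everything else is copied as posted. The assumption here, which is not confirmed in this thread, is that a nocloud Talos image then takes its address from the network configuration that capmox generates from the IPAM-assigned IP.

```yaml
# Fragment of the TalosControlPlane spec; fields not shown stay as posted above.
controlPlaneConfig:
  controlplane:
    talosVersion: v1.9.0
    generateType: controlplane
    configPatches:
    - op: add
      path: /machine/network
      value:
        interfaces:
        - interface: eth0
          dhcp: false          # was true; assumption: the address now comes from the provider-supplied config
          vip:
            ip: 10.0.15.240    # unchanged from the posted manifest
```

After applying a change like this, the node should come up on an address from the ipv4Config range (10.0.15.242-10.0.15.250 in the manifest above), which is the address the TalosControlPlane controller tries to reach.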
