Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Omit cloud-provider-name if rke_disable_cloud_controller is enabled #273

Merged
merged 1 commit into from
Nov 20, 2024

Conversation

edward2a
Copy link
Contributor

@edward2a edward2a commented Nov 7, 2024

Otherwise the taint "node.cloudprovider.kubernetes.io/uninitialized" is applied and cluster does not finish setup as nodes are unschedulable.

Description

Started a test cluster in virtualbox (a.k.a on-prem, 1 master, 1 worker) and there was pending jobs and pods (cluster did not finish setup).

On inspection, the worker was cordoned with "node.cloudprovider.kubernetes.io/uninitialized".

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Small minor change not affecting the Ansible Role code (GitHub Actions Workflow, Documentation etc.)

How Has This Been Tested?

Deploy into a two-node cluster and check pods afterwards to verify there are no Pending pods (jobs or runtime components).

otherwise the taint "node.cloudprovider.kubernetes.io/uninitialized" is
applied and cluster does not finish setup as nodes are unschedulable.
@MonolithProjects MonolithProjects self-assigned this Nov 20, 2024
@MonolithProjects MonolithProjects added the bug Something isn't working label Nov 20, 2024
Copy link
Collaborator

@MonolithProjects MonolithProjects left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MonolithProjects MonolithProjects merged commit bcb089f into lablabs:main Nov 20, 2024
6 checks passed
@Raboo
Copy link
Contributor

Raboo commented Nov 27, 2024

This achieved the opposite. A brand new cluster gets node.cloudprovider.kubernetes.io/uninitialized taint with default settings.

You should remember that rke2_disable_cloud_controller is a negative, i.e. when false it enables the RKE2 cloud controller. When true it disables RKE2 cloud controller.

So the code before this PR was correct.

{% if (rke2_disable_cloud_controller | bool ) %}
disable-cloud-controller: true
cloud-provider-name: "{{ rke2_cloud_provider_name }}"
{% endif %}

vs

{% if (rke2_disable_cloud_controller | bool ) %}
disable-cloud-controller: true
{% endif %}
{% if not (rke2_disable_cloud_controller | bool ) %}
cloud-provider-name: "{{ rke2_cloud_provider_name }}"
{% endif %}

Your code is trying to use an external cloud contoller when the internal one is enabled.

@edward2a @MonolithProjects

This should be reverted.

Raboo added a commit to Raboo/ansible-role-rke2 that referenced this pull request Nov 27, 2024
Revert lablabs#273.
It actually does the opposite of what it says it's trying to achieve
@edward2a
Copy link
Contributor Author

@Raboo Did I miss up on the logic?

Original:

# default, cloud controller is enabled and cloud-provider-name is not set
if false:
    # nothing is printed
fi
# cloud controller is disabled and cloud-provider-name is set (why?)
if true:
    rke2_disable_cloud_controller: true
    rke2_cloud_provider_name: "external"
fi

Updated:

# cloud controller is enabled and cloud-provider-name is defined
if false:
    # nothing is printed
fi

# cloud controller is enabled (not false is true, cloud-provider-name is printed)
if not false:
    rke2_cloud_provider_name: "external"
fi
# cloud controller is disabled and cloud-provider-name not set
if true:
    rke2_disable_cloud_controller: true
fi

# cloud controller is disabled (not true is false, cloud-provider-name is not printed)
if not true:
    # nothing is printed
fi

@edward2a
Copy link
Contributor Author

Original:

test.yml

---
- hosts: all
  connection: local
  gather_facts: false
  vars:
    rke2_disable_cloud_controller: false
    rke2_cloud_provider_name: "external"

  tasks:
  - debug:
      msg: "{{ lookup('template', './config.yml.j2') }}"

config.yml.j2

{% if (rke2_disable_cloud_controller | bool ) %}
disable-cloud-controller: true
cloud-provider-name: "{{ rke2_cloud_provider_name }}"
{% endif %}

Disable cloud controller

test.yml

...
  vars:
    rke2_disable_cloud_controller: true
    rke2_cloud_provider_name: "external"
...

Output: when "rke2_disable_cloud_controller: true", cloud-provider-name is defined, why?

@powerbook14 ~/Development/test/ansible $ ansible-playbook -i localhost, test.yml

PLAY [all] **********************************************************************************************************

TASK [debug] ********************************************************************************************************
ok: [localhost] => {}

MSG:

disable-cloud-controller: true
cloud-provider-name: "external"

PLAY RECAP **********************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

@edward2a
Copy link
Contributor Author

Updated

config.yml.j2

{% if (rke2_disable_cloud_controller | bool ) %}
disable-cloud-controller: true
{% endif %}
{% if not (rke2_disable_cloud_controller | bool ) %}
cloud-provider-name: "{{ rke2_cloud_provider_name }}"
{% endif %}

Enabled cloud controller

Output: when "rke2_disable_cloud_controller: false", no cloud-provider-name is defined.

@powerbook14 ~/Development/test/ansible $ ansible-playbook -i localhost, test.yml

PLAY [all] **********************************************************************************************************

TASK [debug] ********************************************************************************************************
ok: [localhost] => {}

MSG:

cloud-provider-name: "external"

PLAY RECAP **********************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

Disabled cloud controller

Output: when "rke2_disable_cloud_controller: true", no cloud-provider-name is defined.

@powerbook14 ~/Development/test/ansible $ ansible-playbook -i localhost, test.yml

PLAY [all] **********************************************************************************************************

TASK [debug] ********************************************************************************************************
ok: [localhost] => {}

MSG:

disable-cloud-controller: true

PLAY RECAP **********************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

@Raboo
Copy link
Contributor

Raboo commented Nov 27, 2024

So you are saying when the RKE2 cloud controller is disabled, there should be a cloud-provider-name set?
Doesn't setting cloud-provider-name disable the RKE2 cloud controller?

@edward2a
Copy link
Contributor Author

So you are saying when the RKE2 cloud controller is disabled, there should be a cloud-provider-name set? Doesn't setting cloud-provider-name disable the RKE2 cloud controller?

Perhaps I am misunderstanding how these two configurations work together or I missed some changes between RKE2 versions.
What I was trying to do was to deploy without a CCM altogether, but not have the uninitialised taint set.

@Raboo
Copy link
Contributor

Raboo commented Nov 27, 2024

Well when you install without a CCM you will get the taint. Once you install a CCM, the taint should be removed afaik..

I am now confused about the cloud-provider-name.
However I do know when cloud-provider-name is set to "external" the built in CCM doesn't seem to work since RKE2 gets the node taint.

@edward2a
Copy link
Contributor Author

My scenario is bare-metal, so the cloud-controller does not seem like something useful to me, hence why I came to this issue.

When I disabled the cloud-controller I get stamped with the "external" default which causes the taint.

I am trying to find a list of values for the setting itself but I can't find such in Rancher's docs (RKE2, SUSE, etc.) other than "external" or "aws".

@Raboo
Copy link
Contributor

Raboo commented Nov 27, 2024

Well maybe they should be configured separately?
Cause now everyone that has the cloud-controller enabled (which is the default by the way) gets the taint.

@edward2a
Copy link
Contributor Author

That sounds reasonable.
Want to update your PR to I can make a new one?

@Raboo
Copy link
Contributor

Raboo commented Nov 27, 2024

I'm wondering how it should be done without introducing breaking changes.
rke2_disable_cloud_controller: false should never have cloud-provider-name I think, at least that is the old behavior.
rke2_disable_cloud_controller: true might have a cloud-provider-name, but shouldn't automatically and the default value of the variable is external.

So if we have a special value false? cloud-provider-name: false, it shouldn't add it?

something like that?

@Raboo
Copy link
Contributor

Raboo commented Nov 27, 2024

After digging throught RKE2 source code this seems the be the available choices for cloud-provider-name:

  • aws
  • azure
  • gcp
  • rancher-vsphere
  • harvester
  • external

external is special, it disables cloud_controller.

So with this knowledge it seems that cloud_controller can be enabled at the same time with cloud-provider-name.
It was also "nullable", i.e. no value.

@edward2a
Copy link
Contributor Author

edward2a commented Nov 27, 2024 via email

@Raboo
Copy link
Contributor

Raboo commented Nov 27, 2024

that would be a breaking change, @MonolithProjects what do you think? How to proceed?

@MonolithProjects
Copy link
Collaborator

Hi there, first of all thanks for your effort @Raboo and @edward2a . What about to introduce false option for rke2_cloud_provider_name (external would be still the default value) and do it like this:

{% if (rke2_disable_cloud_controller | bool ) %}
disable-cloud-controller: true
{% if rke2_cloud_provider_name != false %}
cloud-provider-name: "{{ rke2_cloud_provider_name }}"
{% endif %}
{% endif %}

I guess this should work and not be a breaking change.

@Raboo
Copy link
Contributor

Raboo commented Nov 28, 2024

Ok, That is a good solution for now, it achieves the goal that @edward2a wants, but it should be possible to enable it even when rke2_disable_cloud_controller = false as far as I understood from my latest research.

@Raboo
Copy link
Contributor

Raboo commented Nov 28, 2024

I've updated my PR now, but we should look at introducing the breaking change by controlling rke2_disable_cloud_controller and rke2_cloud_provider_name separately.

MonolithProjects added a commit that referenced this pull request Nov 28, 2024
@maxisam
Copy link
Contributor

maxisam commented Dec 18, 2024

According to the doc:

https://docs.k3s.io/networking/networking-services#deploying-an-external-cloud-controller-manager
mentioned by rancher/rke2#4268

we shouldn't disable CCM if it is running on prem. ( and set the provider name as external)

I think the original default behavior was correct.

I update the doc here as well #289

However, I don't see anything wrong so far with the following settings, everything seems like running fine

rke2_disable_cloud_controller : true and rke2_cloud_provider_name: ""

because it still set the internal ip , and I don't have external ip for my nodes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants