Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create custom systemd-networkd-wait-online.service override to wait on individual interfaces. (LP: #2060311) #456

Merged
merged 11 commits into from
Apr 24, 2024

Conversation

slyon
Copy link
Collaborator

@slyon slyon commented Apr 16, 2024

Description

Skip s-n-wait-online if we don't have any non-optional interfaces, using a "ConditionPathIsSymbolicLink=" checking Netplan's s-n-wait-online.service enablement symlink.

This is in favor to RequiredForOnline=yes as the behavior of upstream (pure)
systemd-networkd-wait-online.service is not mean to be used in this way.

If "RequiredForOnline=no" sd-networkd-wait-online will fully ignore the
corresponding interface and it will block/delay network-online.target if
no interfaces are "RequiredForOnline=yes" at all.

FR-7246

Note: This is a replacement for #455 that keeps compatibility with cloud-init, so it can still sort After=systemd-networkd-wait-online.service AND Before=network-online.target.

Checklist

  • Runs make check successfully.
  • Retains 100% code coverage (make check-coverage).
  • New/changed keys in YAML format are documented.
  • (Optional) Adds example YAML for new feature.
  • (Optional) Closes an open bug in Launchpad. LP#2060311

@slyon slyon changed the title Create custom systemd-networkd-wait-online.service override to wait on individual interfaces. Create custom systemd-networkd-wait-online.service override to wait on individual interfaces. (LP: #2060311) Apr 16, 2024
@slyon slyon force-pushed the s-n-wait-online branch from 91ea129 to d8fb000 Compare April 16, 2024 10:02
@slyon slyon marked this pull request as ready for review April 16, 2024 10:02
@slyon slyon requested review from daniloegea and enr0n April 16, 2024 10:02
@slyon
Copy link
Collaborator Author

slyon commented Apr 16, 2024

PPA for testing can be found in: https://launchpad.net/~slyon/+archive/ubuntu/lp2060311/+packages

@slyon slyon force-pushed the s-n-wait-online branch 2 times, most recently from 6482ac5 to 6e37de0 Compare April 16, 2024 12:18
@daniloegea
Copy link
Contributor

This seem to work well. The new 10-netplan.conf file can get overcrowded if one has lots of interfaces on Netplan but I guess it shouldn't be a problem (unless you have enough interfaces to reach the maximum number of command line arguments o.O).

I did find a case though where the behavior changes: when you have the loopback interface in your yaml. Apparently -wait-online -i lo will always get stuck. So, people with ethernets.lo in their configuration will notice that wait-online will start to block after this chance. Maybe lo should be a special case and never included in the list?

@enr0n
Copy link

enr0n commented Apr 16, 2024

This seem to work well. The new 10-netplan.conf file can get overcrowded if one has lots of interfaces on Netplan but I guess it shouldn't be a problem (unless you have enough interfaces to reach the maximum number of command line arguments o.O).

I did find a case though where the behavior changes: when you have the loopback interface in your yaml. Apparently -wait-online -i lo will always get stuck. So, people with ethernets.lo in their configuration will notice that wait-online will start to block after this chance. Maybe lo should be a special case and never included in the list?

Yeah either ignore it or would have to set a different operational state range, e.g. -o carrier. But the default behavior of systemd-networkd-wait-online (i.e. when no -i are specified on the command line) is to ignore loopback.

Copy link

@enr0n enr0n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new behavior matches what we have discussed. I did some testing with various combinations of optional: for eth0 and wlan0 and things work as expected in all cases.

@TheRealFalcon
Copy link
Member

While I'm not seeing issues specifically related to cloud-init, this doesn't seem to be working correctly, or I'm not understanding how it is supposed to work. After install the PPA, if I reboot, regardless what I have under /etc/netplan, /run/systemd/system/systemd-networkd-wait-online.service.d/10-netplan.conf contains:

[Unit]
ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service

and nothing else.

@slyon
Copy link
Collaborator Author

slyon commented Apr 16, 2024

regardless what I have under /etc/netplan, /run/systemd/system/systemd-networkd-wait-online.service.d/10-netplan.conf contains:

[Unit]
ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service

and nothing else.

@TheRealFalcon As long as you have any network definition in /etc/netplan/ that uses networkd as a renderer and does NOT use "optional: true" you should see the corresponding interface listed in that file after calling "netplan apply" (or reboot).

@TheRealFalcon
Copy link
Member

Does it matter that I'm in an LXD container?

root@asdf:~# cat /etc/netplan/50-cloud-init.yaml 
# This file is generated from information provided by the datasource.  Changes
# to it will not persist across an instance reboot.  To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    renderer: networkd
    ethernets:
        eth0:
            dhcp4: true
            match:
                macaddress: 12:34:56:78:90:ab
            # optional: true
        eth1:
            dhcp4: true
            match:
                macaddress: 00:16:3e:38:c3:9e
            # optional: false
    version: 2
root@asdf:~# netplan apply
root@asdf:~# cat /run/systemd/system/systemd-networkd-wait-online.service.d/10-netplan.conf 
[Unit]
ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service
root@asdf:~# reboot
root@asdf:~# 
Session terminated, killing shell...
 ...killed.
(cloud-init37) cow:net $ lxc shell asdf
root@asdf:~# cat /run/systemd/system/systemd-networkd-wait-online.service.d/10-netplan.conf 
[Unit]
ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service
root@asdf:~# 

@slyon
Copy link
Collaborator Author

slyon commented Apr 16, 2024

Does it matter that I'm in an LXD container?

It worked in an LXD container for me earlier today. But it could have some impact on the macaddress matching.. can you try with a different matching condition, like name or driver? Looks like it doesn't find the interface here. I need to double-check that.

@TheRealFalcon
Copy link
Member

@slyon , doh, sorry. This was after I tore down a previous container and forgot to update my network configuration. Once I corrected the MAC address, it worked for me.

@TheRealFalcon
Copy link
Member

TheRealFalcon commented Apr 16, 2024

Looks like cloud-init is still having issues first boot. During our init-local timeframe, we write out netplan config, then call netplan generate. After boot is finished, checking the systemd-networkd-wait-online.service, systemd says we need to call daemon-reload.

This can be simulated in a container by running cloud-init clean --logs; rm /etc/netplan/50-cloud-init.yaml, then reboot.

After boot:

root@asdf:~# ls /run/systemd/generator/
getty.target.wants  local-fs.target.wants  netplan.stamp

root@asdf:~# systemctl status systemd-networkd-wait-online.service 
Warning: The unit file, source configuration file or drop-ins of systemd-networkd-wait-online.service changed on disk. Run 'systemctl daemon-reload' to reload units.
○ systemd-networkd-wait-online.service - Wait for Network to be Configured
     Loaded: loaded (/usr/lib/systemd/system/systemd-networkd-wait-online.service; enabled; preset: enabled)
    Drop-In: /run/systemd/system/systemd-networkd-wait-online.service.d
             └─10-netplan.conf
     Active: inactive (dead)
  Condition: start condition unmet at Tue 2024-04-16 21:40:25 UTC; 54s ago
             └─ ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service was not met
       Docs: man:systemd-networkd-wait-online.service(8)

Apr 16 21:40:25 asdf systemd[1]: systemd-networkd-wait-online.service - Wait for Network to be Configured was skipped because of an unmet condition check (ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait>

root@asdf:~# systemctl daemon-reload

root@asdf:~# ls /run/systemd/generator/
getty.target.wants  local-fs.target.wants  multi-user.target.wants  netplan.stamp  network-online.target.wants
root@asdf:~# 

@daniloegea
Copy link
Contributor

Another thing I've found is that -wait-online blocks if the interface doesn't exist. So if you have interfaces in your YAMLs that are not present in the system it will block now. The previous behavior is to not block.

@slyon slyon force-pushed the s-n-wait-online branch 5 times, most recently from 98d7f8a to 931c634 Compare April 17, 2024 12:10
@slyon slyon requested a review from daniloegea April 17, 2024 12:16
@slyon
Copy link
Collaborator Author

slyon commented Apr 17, 2024

Thanks for all of your testing and comments!

I put all fixes into follow-up commits, so it can be reviewed more easily. @daniloegea PTAL at the most recent commits.
I've built the current version in my PPA as ~ppa6, and it is fixing:

  • no blocking on loopback interface
  • no blocking on non-existing interfaces
  • no need for manual daemon-reload in cloud-init "init-local" timeframe
    • that is now covered by the netplan generate CLI, which is being called anyway
  • Interfaces with IP configuration (dhcp4, dhcp6, static IP or link-local IP) will be listed as -i IFACE:degraded
    • Interfaces without IP configuration will be listed as -i IFACE:carrier, to avoid blocking on non-existing IPs
  • Fixed typos
  • no fiddling with systemd-networkd-wait-online.service.d/10-netplan.conf override when no networkd-based NetDefs are defined in Netplan's configuration.
    • We do not enable systemd-networkd.service in this case, so we should not modify s-n-wait-online.service either.
  • Included integrationtest/autopkgtest, checking for MAC address matching, too, on a real system/VM
    • see ethernets.py -> test_systemd_networkd_wait_online

@TheRealFalcon I addressed the daemon-reload failure, and your reproducer seems to be fine now. Would this current state be acceptable to be uploaded into noble-proposed for cloud-init?

root@well-pug:~# cloud-init clean --logs; rm /etc/netplan/50-cloud-init.yaml
root@well-pug:~# reboot

$ lxc shell well-pug 

root@well-pug:~# ls /run/systemd/generator/
getty.target.wants  local-fs.target.wants  multi-user.target.wants  netplan.stamp  network-online.target.wants
root@well-pug:~# systemctl status systemd-networkd-wait-online.service 
● systemd-networkd-wait-online.service - Wait for Network to be Configured
     Loaded: loaded (/usr/lib/systemd/system/systemd-networkd-wait-online.service; enabled; preset: enabled)
    Drop-In: /run/systemd/system/systemd-networkd-wait-online.service.d
             └─10-netplan.conf
     Active: active (exited) since Wed 2024-04-17 13:05:40 UTC; 18s ago
       Docs: man:systemd-networkd-wait-online.service(8)
    Process: 301 ExecStart=/lib/systemd/systemd-networkd-wait-online -i eth0:degraded (code=exited, status=0/SUCCESS)
   Main PID: 301 (code=exited, status=0/SUCCESS)
        CPU: 6ms

Apr 17 13:05:39 well-pug systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured.
Apr 17 13:05:40 well-pug systemd[1]: Finished systemd-networkd-wait-online.service - Wait for Network to be Configured.

Copy link
Contributor

@daniloegea daniloegea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was doing some testing based on you comments and I noticed that dhcp is not considered to set the interface as degraded. So if I disabled link-local and enable dhcp4 it will be set to carrier. Is that expected?

@@ -1460,29 +1474,41 @@ _netplan_netdef_write_networkd(
gboolean
_netplan_networkd_write_wait_online(const NetplanState* np_state, const char* rootdir)
{
// Set of all current network interfaces, potentially non yet renamed
GHashTable* system_interfaces = g_hash_table_new_full(g_str_hash, g_str_equal, g_free, NULL);
_netplan_query_system_interfaces(system_interfaces);
Copy link

@blackboxsw blackboxsw Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain here about the impact of ignoring devices not currently present in /sys/class/net during systemd generator timeframe. By only considering interfaces already present in /sys/class/net, are we exposed to scenarios where the device drivers for the specified device haven't loaded yet (in initramfs-less environments without the proper drivers compiled into the kernel)? This reminds me of this cloud-init issue canonical/cloud-init#4451 for minimal images. In the event that we don't have device drivers loaded (from initramfs or kernel static compiled), is it possible netplan's generatornetplan generate may not see the device in /sys/class/net via if_nameindex and systemd-networkd-wait-online will be disabled right? Maybe in this case, we prescribe that initramfs/kernel must have the desired device drivers in order to expect systemd-networkd-wait-online to block properly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A similar (the same?) issue will happen with virtual devices. By the time the generator runs they will not be created yet so they will not be in the list.

Copy link

@blackboxsw blackboxsw Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

James corrected me, this isn't systemd generator timeframe, but the point in time when netplan generate is called. For cloud-init init-local boot stage, I think we may still be exposed to this race where cloud-init local gets in and invokes netplan generate possibly before supplemental device drivers have loaded. So this may still be a concern.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! That's a very interesting case.. I just pushed more commits to not require virtual devices (bridges, bonds, dummys, ...) to be available in /sys/class/net just yet. Those will be created later.

For physical devices that are not yet loaded, I don't know. I don't think there's anything that can be done as part of this PR and such interfaces might be ignored by systemd-networkd-wait-online.

A longer-term solution might be splitting up Netplan into a sytemd-generator and a systemd-service (running Before=network-pre.target, similar to systemd-network-generator.service), which we are tracking as spec "FO165 - Netplan generator architecture".

Do you think this should be a blocker?

Copy link

@blackboxsw blackboxsw Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this retains the same gap/issue cloud-init and systemd-networkd-wait-online are already exposed to, so I don't think this needs to be fixed in this PR. The stance we've taken already Ubuntu minimal case was to recompile the kernel to include the module built-ins that are required to support virtio-net to ensure they are loaded in time for network generation and network-online for "initrdless" boots. It feels "ok" to state that if you have a critical device driver needed in early boot for your cloud image and you are booting without initrd, you would either need to statically compile the necessary device drivers into that kernel, or use initrd and provide any supplemental drivers needed earlier boot. This may just be something we take into consideration for future behavior though, and thinking of how to approach this type of concern later in F0165 probably makes the most sense at the moment.

@slyon
Copy link
Collaborator Author

slyon commented Apr 17, 2024

Added more commits fixing (uploaded as ~ppa7):

  • considering dhcp4: true and dhcp6: true for "degraded" operational state.
  • do not require virtual devices to be created already

@daniloegea
Copy link
Contributor

Thanks, Lukas. The cases I've found seem to be fixed now.

@slyon
Copy link
Collaborator Author

slyon commented Apr 18, 2024

Added a tiny fix to recognize that we do not configure link-local IPs on bridge/bond members. (and of course, a codestyle typo as a follow-up ;-))

@slyon slyon force-pushed the s-n-wait-online branch from 347d3fd to fbf022a Compare April 18, 2024 12:18
@slyon slyon added the stable Might be merged in a stable branch label Apr 18, 2024
slyon added 11 commits April 24, 2024 12:41
…aces only

Skip s-n-wait-online if we don't have any non-optional interfaces, using a
"ConditionPathIsSymbolicLink=" checking Netplan's s-n-wait-online.service
enablement symlink.

This is in favor to RequiredForOnline=yes as the behavior of upstream (pure)
systemd-networkd-wait-online.service is not mean to be used in this way.

If "RequiredForOnline=no" sd-networkd-wait-online will fully ignore the
corresponding interface and it will block/delay network-online.target if
no interfaces are "RequiredForOnline=yes" at all.

FR-7246
…nal state for interfaces without IP configuration
@slyon slyon force-pushed the s-n-wait-online branch from fbf022a to 1cb2129 Compare April 24, 2024 10:47
@slyon slyon merged commit eff665f into canonical:main Apr 24, 2024
14 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stable Might be merged in a stable branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants