Skip to content

[sled-agent] Integrate config-reconciler #8064

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: john/sled-agent-config-reconciler-1
Choose a base branch
from

Conversation

jgallagher
Copy link
Contributor

This PR integrates the new sled-agent-config-reconciler crate with sled-agent. It will not currently pass tests due to the reconciler not being completely implemented, but I'd like to get any feedback on this integration work itself (particularly as it pertains to the API of sled-agent-config-reconciler). See the description of #8063 for more context.

There are a couple serious warts with this PR:

  • The inventory system has not been updated with all the details we need to report for the reconciler. This is a bigger chunk of work because it involves a database migration and touches various bits of Nexus, so I'll do that in a separate PR.
  • This integration removes most uses of the StorageManager (because its functionality is being absorbed into sled-agent-config-reconciler); however, the storage manager also has a rich set of test support. This PR leaves a couple sled-agent submodules using that test support (support-bundle/storage and zone-bundle). In the long run I think it'd be better to rework these (if there are no remaining production uses of StorageManager), but for now I think this is... okay? Feedback welcome.

jgallagher added a commit that referenced this pull request Apr 29, 2025
This is somewhat extracted from #8064, but can be landed independently
and will make some of the followup sled-agent-config-reconciler PRs a
little cleaner.

Fixes #7774.
@@ -34,14 +34,6 @@ enum SledAgentCommands {
#[clap(subcommand)]
Zones(ZoneCommands),

/// print information about zpools
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you expecting that inventory will supplant this info? Or are you planning on replacing this access to the sled agent later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting that inventory would supplant this. (I think maybe it already has, in practice? I definitely only look at inventory when I'm curious about zpools; I don't think I've ever used these omdb subcommands.)

Comment on lines +377 to +378
// TODO this inventory needs to be more robust for the reconciler;
// this is nontrivial so coming in a subsequent PR. For now just
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super-nit, link to PR, so this comment doesn't get lost?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR isn't open yet, so hard to link. 😅 However, I put a stub in RSS on this branch where we need the updated inventory:

// Wait until the config reconciler on the target sled has successfully
// reconciled the config at `generation`.
async fn wait_for_config_reconciliation_on_sled(
&self,
_sled_address: SocketAddrV6,
_generation: Generation,
) -> Result<(), SetupServiceError> {
unimplemented!("needs updated inventory")
}

which guarantees this PR can't pass the helios/deploy check until the inventory work is done.

}

#[cfg(all(test, target_os = "illumos"))]
mod illumos_tests {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're deleting a lot of tests here - do we really not care about these anymore, or do we have coverage elsewehre?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little of column a, little of column b:

  • All the methods these tests are exercising have themselves been deleted, so we're forced to remove these tests too.
  • The functionality we've removed here will live in the config-reconciler now, but the tests won't port over 1-to-1 because the mechanics are changing (e.g., ledgering is largely independent from starting zones now, which is very different from how ensure_all_omicron_zones_persistent() worked). The reconciler will have its own tests for the new mechanics that cover this same functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants