Introduce sled-agent-config-reconciler skeleton #8063
Conversation
```rust
use sled_storage::config::MountConfig;
use tufaceous_artifact::ArtifactHash;

#[allow(async_fn_in_trait)] // TODO confirm this makes sense and explain
```
Haven't seen this before - is there a good reason to not use:

- async-trait, or
- To follow the advice of https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lint/async_fn_in_trait/static.ASYNC_FN_IN_TRAIT.html, and write out `-> impl Future` instead of `async fn`?
Not sure about good reason, but there are reasons:

- async-trait boxes the futures returned in traits; IIUC built-in async trait methods are not boxed.
- We're in the case mentioned at the bottom of that link:

  > if it is only used in single-threaded contexts that do not care whether the returned Future implements Send, then the lint may be suppressed

If we replaced these with `-> impl Future`, we wouldn't need to add any bounds, because our consumption of this trait never does anything fancy like `tokio::spawn()`ing them; it just calls them inline inside other already-async functions. (I believe this is true but haven't gotten far enough along the followup work to confirm it, hence the comment to come back and explain it.)
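To make the tradeoff above concrete, here's a hedged sketch (not the PR's actual trait; all names are illustrative): `async fn` in a trait is sugar for a method returning `impl Future`, and neither form puts a `Send` bound on the returned future. The tiny hand-rolled poller stands in for "calling them inline inside other already-async functions" without a runtime.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

#[allow(async_fn_in_trait)]
trait Facilities {
    // These two declarations are (nearly) equivalent desugarings:
    async fn ping(&self) -> u32;
    fn ping_desugared(&self) -> impl Future<Output = u32>;
}

struct Sled;

impl Facilities for Sled {
    async fn ping(&self) -> u32 {
        42
    }
    // Implementors may still write `async fn` for the desugared form.
    async fn ping_desugared(&self) -> u32 {
        42
    }
}

// Minimal no-op waker machinery, enough to poll an immediately-ready
// future once without pulling in an async runtime.
fn wake_noop(_: *const ()) {}
fn waker_clone(_: *const ()) -> RawWaker {
    RawWaker::new(std::ptr::null(), &VTABLE)
}
static VTABLE: RawWakerVTable =
    RawWakerVTable::new(waker_clone, wake_noop, wake_noop, wake_noop);

fn poll_once<F: Future>(fut: F) -> Option<F::Output> {
    let waker =
        unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    match pin!(fut).poll(&mut cx) {
        Poll::Ready(v) => Some(v),
        Poll::Pending => None,
    }
}

fn main() {
    assert_eq!(poll_once(Sled.ping()), Some(42));
    assert_eq!(poll_once(Sled.ping_desugared()), Some(42));
    println!("ok");
}
```

Note there's no `+ Send` anywhere here; that's exactly what the lint is warning could bite later.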
Hmmm, I suspect not having the `Send` bound might prove to be problematic in the future.

https://crates.io/crates/trait-variant can help put those `Send` bounds on. Due to the way auto trait leakage works, implementors don't have to use a proc macro -- they can just write `async fn` and everything works great. (This is the same trick that Dropshot API traits use.)
re trait-variant vs async-trait: it depends on whether you need static or dynamic dispatch. There's also https://github.com/spastorino/dynosaur, which works with async traits for dynamic dispatch, but I haven't looked closely at it.
> Hmmm, I suspect not having the Send bound might prove to be problematic in the future.

I'd be mildly surprised; the type implementing this trait itself has to be `Send`, so we can give it to a long-running tokio task (the main reconciler loop), but the futures returned by all these methods execute within the context of that task and are never themselves spawned. I think that means they're fine not to be `Send`, right?
> re trait-variant vs async-trait, depends on if you need static vs dynamic dispatch

We don't need dynamic dispatch in this case; we can spawn a tokio task that takes a generic without needing to bubble that generic out into the handle that talks to the task.
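A hedged sketch of that pattern, using std threads in place of tokio tasks (all names here are illustrative, not the PR's): the task is generic over the trait, but the handle the rest of the code holds is a plain, non-generic channel sender, so the generic never escapes.

```rust
use std::sync::mpsc;
use std::thread;

trait Facilities: Send + 'static {
    fn apply(&self, config: &str) -> String;
}

enum Request {
    Apply { config: String, reply: mpsc::Sender<String> },
}

// Non-generic handle: callers never see `T`.
struct Handle {
    tx: mpsc::Sender<Request>,
}

impl Handle {
    fn apply(&self, config: &str) -> String {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tx
            .send(Request::Apply { config: config.to_string(), reply: reply_tx })
            .unwrap();
        reply_rx.recv().unwrap()
    }
}

// `T` exists only inside the spawned task; it never appears in `Handle`.
fn spawn<T: Facilities>(facilities: T) -> Handle {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        for req in rx {
            match req {
                Request::Apply { config, reply } => {
                    let _ = reply.send(facilities.apply(&config));
                }
            }
        }
    });
    Handle { tx }
}

struct RealFacilities;
impl Facilities for RealFacilities {
    fn apply(&self, config: &str) -> String {
        format!("applied {config}")
    }
}

fn main() {
    let handle = spawn(RealFacilities);
    assert_eq!(handle.apply("gen-1"), "applied gen-1");
    println!("ok");
}
```

Static dispatch inside the task, no `dyn` and no boxing in the handle.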
> Hmmm, I suspect not having the Send bound might prove to be problematic in the future.

> I'd be mildly surprised;

You were right and I am mildly surprised; I don't think I'd ever seen this error before:

```
error: future cannot be sent between threads safely
...
  = help: within `impl Future<Output = ()>`, the trait `Send` is not implemented for `impl Future<Output = Result<(), anyhow::Error>>`
note: future is not `Send` as it awaits another future which is not `Send`
```

I'll update these to use `-> impl Future<_> + Send`. (I'm also going to change the generic error bound types to `anyhow::Error` for ease of forwarding those back out of this crate.)
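A hedged sketch of that fix (illustrative names, not the PR's code): declare the trait method as returning `impl Future<Output = _> + Send`, and rely on auto-trait leakage so implementors can still just write `async fn`. The `assert_send` helper is a compile-time check that the returned future really is `Send`, i.e. it could be handed to something like `tokio::spawn`.

```rust
use std::future::Future;

trait Facilities {
    // The `+ Send` bound here is the change being described above.
    fn reconcile(&self) -> impl Future<Output = Result<(), String>> + Send;
}

struct Sled;

impl Facilities for Sled {
    // No extra annotation needed: this async fn's future is `Send`
    // because everything it captures and awaits is `Send`.
    async fn reconcile(&self) -> Result<(), String> {
        Ok(())
    }
}

// Compiles only if `T: Send`, proving the bound actually holds.
fn assert_send<T: Send>(_: &T) {}

fn main() {
    let sled = Sled;
    let fut = sled.reconcile();
    assert_send(&fut);
    println!("ok");
}
```

If the body of `reconcile` held a non-`Send` value across an `.await`, this would fail to compile with exactly the "future cannot be sent between threads safely" error quoted above.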
```rust
/// Called once time is synchronized.
// TODO-cleanup should we do this work ourselves instead?
async fn on_time_sync(&self);
```
To confirm: This is called to inform the sled agent that time has been synchronized? or is it an async function that returns when time has been synchronized?
The former; I'll clarify the doc comment. (Ultimately the `TODO` comment is probably more relevant: the reconciler should just do the work once it realizes time has sync'd for the first time. But I'm trying my best to keep the overall set of changes limited whenever I can.)
Expanded in 01ebe83
```
@@ -0,0 +1,3 @@
// This Source Code Form is subject to the terms of the Mozilla Public
```
I suppose I'm assuming we'll add more in subsequent PRs, but just confirming this file isn't detritus?
Correct.
```rust
    unimplemented!()
}

pub async fn validate_artifact_config(
```
To confirm: This won't write the artifact config, but provides an interface to access it?
Correct; this is related to your question below about race conditions - will expand there.
Added a doc comment and tried to summarize the concurrency issue in 01ebe83
```rust
#[derive(Debug)]
enum LedgerTaskRequest {
    WriteNewConfig {
```
Is "validate_artifact_config" going to happen outside the task system? I'm noticing there's no request for it
Yep, realized that as soon as I started working in this module 😓. Just an oversight; there should be a request for it.
```rust
    unimplemented!()
}

pub async fn inventory(
```
You can punt on this question if it's answered in a follow-up PR, but with no rustdocs on a pub method, I figured I'd ask: why is the `zpools` argument being passed here?
Yeah, this is a little messy. It's because the implementation of this is basically

omicron/sled-storage/src/manager.rs, lines 1165 to 1181 in bac3385:

```rust
let datasets_of_interest = [
    // We care about the zpool itself, and all direct children.
    zpool.to_string(),
    // Likewise, we care about the encrypted dataset, and all
    // direct children.
    format!("{zpool}/{CRYPT_DATASET}"),
    // The zone dataset gives us additional context on "what zones
    // have datasets provisioned".
    format!("{zpool}/{ZONE_DATASET}"),
];
info!(log, "Listing datasets within zpool"; "zpool" => zpool.to_string());
Zfs::get_dataset_properties(
    datasets_of_interest.as_slice(),
    WhichDatasets::SelfAndChildren,
)
.map_err(Error::Other)
```
FWIW, I think this method isn't really `pub`, because the type it's on isn't directly exposed outside the crate. But it's still a good question! Once the rest of the work starts to land, I wonder if there's a way to not need to pass this argument.
```rust
/// Validate that a new artifact config is legal (i.e., it doesn't remove
/// any artifact hashes in use by the currently-ledgered sled config).
```
Is there any concern about a TOCTTOU here? E.g., we check if things are okay, then someone else updates the sled config, then we write the (now bad) artifact config?
Yeah, so ultimately the sled config ledgering task and the artifact store both need to be able to talk to each other:

- When a new artifact config comes in, the artifact store needs to confirm with the sled config ledgering task that the new artifact config is okay.
- When a new sled config comes in, the sled config ledgering task needs to confirm with the artifact store that all the artifacts that new sled config references actually exist.

And both of these cross-checks have to be safe if both types of new config arrive concurrently (and possibly with contradictory changes).

The way I'm approaching this is:

1. The sled config ledgering task is always the one that does both kinds of confirmation.
2. The artifact store is given a handle that lets it send the `validate_artifact_config` request to the sled-config ledgering task and wait for a response.
3. The sled-config ledgering task is given a handle to the artifact store (via it implementing `SledAgentArtifactStore`, which has the single method `validate_artifact_exists_in_storage()`).

Because of item 1, we serialize any potential concurrency: if the sled-config ledgering task is busy checking a new sled config, any request from the artifact store (item 2) will block; similarly, if the sled-config ledgering task is busy validating a new artifact config via a request from the artifact store, any requests to set a new sled-config will block.
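A hedged sketch of that serialization argument, with std threads standing in for tokio tasks and `u64`s standing in for artifact hashes (all types and names are illustrative, not the PR's): one ledger task owns the current config and drains a single request queue, so "set sled config" and "validate artifact config" can never interleave.

```rust
use std::collections::HashSet;
use std::sync::mpsc;
use std::thread;

enum Request {
    SetSledConfig { needed_artifacts: HashSet<u64>, reply: mpsc::Sender<bool> },
    ValidateArtifactConfig { artifacts: HashSet<u64>, reply: mpsc::Sender<bool> },
}

fn spawn_ledger_task() -> mpsc::Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        let mut needed: HashSet<u64> = HashSet::new();
        // Requests are handled strictly in order; each reply is sent
        // before the next request is dequeued, which is the whole
        // serialization argument.
        for req in rx {
            match req {
                Request::SetSledConfig { needed_artifacts, reply } => {
                    needed = needed_artifacts;
                    let _ = reply.send(true);
                }
                Request::ValidateArtifactConfig { artifacts, reply } => {
                    // Legal iff it keeps every artifact the currently
                    // ledgered sled config still needs.
                    let ok = needed.is_subset(&artifacts);
                    let _ = reply.send(ok);
                }
            }
        }
    });
    tx
}

fn main() {
    let ledger = spawn_ledger_task();

    let (tx, rx) = mpsc::channel();
    ledger
        .send(Request::SetSledConfig {
            needed_artifacts: HashSet::from([1, 2]),
            reply: tx,
        })
        .unwrap();
    assert!(rx.recv().unwrap());

    // An artifact config that drops artifact 2 must be rejected.
    let (tx, rx) = mpsc::channel();
    ledger
        .send(Request::ValidateArtifactConfig {
            artifacts: HashSet::from([1]),
            reply: tx,
        })
        .unwrap();
    assert!(!rx.recv().unwrap());
    println!("ok");
}
```

Whichever request is enqueued first wins; the other sees the updated state, which is the property item 1 is buying.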
I am assuming in all of this that it's sufficient to only check with the ledgering task. It's possible the reconciliation task that acts on the ledger might still be acting on an old config; I think this is exceedingly unlikely in practice and would sort itself out on its own with just some spurious errors if it did happen, but am not sure how to confirm that. An example sequence:

1. We get a new sled config (generation `N`) that adds a zone. The artifact for the zone is present in the artifact store, so we ledger the new config. This wakes up the reconciler task, but imagine it gets blocked before it actually starts the zone.
2. We get a new sled config (generation `N+1`) that removes the zone added in generation `N`. The new config is ledgered. This will wake up the reconciler task again once it finishes reconciling generation `N`.
3. We get a new artifact config that removes the artifact needed for the zone added in `N` and removed in `N+1`. The artifact store checks with the ledgering task; this artifact is not needed in the currently-ledgered config (`N+1`), so we accept the new config. The artifact store removes this artifact.
4. The reconciler task gets unblocked and resumes trying to reconcile generation `N`. Launching the new zone will fail, because the artifact is now gone. (This seems bad, but I think it's okay? It fails because we've already ledgered a new config that removes that zone.)
5. The reconciler task finishes reconciling generation `N`, and immediately sees that generation `N+1` has already arrived, so starts reconciling that. This wants to remove the zone that we failed to start anyway, which is fine.
(cc @iliana to double check all of this too!)
That flow seems fine to me, assuming that failing to launch that zone is actually okay. Does that stop other zones from launching? Does it matter if we're about to reconcile `N+1`?
It will not stop other zones from launching, no. I guess my only concern would be how the failure manifests itself if we get some really bad timing (e.g., we issue the `zoneadm install ...` and then the artifact store removes the file while it's installing). Presumably this will either succeed (if it had already progressed far enough to have read / copied / whatever the file) or fail similarly to how it would if the file weren't there at all; if that's right, this should all be okay.
It seems like it would be possible to deadlock with both steps 2 and 3 waiting on each other if they are 2 tasks talking to each other over blocking channels. Probably want to use `try_send` for this case because of that.
> It seems like it would be possible to deadlock with both steps 2 and 3 waiting on each other if they are 2 tasks talking to each other over blocking channels. Probably want to use `try_send` for this case because of that.

Ooh, great point I hadn't thought about. I was already planning to use `try_send`, but not with a channel depth of 0, so I don't think that helps (since `try_send` could succeed in enqueuing the request, then we block waiting for the response, but the other task gets blocked trying to talk to us while processing an earlier request).
However, I think because of the way the artifact store is implemented, we're actually okay. Linking to changes on #8064:

The `ArtifactStore` implementation for validating that an artifact exists (i.e., the one the reconciler task calls when checking a new sled config) calls `ArtifactStore::get()`:
omicron/sled-agent/src/sled_agent.rs, lines 1377 to 1383 in abd7542:

```rust
async fn validate_artifact_exists_in_storage(
    &self,
    artifact: ArtifactHash,
) -> Result<(), Self::ArtifactExistsValidationError> {
    self.0.get(artifact).await?;
    Ok(())
}
```
`get` calls `get_from_storage`, which does filesystem operations directly (i.e., no channels involved):
omicron/sled-agent/src/artifact_store.rs, lines 245 to 289 in abd7542:

```rust
/// Open an artifact file by hash.
///
/// Also the GET operation (served by Repo Depot API).
///
/// We try all datasets, returning early if we find the artifact, logging
/// errors as we go. If we don't find it we return the most recent error we
/// logged or a NotFound.
pub(crate) async fn get(
    &self,
    sha256: ArtifactHash,
) -> Result<File, Error> {
    Self::get_from_storage(&self.storage, &self.log, sha256).await
}

/// Open an artifact file by hash from a storage handle.
///
/// This is the same as [ArtifactStore::get], but can be called with only
/// a storage implementation.
pub(crate) async fn get_from_storage(
    storage: &T,
    log: &Logger,
    sha256: ArtifactHash,
) -> Result<File, Error> {
    let sha256_str = sha256.to_string();
    let mut last_error = None;
    for mountpoint in storage.artifact_storage_paths().await {
        let path = mountpoint.join(&sha256_str);
        match File::open(&path).await {
            Ok(file) => {
                info!(
                    &log,
                    "Retrieved artifact";
                    "sha256" => &sha256_str,
                    "path" => path.as_str(),
                );
                return Ok(file);
            }
            Err(err) if err.kind() == ErrorKind::NotFound => {}
            Err(err) => {
                log_and_store!(last_error, &log, "open", path, err);
            }
        }
    }
    Err(last_error.unwrap_or(Error::NotFound { sha256 }))
}
```
I think this means we can't deadlock here, right? `ArtifactStore` can block waiting for a request/response from the reconciler task, but not vice versa. I'm not sure how fragile it is to rely on this, though. Thoughts?
I think with `try_send` this will always be fine. The key point is that the sender won't ever block from receiving new requests. Even with a channel this is the case, and the queued messages will eventually be processed. This is definitely true in the case of the artifact store, since it bottoms out without another remote call.
If you want to avoid this communication pattern altogether, I think you need to create a tree where there's a third task that both tasks reach out to. That may be the right thing to do, but it certainly shouldn't hold up this PR.
```rust
// Using `!self.is_boot_disk` as the first part of the ID allows us to
// guarantee we always iterate over the internal disks with the current
// boot disk first.
```
is this relying on "false" being ordered before "true"? so for the boot disk, "!true" is "false" and comes first?
Yeah. Gross, right? That's why the comment about adding tests. Maybe this should be less clever and use a more explicit type?
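For the record, a minimal sketch of the property being relied on (the tuple type here is illustrative, not the PR's actual ID type): `bool`'s `Ord` impl orders `false` before `true`, so keying on `!is_boot_disk` sorts the boot disk first.

```rust
fn main() {
    // The property the comment relies on:
    assert!(false < true);

    let mut disks = vec![("m2-a", false), ("m2-boot", true), ("m2-b", false)];

    // Sort key: `!is_boot_disk` is `false` for the boot disk, so it
    // compares least and the boot disk comes first.
    disks.sort_by_key(|&(name, is_boot_disk)| (!is_boot_disk, name));

    assert_eq!(disks[0].0, "m2-boot");
    println!("{disks:?}");
}
```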
Added an explicit ID type and a proptest confirming the sorting in 608905a
Great work. The split in abstractions looks right to me.
```diff
@@ -1535,7 +1540,7 @@ mod tests {
         logctx.log.clone(),
         tokio::sync::mpsc::channel(1).1,
     );
-    let tempdir = TempDir::new().unwrap();
+    let tempdir = Utf8TempDir::new().unwrap();
```
Thanks for cleaning this up here.
```
@@ -0,0 +1,3 @@
// This Source Code Form is subject to the terms of the Mozilla Public
```
Kinda odd to include an empty file in this PR. I guess it's just to show that this code is coming, similar to `internal_disks.rs`.
```
@@ -0,0 +1,80 @@
// This Source Code Form is subject to the terms of the Mozilla Public
```
Great comment. Thank you!
```rust
/// Spawn the primary config reconciliation task.
///
/// This method can effectively only be called once; any subsequent calls
```
Usually, when I come across something like this, I return an error and then unwrap it at the caller. That lets us know it's happening immediately. I get that this can introduce a crash in production if called from some obscure code path, and that's why logging and ignoring seems "safer".

The problem with ignoring and just logging, however, is that the second call may introduce a change to the parameters which will now be ignored, where the user may have actually wanted to remove the first call.

Both of these seem relatively unlikely, but I usually want to know right away if a call I've made is going to silently fail. In any case, it's not super important, so I'll default to your judgement about whether to keep it the way it is or change things.
Another option I just thought of is to return an opaque token with the call to `new` that gets consumed when `spawn_reconciliation_task` is called. Make it impossible to create this token outside this module. You could actually include the reconciler task dependencies inside the token, and have them unwrapped on the call to `spawn_reconciliation_task`. This would prevent the need for the `reconciler_task_dependencies` field altogether and make it statically impossible to call `spawn_reconciliation_task` twice. The downside of this approach is having to carry the token outside the handle. You could work around this by keeping the `reconciler_task_dependencies` field, and having the caller retrieve it in a token if it's available, and then pass that into `spawn_reconciliation_task`. However, this seems morally equivalent to returning an error.
> You could actually include the reconciler task dependencies inside the token, and have them unwrapped on the call to `spawn_reconciliation_task`. This would prevent the need for the `reconciler_task_dependencies` field altogether and make it statically impossible to call `spawn_reconciliation_task` twice. The downside of this approach is having to carry the token outside the handle.
I like this idea a lot. Let me see how painful it is to carry the token around inside sled-agent.
Awesome!
```rust
use tufaceous_artifact::ArtifactHash;

#[allow(async_fn_in_trait)] // TODO confirm this makes sense and explain
pub trait SledAgentFacilities: Send + 'static {
```
This is a great little abstraction.
This PR adds a new `sled-agent-config-reconciler` crate. The majority of its public API is present here, but most of the guts of it is stubbed out via `unimplemented!()`. This is the first PR of many in a big chunk of work, and I'm not sure how best to divide it, but my current plan is:

1. This PR: the crate skeleton, with the `unimplemented!()` stubs.
2. A PR integrating the new crate into sled-agent.
3. A group of PRs fleshing out the `unimplemented!()` bits inside the reconciler.

This PR is safe to land on `main`: the only changes to sled-agent proper are that the `dump_setup` module has moved to this crate (with some very minor changes that don't affect behavior), so sled-agent calls it from this crate instead of from its own submodule. The rest of the crate with all the `unimplemented!()` paths is not called by anyone.

The PRs in group 3 will similarly be safe to land: they will only flesh out the `unimplemented!()`-but-not-called code paths.

PR 2 (which I'll open shortly) will not pass tests until all the PRs from group 3 have landed, but I'd like to open them in this order to get eyes on that integration work, because it may inform changes we want to make to the config-reconciler crate's API (which would be easier now than after landing everything in group 3).