[nexus] track "intended" instance states #8053
Conversation
When a request to a sled-agent to change the state of an instance fails in a way that indicates the instance is no longer present on that sled, the instance record is moved to the `Failed` state. If an instance is in the `Failed` state and auto-restart is enabled for that instance, then the control plane will always attempt to restart it. Both of these behaviors are fine in a vacuum, but in conjunction, they result in somewhat surprising behavior when an instance which has failed is asked to stop. If the request that the sled-agent stop the instance fails in a way that indicates that the instance is gone, then Nexus will move the instance to `Failed` and attempt to restart it.[^1] This is, to use a technical term, wacky. The user asked that an instance not be running; we went and tried to stop it, discovered that it was, in fact, not running, and then we...decided to start it again? That's literally the opposite of the intended behavior, from the user's perspective. I described this in #6809.

This commit fixes this (and lays the groundwork for some other future improvements to instance state management). The approach is as follows:

1. Introduce a new "intended state" column to database instance records, tracking the most recently requested state of the instance. This column will only record "resting" states that can be requested through the API: `Stopped`, `Running`, and `Destroyed`.
2. When handling API requests that change the state of an instance, immediately update the `intended_state` of the instance record before doing anything else (starting sagas, sending requests to sled-agents, etc.).
3. When deciding whether to automatically restart a `Failed` instance, consider the intended state as well as other variables, and only restart it if it is supposed to be `Running` (see the sketch below).

Presently, this tracking of requested states is only used for determining whether to auto-restart an instance. However, it may be useful for at least one other thing that kinda bothers me: when an instance is created with the `start: true` field in the `InstanceCreate` params, the API endpoint runs an `instance_create` saga and then, once it completes, also runs an `instance_start` saga. Both operations are performed in sagas, so if the Nexus instance that is handling the API request dies unexpectedly, the user's intent should be carried out...except this isn't _quite_ the case. If the Nexus dies while running the `instance_create` saga, or after the `instance_create` saga has completed but before the `instance_start` saga has been created and written to the database, then the instance will be created, but the user intent that the instance be *started* will not be carried out.

This is a bit unfortunate, especially when the instance is created by automation against the Oxide API. A human starting an instance in the console may just notice that the instance has been created and is now in the `Stopped` state, click on the "start" button, and be a little bit annoyed that their instance didn't start automatically when they clicked a checkbox saying it should. But automation against the API might somewhat reasonably expect that if an instance-create request has `start: true`, the instance will be started once it is created, and not try to start it again, resulting in the instance never actually starting. We could fix this using the tracked instance intended states as well. The completion of the `instance-create` saga could "chain" into an `instance-start` saga if the instance's intended state is `Running`, or an RPW could ensure that an `instance-start` saga exists for all instances with that intended state. I haven't implemented that in this PR, but this change lays the groundwork for that as well.

Fixes #6809.

[^1]: Provided that auto-restart is configured for that instance.
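To make the restart rule in step 3 concrete, here is a minimal sketch of the decision as described above. The names (`CurrentState`, `IntendedState`, `should_auto_restart`) are illustrative stand-ins, not the actual Omicron types:

```rust
/// Illustrative stand-in for the instance's externally visible state.
#[derive(Clone, Copy, PartialEq, Eq)]
enum CurrentState {
    Failed,
    Running,
    Stopped,
}

/// The "resting" states a user can request through the API.
#[derive(Clone, Copy, PartialEq, Eq)]
enum IntendedState {
    Running,
    Stopped,
    Destroyed,
}

/// A `Failed` instance is only reincarnated if auto-restart is enabled *and*
/// the most recently requested state was `Running`.
fn should_auto_restart(
    current: CurrentState,
    intended: IntendedState,
    auto_restart_enabled: bool,
) -> bool {
    auto_restart_enabled
        && current == CurrentState::Failed
        && intended == IntendedState::Running
}

fn main() {
    // An instance that failed while it was supposed to be stopped must not
    // be restarted, even with auto-restart enabled...
    assert!(!should_auto_restart(
        CurrentState::Failed,
        IntendedState::Stopped,
        true,
    ));
    // ...but one that was supposed to be running is fair game.
    assert!(should_auto_restart(
        CurrentState::Failed,
        IntendedState::Running,
        true,
    ));
}
```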
```sql
-- this is a bit conservative: we assume the intended state is running if and
-- only if the instance is currently running, because we don't know what the
-- user's desired state was otherwise.
SET LOCAL disallow_full_table_scans = off;

UPDATE omicron.public.instance SET intended_state = CASE
    WHEN state = 'destroyed' THEN 'destroyed'
    WHEN state = 'vmm' THEN 'running'
    ELSE 'stopped'
END;
```
perhaps we could be a bit fancier here if we also considered the VMM's state if the instance has one, but I wasn't sure whether that was easily possible here...if someone better at SQL than I am has suggestions, I'm all ears.
thinking through when this migration would happen in practice, we'd have done the normal parallel-park.sh that kills all the zones, so VMMs could be in ~any state if an instance has a VMM at all.. the issue would be if a VMM were stopping, stopped, or destroyed (and the instance itself was not in those states!), where we'd set the intended state back to running.
i think these cases are all ones we'd have unintentionally auto-restarted when the update finished anyway, so not being fancier here is pretty minor. does that sound right?
that said i think you can do something like

```sql
UPDATE instance SET intended_state = CASE
    WHEN instance.state = 'vmm' AND vmm.state = 'running' THEN 'running'
    ...
END
FROM omicron.public.instance instance
LEFT JOIN omicron.public.vmm vmm ON instance.active_propolis_id = vmm.id
```

but i'm certain that the syntax is wrong in some dumb way and it might not be entirely worth fighting
So, you're correct that at present, we will have already killed all the running VMMs when migrations are being run...I had gotten it into my head that it was worth making sure that the migration could handle a future where we didn't do that, but now that I think about it, it doesn't really make sense, since we wouldn't be upgrading from a future version that updated without killing VMMs...
As an aside, note that I did consider that perhaps we should just move the instance to…
seems good! i'm basically a +1, just the one dbg string that definitely ought to get fixed up first
```rust
// Okay, now let's try restarting the instance, failing it, and then
// asserting it *does* reincarnate.
dbg!(expect_instance_start_ok(client, instance_name).await);
```
i do kinda wonder if it makes sense for the `expect_*` helpers to `trace!("starting {instance_name} and expecting it to work")` so we don't feel a need to get that kinda activity log via `dbg!()`. might even work nicely if we `#[track_caller] fn expect_instance_start_ok()` and used the `Location` to log about where the helpers were being invoked... obviously just tossing ideas out there, `dbg!()` is plenty common and this'd be pretty tangential.
personally, i prefer to use `dbg!` for stuff that the main test function does, rather than stuffing it in the log, because everything that gets printed to stderr is displayed when the test fails, so you don't have to do `looker -f /tmp/test_whatever.0.some` nonsense to get it. but others may disagree with me on this.
oh yeah that's true, i can see not having to fish up stuff from the log being nice. as someone who `vim /tmp/test_whatever.0.blah.log`s, i weigh my aesthetic preferences here very low :)
Not even running it through `looker` first? Courage is an Oxide value, I suppose.
> might even work nicely if we `#[track_caller] fn expect_instance_start_ok()` and used the `Location` to log about where the helpers were being invoked...

should also mention for future reference that `#[track_caller]` doesn't work correctly with `async fn`s, sadly.
oh TIL, ow ow ow.
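For what it's worth, a common workaround is to put `#[track_caller]` on a plain `fn` that captures `Location::caller()` before returning an `async` block. A minimal sketch, assuming a tokio runtime; the helper name and log line just mirror the test helper for illustration:

```rust
use std::future::Future;
use std::panic::Location;

// #[track_caller] on an `async fn` doesn't propagate the caller's location
// into the generated future, but it does work on a plain fn, so we capture
// the Location *before* entering the async block.
#[track_caller]
fn expect_instance_start_ok(
    instance_name: &str,
) -> impl Future<Output = ()> + '_ {
    let caller = Location::caller();
    async move {
        // Location implements Display as "file:line:col".
        println!("{caller}: starting {instance_name} and expecting it to work");
        // ...the actual start-and-assert logic would go here...
    }
}

#[tokio::main]
async fn main() {
    // Logs the file/line of *this* call site, not the helper's internals.
    expect_instance_start_ok("my-instance").await;
}
```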
```rust
dbg!(
    instance_wait_for_state(client, instance_id, InstanceState::Failed)
        .await
);
```
fun fact: apparently github syntax highlighting is criminally misleading in macros? `instance_wait_for_state` is not highlighted at all, so if you're skimming it looks like this is just some variable being printed (unlike function calls outside macros that get purpled). i totally missed that the contents of `dbg!()` are load-bearing "make the test go" stuff...

(i know this highlights reasonably locally so it only really matters if you're looking through github anyway)
```rust
for secs in 0..30 {
    let state = instance_get(client, &instance_url).await;
    assert_eq!(
        state.runtime.run_state,
        InstanceState::Failed,
        "instance transitioned out of Failed (to {}) after {secs} \
         seconds! state: {:#?}",
        state.runtime.run_state,
        state
    );
    assert_eq!(state.runtime.time_last_auto_restarted, None);
    tokio::time::sleep(Duration::from_secs(1)).await;
}
```
kinda nitpicky: is the loop actually necessary? we could `tokio::time::sleep(Duration::from_secs(30))` and not lose that much by taking `state.runtime.time_last_auto_restarted` as the time that auto-restart incorrectly kicked this instance, right? my thought is that the secondly polls only really help if the test would fail, and in those cases we'd go look at the logs, but the interesting Nexus activity would be mixed with a bunch of us asking "Is It Wrong Yet?"
🤷♀️ yeah, maybe it's better to just poll once; this whole bit is quite unfortunate honestly. i thought failing faster was maybe worth it (versus waiting all 30s every time) but i see your point about the logs.
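A minimal sketch of the poll-once alternative being discussed, assuming the same `client`, `instance_url`, `instance_get` helper, and `InstanceState`/`Duration` imports as the loop above; the 30-second window simply mirrors the existing loop:

```rust
// Sleep once for the whole window instead of polling every second, so a
// failure's log context isn't buried under thirty "is it wrong yet?" queries.
tokio::time::sleep(Duration::from_secs(30)).await;
let state = instance_get(client, &instance_url).await;
assert_eq!(
    state.runtime.run_state,
    InstanceState::Failed,
    "instance transitioned out of Failed (to {})! state: {:#?}",
    state.runtime.run_state,
    state
);
// If this were Some(_), it would record when auto-restart incorrectly kicked
// the instance, which is the diagnostic the secondly polling was providing.
assert_eq!(state.runtime.time_last_auto_restarted, None);
```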
This generally LGTM, but I would add one bit: if the instance update saga detects that the instance's current VMM has moved to `Stopped`, I would also change the intent (as part of the broader update) to `Stopped` instead of leaving it in `Running`. The idea is to cover the case where a guest shuts down without going through the API: we may not know what the user intent was in that case (was the user expecting the guest to do this?), but I think leaving the intent unchanged increases the chance that we will later misinterpret the intent in this field and start an instance that previously stopped gracefully. (It also means that omdb won't display a self-stopped instance as having "running" intent: one might ask why such an instance is stopped if the intent says it should be running!)
To be clear, I think the code in this PR behaves as intended; this is just a dose of defensive programming (and the omdb observability improvement).
```rust
// Unfortunately there isn't really a great way to assert "no start saga has
// been started", so we'll do the somewhat jankier thing of waiting a bit
// and making sure that the instance doesn't transition to Starting.
```
How incredibly awful would it be to try to peer into the saga table omdb-style to try to look for one?
Yeah, that's a good thought, I'll make that change as well. I wonder if we want to have a different enum variant to represent that, so that we can distinguish between "stopped due to API request" and "shut down internally" in logs/omdb.
Okay @gjcolombo, I'm looking into adding this this morning. One thing I wanted to bring up is that if we start setting a "guest shutdown" intent when the VMM transitions to `Stopped`…
Alternatively, I think it's probably fine to do that on the "VMM destroyed" transition instead of the "VMM stopped" transition, because in that case, we're already doing an update saga, and we won't change the instance's state to `Stopped`…
If by "do that" you mean "set the intent to Stopped" then I agree!
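To illustrate the tweak being agreed on here, a hypothetical sketch only: the types, the `guest_initiated_shutdown` flag, and `intent_after_update` are illustrative stand-ins rather than the actual update-saga code, and this glosses over how a clean guest shutdown is distinguished from failure-driven VMM teardown:

```rust
// Hypothetical stand-ins for the relevant states.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum IntendedState {
    Running,
    Stopped,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum VmmState {
    Running,
    Destroyed,
}

/// When the update saga observes the active VMM's "destroyed" transition
/// after a clean, guest-initiated shutdown, also rewrite a stale `Running`
/// intent to `Stopped`, so that the auto-restart check (and omdb) won't
/// later read this instance as one that is supposed to be running.
fn intent_after_update(
    prior: IntendedState,
    vmm: VmmState,
    guest_initiated_shutdown: bool,
) -> IntendedState {
    if vmm == VmmState::Destroyed && guest_initiated_shutdown {
        IntendedState::Stopped
    } else {
        prior
    }
}

fn main() {
    // A guest that shut itself down no longer reads as intended-Running...
    assert_eq!(
        intent_after_update(IntendedState::Running, VmmState::Destroyed, true),
        IntendedState::Stopped
    );
    // ...while an instance whose VMM is still up keeps its prior intent.
    assert_eq!(
        intent_after_update(IntendedState::Running, VmmState::Running, false),
        IntendedState::Running
    );
}
```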