[nexus] track "intended" instance states #8053

Status: Open. Wants to merge 4 commits into base: main.
Conversation

@hawkw (Member) commented Apr 27, 2025

When a request to a sled-agent to change the state of an instance fails in such a way that indicates that the instance is no longer present on that sled, the instance record is moved to the `Failed` state. If an instance is in the `Failed` state and auto-restart is enabled for that instance, then the control plane will always attempt to restart it.

Both of these behaviors are fine in a vacuum, but in conjunction, they result in somewhat surprising behavior in the case where an instance which has failed is asked to stop. If the request that the sled-agent stop the instance fails in a way that indicates that the instance is gone, then Nexus will move the instance to `Failed` and attempt to restart it.[^1] This is, to use a technical term, wacky. The user asked that an instance not be running; we went and tried to stop it, discovered that it was, in fact, not running, and then we...decided to start it again? That's literally the opposite of the intended behavior, from the user's perspective. I described this in #6809.

This commit fixes this (and lays the groundwork for some other future improvements to instance state management). The approach is as follows:

1. Introduce a new "intended state" column to database instance records, tracking the most recently requested state of the instance. This column will only record "resting" states that can be requested through the API: `Stopped`, `Running`, and `Destroyed`.
2. When handling API requests that change the state of an instance, immediately update the `intended_state` of the instance record before doing anything else (starting sagas, sending requests to sled-agents, etc.).
3. When deciding whether to automatically restart a `Failed` instance, consider the intended state as well as other variables, and only restart it if it is supposed to be `Running`.

Presently, this tracking of requested states is only used for determining whether to auto-restart an instance. However, it may be useful for at least one other thing that kinda bothers me: when an instance is created with the `start: true` field in the `InstanceCreate` params, the API endpoint runs an `instance_create` saga and then, once it completes, also runs an `instance_start` saga. Both operations are performed in sagas, so if the Nexus instance that is handling the API request dies unexpectedly, the user's intent should be carried out...except this isn't _quite_ the case. If the Nexus dies while running the `instance_create` saga, or after the `instance_create` saga has completed but before the `instance_start` saga has been created and written to the database, then the instance will be created, but the user intent that the instance be *started* will not be carried out.

This is a bit unfortunate, especially when the instance is created by automation against the Oxide API. A human starting an instance in the console may notice that the instance has been created and is now in the `Stopped` state, click the "start" button, and just be a little bit annoyed that their instance didn't start automatically when they clicked a checkbox saying it should. But automation against the API might somewhat reasonably expect that if an instance-create request has `start: true`, the instance will be started once it is created, and not try to start it again, resulting in the instance never actually starting. We could fix this using the tracked instance intended states as well: the completion of the `instance-create` saga could "chain" into an `instance-start` saga if the instance's intended state is `Running`, or an RPW could ensure that an `instance-start` saga exists for all instances with that intended state. I haven't implemented that in this PR, but this change lays the groundwork for it as well.

Fixes #6809.

[^1]: Provided that auto-restart is configured for that instance.
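The auto-restart decision in step 3 can be sketched as a small, hypothetical model. These enums and the helper function are illustrative only (Omicron's real types live elsewhere and carry much more state); the point is just that the `Failed` check alone is no longer sufficient:

```rust
/// Illustrative model only — not Omicron's actual instance-state types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Running,
    Stopped,
    Failed,
    Destroyed,
}

/// The new "intended state" column records only the resting states that can
/// be requested through the API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IntendedState {
    Running,
    Stopped,
    Destroyed,
}

/// Only reincarnate a `Failed` instance when auto-restart is enabled *and*
/// the most recently requested state was `Running`.
fn should_auto_restart(
    state: InstanceState,
    intended: IntendedState,
    auto_restart_enabled: bool,
) -> bool {
    auto_restart_enabled
        && state == InstanceState::Failed
        && intended == IntendedState::Running
}

fn main() {
    // An instance that failed while it was meant to be Running restarts...
    assert!(should_auto_restart(
        InstanceState::Failed,
        IntendedState::Running,
        true
    ));
    // ...but one that failed during a user-requested stop stays down,
    // which is the fix for the surprising behavior described above.
    assert!(!should_auto_restart(
        InstanceState::Failed,
        IntendedState::Stopped,
        true
    ));
}
```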

@hawkw hawkw requested review from iximeow and gjcolombo April 27, 2025 17:59
@hawkw hawkw self-assigned this Apr 27, 2025
Comment on lines +3 to +11
-- this is a bit conservative: we assume the intended state is running if and
-- only if the instance is currently running, because we don't know what the
-- user's desired state was otherwise.
SET LOCAL disallow_full_table_scans = off;
UPDATE omicron.public.instance SET intended_state = CASE
WHEN state = 'destroyed' THEN 'destroyed'
WHEN state = 'vmm' THEN 'running'
ELSE 'stopped'
END;
@hawkw (Member, Author):
perhaps we could be a bit fancier here if we also considered the VMM's state if the instance has one, but I wasn't sure whether that was easily possible here...if someone better at SQL than I am has suggestions, I'm all ears.

@iximeow (Member):

thinking through when this migration would happen in practice, we'd have done the normal parallel-park.sh that kills all the zones, so VMMs could be in ~any state if an instance has a VMM at all.. the issue would be if a VMM were stopping, stopped, or destroyed (and the instance itself was not in those states!), where we'd set the intended state back to running.

i think these cases are all ones we'd have unintentionally auto-restarted when the update finished anyway, so not being fancier here is pretty minor. does that sound right?

that said i think you can do something like

UPDATE instance SET intended_state = CASE
    WHEN instance.state = 'vmm' AND vmm.state = 'running' THEN 'running'
    ...
END FROM omicron.public.instance instance LEFT JOIN omicron.public.vmm ON instance.vmm.active_propolis_id = vmm.id

but i'm certain that the syntax is wrong in some dumb way and it might not be entirely worth fighting
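For reference, a syntactically plausible CockroachDB/Postgres rendering of the sketch above might look like the following. This is hedged heavily: it has not been run against the actual omicron schema, and the join column names (`active_propolis_id`, `vmm.state`) are assumed from context in this thread.

```sql
-- Hypothetical variant of the migration that also consults the active VMM's
-- state (untested sketch; column names assumed, not verified against the
-- omicron schema).
SET LOCAL disallow_full_table_scans = off;
UPDATE omicron.public.instance AS instance
SET intended_state = CASE
    WHEN i.state = 'destroyed' THEN 'destroyed'
    WHEN i.state = 'vmm' AND vmm.state = 'running' THEN 'running'
    ELSE 'stopped'
END
FROM omicron.public.instance AS i
LEFT JOIN omicron.public.vmm AS vmm
    ON i.active_propolis_id = vmm.id
WHERE instance.id = i.id;
```

The `UPDATE ... FROM` self-join is the usual way to get a `LEFT JOIN` against the target table; instances with no active VMM row fall through the `CASE` to `'stopped'`.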

@hawkw (Member, Author), Apr 29, 2025:
So, you're correct that at present, we will have already killed all the running VMMs when migrations are being run...I had gotten it into my head that it was worth making sure that the migration could handle a future where we didn't do that, but now that I think about it, it doesn't really make sense, since we wouldn't be upgrading from a future version that updated without killing VMMs...

@hawkw (Member, Author) commented Apr 27, 2025

As an aside, note that I did consider that perhaps we should just move the instance to Stopped rather than Failed if we discover that it has disappeared when the user requested that it be stopped, but I decided not to do that: it feels like the user should still be made aware of the failure as a distinct thing from "the instance was where we expected it to be and now it has stopped normally". I'm open to being convinced otherwise about that, though.

@iximeow (Member) left a comment:
seems good! i'm basically a +1, just the one dbg string that definitely ought to get fixed up first


// Okay, now let's try restarting the instance, failing it, and then
// asserting it *does* reincarnate.
dbg!(expect_instance_start_ok(client, instance_name).await);
@iximeow (Member):
i do kinda wonder if it makes sense for the expect_*... helpers to trace!("starting {instance_name} and expecting it to work") so we don't feel a need to get that kinda activity log via dbg!(). might even work nicely if we #[track_caller] fn expect_instance_start_ok() and used the Location to log about where the helpers were being invoked...

obviously just tossing ideas out there, dbg!() is plenty common and this'd be pretty tangential.

@hawkw (Member, Author):
personally, i prefer to use `dbg!` for stuff that the main test function does, rather than stuffing it in the log, because everything that gets printed to stderr is displayed when the test fails, so you don't have to `looker -f /tmp/test_whatever.0.some` nonsense to get it. but others may disagree with me on this.

@iximeow (Member):
oh yeah that's true, i can see not having to fish up stuff from the log being nice. as someone who vim /tmp/test_whatever.0.blah.log i weigh my aesthetic preferences here very low :)

@gjcolombo (Contributor):
Not even running it through looker first?

Courage is an Oxide value, I suppose.

@hawkw (Member, Author):
> might even work nicely if we `#[track_caller] fn expect_instance_start_ok()` and used the Location to log about where the helpers were being invoked...

should also mention for future reference that #[track_caller] doesn't work correctly with async fns, sadly.

@iximeow (Member):
oh TIL, ow ow ow.
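For future reference, the usual workaround is to keep `#[track_caller]` on a *synchronous* outer fn and capture the `Location` before entering any async body (on stable Rust, the attribute has no effect when applied directly to an `async fn`). A minimal, hypothetical sketch — the helper names here are illustrative, not from this PR:

```rust
use std::panic::Location;

// #[track_caller] works on synchronous fns: Location::caller() reports the
// call site of *this* function, which is what we'd want in the test log.
#[track_caller]
fn start_context(instance_name: &str) -> String {
    let loc = Location::caller();
    format!(
        "starting {instance_name} and expecting it to work (at {}:{})",
        loc.file(),
        loc.line()
    )
}

// A hypothetical async helper would capture the context eagerly in a
// non-async wrapper, then move it into the async block:
//
//   #[track_caller]
//   fn expect_instance_start_ok(name: &str) -> impl Future<Output = ()> + '_ {
//       let ctx = start_context(name); // captured *before* any .await
//       async move { /* log ctx, then issue the start request */ }
//   }

fn main() {
    let msg = start_context("my-instance");
    assert!(msg.starts_with("starting my-instance and expecting it to work"));
    println!("{msg}");
}
```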

Comment on lines +1457 to +1458
dbg!(
instance_wait_for_state(client, instance_id, InstanceState::Failed)
@iximeow (Member):
fun fact: apparently github syntax highlighting is criminally misleading in macros? instance_wait_for_state is not highlighted at all, so if you're skimming it looks like this is just some variable being printed (unlike function calls outside macros that get purpled) i totally missed the contents of dbg!() are load-bearing "make the test go" stuff...

(i know this highlights reasonably locally so it only really matters if you're looking through github anyway)

Comment on lines +1474 to +1486
for secs in 0..30 {
let state = instance_get(client, &instance_url).await;
assert_eq!(
state.runtime.run_state,
InstanceState::Failed,
"instance transitioned out of Failed (to {}) after {secs} \
seconds! state: {:#?}",
state.runtime.run_state,
state
);
assert_eq!(state.runtime.time_last_auto_restarted, None);
tokio::time::sleep(Duration::from_secs(1)).await;
}
@iximeow (Member):
kinda nitpicky: is the loop actually necessary? we could tokio::time::sleep(Duration::from_secs(30)) and not lose that much by taking state.runtime.last_auto_restarted as the time that auto-restart incorrectly kicked this instance right? my thought is that the secondly polls only really help if the test would fail, and in those cases we'd go look at the logs, but the interesting Nexus activity would be mixed with a bunch of us asking "Is It Wrong Yet?"

@hawkw (Member, Author):
🤷‍♀️ yeah, maybe it's better to just poll once; this whole bit is quite unfortunate honestly. i thought failing faster was maybe worth it (versus waiting all 30s every time) but i see your point about the logs.


Co-authored-by: iximeow <iximeow@oxide.computer>
@gjcolombo (Contributor) left a comment:
This generally LGTM, but I would add one bit: if the instance update saga detects that the instance's current VMM has moved to Stopped, I would also change the intent (as part of the broader update) to Stopped instead of leaving it in Running. The idea is to cover the case where a guest shuts down without going through the API: we may not know what the user intent was in that case (was the user expecting the guest to do this?), but I think leaving the intent unchanged increases the chance that we will later misinterpret the intent in this field and start an instance that previously stopped gracefully. (It also means that omdb won't display a self-stopped instance as having "running" intent: one might ask why such an instance is stopped if the intent says it should be running!)

To be clear, I think the code in this PR behaves as intended; this is just a dose of defensive programming (and the omdb observability improvement).

Comment on lines +1471 to +1473
// Unfortunately there isn't really a great way to assert "no start saga has
// been started", so we'll do the somewhat jankier thing of waiting a bit
// and making sure that the instance doesn't transition to Starting.
@gjcolombo (Contributor):
How incredibly awful would it be to try to peer into the saga table omdb-style to try to look for one?



@hawkw (Member, Author) commented Apr 29, 2025

> This generally LGTM, but I would add one bit: if the instance update saga detects that the instance's current VMM has moved to `Stopped`, I would also change the intent (as part of the broader update) to `Stopped` instead of leaving it in `Running`.

Yeah, that's a good thought, I'll make that change as well. I wonder if we want to have a different enum variant to represent that, so that we can distinguish between "stopped due to API request" and "shut down internally" in logs/omdb.

@hawkw (Member, Author) commented Apr 30, 2025

> This generally LGTM, but I would add one bit: if the instance update saga detects that the instance's current VMM has moved to `Stopped`, I would also change the intent (as part of the broader update) to `Stopped` instead of leaving it in `Running`.
>
> Yeah, that's a good thought, I'll make that change as well. I wonder if we want to have a different enum variant to represent that, so that we can distinguish between "stopped due to API request" and "shut down internally" in logs/omdb.

Okay @gjcolombo, I'm looking into adding this this morning. One thing I wanted to bring up is that if we start setting a "guest shutdown" intent when the VMM transitions to Stopped, we'll start instance-update sagas on the Stopped transition, which we don't currently do. Presently, we only run an update saga when the VMM goes to Destroyed. So, if we start setting a "guest stopped" intent from an update saga on Stopped VMM transitions, we're going to be starting a whole additional update saga that's basically JUST changing the intended state fields, which kind of feels like overkill. In this case, I wonder if it's better to just set that field outside the update saga...

@hawkw (Member, Author) commented Apr 30, 2025

Alternatively, I think it's probably fine to do that on the "VMM destroyed" transition instead of the "VMM stopped" transition, because in that case, we're already doing an update saga, and we won't change the instance's state to Stopped until the VMM is Destroyed anyway.

@gjcolombo (Contributor) commented:

If by "do that" you mean "set the intent to Stopped" then I agree!

Successfully merging this pull request may close these issues.

should instance-stop *really* move instances to failed when they are discovered to already be gone?