Skip to content

Ingest ereports from SPs #370

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 62 commits into
base: main
Choose a base branch
from
Open

Ingest ereports from SPs #370

wants to merge 62 commits into from

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Mar 27, 2025

This pull request implements the MGS side of the SP ereport ingestion
protocol. For more information on the ereport ingestion protocol, refer
to the following RFDs:

In particular, this branch makes the following changes:

  • Add types to gateway-messages representing the ereport protocol wire
    messages exchanged between MGS and the SP; these are defined in
    RFD 545.
  • Somewhat substantial refactoring to the shared_socket module in
    gateway-sp-comms. Currently, the SharedSocket code for handling
    received packets is tightly coupled to the control plane agent message
    types. Ereport requests and responses are sent on a separate UDP port.
    Therefore, I've hacked up this code a bit to allow SharedSocket to
    be generic over a RecvHandler trait that defines how to handle
    received packets and dispatch them to single-SP handlers. This is
    implemented for both the control-plane-agent protocol and, separately,
    for the ereport protocol.
  • Actually add said implementation of the ereport protocol, including
    code for decoding ereport packets and a per-SP worker task that tracks
    the metadata sent by the SP and adds it to each batch of ereports.

A corresponding Omicron branch, oxidecomputer/omicron#7903, depends on
this branch and integrates the ereport code into the MGS app binary and
the SP simulator.

@hawkw hawkw requested a review from cbiffle March 27, 2025 23:08
@hawkw hawkw changed the title Ereport protocol messages Ingest ereports from SPs Apr 6, 2025
Copy link
Collaborator

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look particularly closely at the ereport parsing etc details (I'll defer that to you and Cliff, if that's okay). The structural MGS changes look good; just a handful of nits and questions.

recv_handler_task: JoinHandle<()>,
log: Logger,
}

impl Drop for SharedSocket {
// Hand-rolled `Debug` impl as the message type (`T`) needn't be `Debug` for the
// `SharedSocket` to be debug.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine, but I'll plug https://docs.rs/derive-where/latest/derive_where/ since we use it in omicron, if you want to pull it in here too.

@@ -282,15 +284,18 @@ details in `dump.json`."
pub struct SingleSp {
interface: String,
cmds_tx: mpsc::Sender<InnerCommand>,
ereport_req_tx: mpsc::Sender<ereport::WorkerRequest>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not do this as a part of this PR, but I'm curious for your thoughts: there are a bunch of places like this one where we're going from one "thing" (in this case, an mpsc::Sender) to two "things". The original "one" is always named generically, and the new one is named indicating it's related to ereports. Do you think we should go back and rename the generic things to indicate they're intended for control-plane-agent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it might be worthwhile to do that (especially if we ever add a third port to the management network...). You're right that I just didn't really want to touch all the control-plane-agent code in this PR, but I'd be happy to do it in a separate commit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No argument from me. Do you think it would be clearer to move the control-plane-agent stuff to its own submodule too? (Not as part of this PR, but maybe alongside the renamings I suggested in another comment?)

hawkw added 4 commits April 10, 2025 11:29
i thought i had gotten rid of these...
as per [this comment][1] from @jgallagher. this is
similar to the control-plane-agent protocol. wow, it's
almost like we're reimplementing TCP (but without flow
control because thats hard).

[1]: #370 (comment)
@hawkw hawkw requested a review from jgallagher April 10, 2025 20:23
hawkw added 3 commits April 10, 2025 15:10
this way they get logged as UUIDs, which is nicer, as it matches
how the control plane formats them
hawkw added 18 commits April 14, 2025 11:08
per today's chat with @cbiffle.

this is to prevent a malformed ereport from wrecking the rest of the
packet if it contains a break byte (0xff) or similar.
this way, if we encounter an individual ereport that's malformed, we
decode the rest of the packet and let Nexus decide what to do.
this will make life easier for upstack software
this moves the message types out of `gateway-messages` (used by the
`control-plane-agent` task) to their own crate, so that the snitch task
can use them. i've also switched from hubpack serialization to
`zerocopy`, as this is what the snitch is using. this should permit
sharing the type defs between the SP and MGS.

note that i've also updated `zerocopy` from v0.6 to v0.8 here, as
(AFAICT) the older version doesn't know how to do fallible zerocopy
from-bytes conversions for enums. i'm happy to land the version update
separately.
hawkw added a commit that referenced this pull request Apr 29, 2025
This branch updates our `zerocopy` dependency from v0.6.x to v0.8.x. I
initially made the upgrade as I wanted to use some new `zerocopy`
features in #370, but have factored it out to land separately. Of
course, we'll need to update Hubris and MGS, as well.

Note that `zerocopy`'s `read_from_prefix` now returns the rest of the
buffer, making some code a little bit simpler where we were previously
doing that manually. Other than that, there's not a lot to this change,
besides deriving some additional marker traits (`Immutable` and
`KnownLayout`).
@hawkw hawkw mentioned this pull request Apr 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants