HCP Cluster Resource Deletion Cascading Subscription Delete #920

Merged
merged 7 commits into from
Jan 15, 2025

Conversation

@mbarnes mbarnes commented Dec 3, 2024

What this PR does

When a subscription state changes to "Deleted", the RP now triggers a deletion of all HCP clusters under the subscription as per the Resource Provider Contract.

Behind the scenes, this introduces the concept of "implicit" and "explicit" async operations:

  • An "implicit" async operation has an "Operation" item in Cosmos DB, but no status endpoint for ARM to poll.
  • An "explicit" async operation starts as an "implicit" operation. The Frontend.ExposeOperation method enriches the "Operation" item with information necessary to make the status endpoint accessible to ARM, and adds appropriate async headers to an http.ResponseWriter.

Importantly, the backend pod does not distinguish between implicit and explicit async operations. At the moment, implicit async operations are used only for deletions, and their sole purpose is to let the backend delete the "Resource" item in Cosmos DB after the actual resource is deleted.

Jira: ARO-13321 - Implement Cascading Subscription Deletion

Special notes for your reviewer

  • This duplicates a few database iterator commits from #883, which is still outstanding.
  • I'll add unit tests for this in a follow-up PR after I convert our existing unit tests to use gomock for Cosmos DB operations. To add unit tests now would just create extra work for myself in the conversion effort.
  • I still need to document asynchronous operation mechanics in general and this "implicit" vs "explicit" concept will be part of it. I've been holding off on writing documentation until I can make some (imo) necessary changes to our database design. This is just to say I haven't forgotten about it.

github-actions bot commented Jan 8, 2025

Please rebase pull request.

Matthew Barnes added 7 commits January 8, 2025 10:49
Add "externalID" and "internalID" parameters so the returned
document is a minimum valid OperationDocument for writing.

The operation item must now be created in the database prior to
calling ExposeOperation. ExposeOperation does all its processing
in a database update callback.

This is because there is an increasing number of cases where we
create an implicit async operation with no visible status endpoint.
Calling ExposeOperation makes an implicit async operation explicit,
with a status endpoint for ARM to poll. Hence the rename.

The tradeoff is explicit asynchronous operations now require two
database operations (create and update) but it helps make the RP
logic cleaner. This could possibly be mitigated in the future by
using Cosmos DB's transactional batch operations, but it's gonna
take some serious refactoring to get there.
CancelActiveOperation marks the status of any active operation on
the resource as canceled.
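A rough sketch of what canceling active operations could look like is below. The function name, the `Status` field, and the set of terminal states (Succeeded, Failed, Canceled) are assumptions of this example, not taken from the actual change.

```go
package main

import "fmt"

// OperationDocument is a hypothetical, pared-down operation item.
type OperationDocument struct {
	ExternalID string // resource the operation acts on
	Status     string
}

// isTerminal reports whether a status is final. Treating Succeeded, Failed
// and Canceled as the terminal states is an assumption of this sketch.
func isTerminal(status string) bool {
	switch status {
	case "Succeeded", "Failed", "Canceled":
		return true
	}
	return false
}

// cancelActiveOperations marks every non-terminal operation on resourceID
// as canceled and returns how many operations it changed.
func cancelActiveOperations(ops []*OperationDocument, resourceID string) int {
	canceled := 0
	for _, op := range ops {
		if op.ExternalID == resourceID && !isTerminal(op.Status) {
			op.Status = "Canceled"
			canceled++
		}
	}
	return canceled
}

func main() {
	ops := []*OperationDocument{
		{ExternalID: "cluster-a", Status: "Accepted"},
		{ExternalID: "cluster-a", Status: "Succeeded"},
		{ExternalID: "cluster-b", Status: "Accepted"},
	}
	fmt.Println("canceled:", cancelActiveOperations(ops, "cluster-a"))
}
```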
Will be reusing DeleteResource for subscription deletion.

Add database bookkeeping for the resource and any child resources.
This includes creating implicit operations for each resource being
deleted. The caller may then expose the returned operation ID.
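The bookkeeping for a resource and its children might look roughly like this. Everything here is a simplified assumption: the `database` type, the ID-prefix test for child resources, and the generated operation IDs are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// OperationDocument is a hypothetical, pared-down operation item.
type OperationDocument struct {
	Request    string // kind of request; always "Delete" in this sketch
	ExternalID string // resource being deleted
}

// database is an in-memory stand-in for the resource and operation items.
type database struct {
	resources  map[string]bool               // keyed by resource ID
	operations map[string]*OperationDocument // keyed by operation ID
	nextID     int
}

// newOperation records an implicit delete operation and returns its ID.
func (db *database) newOperation(resourceID string) string {
	db.nextID++
	id := fmt.Sprintf("op-%d", db.nextID)
	db.operations[id] = &OperationDocument{Request: "Delete", ExternalID: resourceID}
	return id
}

// deleteResource creates implicit delete operations for resourceID and for
// every child resource nested under it. It returns the parent's operation
// ID, which the caller may expose, or "" if the resource is unknown.
// Resource items are left in place; removing them is the backend's job.
func (db *database) deleteResource(resourceID string) string {
	if !db.resources[resourceID] {
		return ""
	}
	var children []string
	for id := range db.resources {
		if strings.HasPrefix(id, resourceID+"/") {
			children = append(children, id)
		}
	}
	sort.Strings(children) // deterministic order for the sketch
	for _, child := range children {
		db.newOperation(child)
	}
	return db.newOperation(resourceID)
}

func main() {
	db := &database{
		resources: map[string]bool{
			"/clusters/c1":               true,
			"/clusters/c1/nodePools/np1": true,
			"/clusters/c2":               true,
		},
		operations: map[string]*OperationDocument{},
	}
	opID := db.deleteResource("/clusters/c1")
	fmt.Println("operation to expose:", opID, "total operations:", len(db.operations))
}
```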
By my read of the Subscription Lifecycle API Reference [1], we
should favor 200 OK over 201 Created when creating or updating
a subscription.

[1]
https://github.com/cloud-and-ai-microsoft/resource-provider-contract/blob/master/v1.0/subscription-lifecycle-api-reference.md#response

Called when a subscription is deleted. The method is idempotent in
case of multiple subscription PUT requests.
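One way to get that idempotency is to skip clusters that already have a delete operation, as in the sketch below. The `database` type, the method names and the operation ID format are hypothetical; only the "Deleted" state and the cascade behavior come from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// database is an in-memory stand-in for the RP's bookkeeping.
type database struct {
	clusters  []string          // cluster resource IDs
	deleteOps map[string]string // resource ID -> implicit delete operation ID
}

// deleteAllClusters starts a deletion for every cluster under the
// subscription that does not already have one, and returns how many
// deletions it started. Repeated calls are therefore harmless.
func (db *database) deleteAllClusters(subscriptionID string) int {
	prefix := "/subscriptions/" + subscriptionID + "/"
	started := 0
	for _, id := range db.clusters {
		if !strings.HasPrefix(id, prefix) {
			continue // belongs to another subscription
		}
		if _, active := db.deleteOps[id]; active {
			continue // deletion already in progress
		}
		db.deleteOps[id] = fmt.Sprintf("op-%d", len(db.deleteOps)+1)
		started++
	}
	return started
}

// putSubscription reacts to a subscription PUT; only "Deleted" cascades.
func (db *database) putSubscription(subscriptionID, state string) int {
	if state != "Deleted" {
		return 0
	}
	return db.deleteAllClusters(subscriptionID)
}

func main() {
	db := &database{
		clusters: []string{
			"/subscriptions/sub1/clusters/a",
			"/subscriptions/sub1/clusters/b",
			"/subscriptions/sub2/clusters/c",
		},
		deleteOps: map[string]string{},
	}
	fmt.Println("first PUT started:", db.putSubscription("sub1", "Deleted"))
	fmt.Println("second PUT started:", db.putSubscription("sub1", "Deleted"))
}
```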
Don't count on OperationID being set in OperationDocuments.
Implicit async operations will not have this field set. Get
the subscription ID from ExternalID instead.
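As an illustration, the subscription ID can be recovered from the resource ID alone. This sketch parses a plain string; the actual code presumably works with a parsed resource ID type, so `subscriptionIDFrom` and the string-typed fields are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// OperationDocument is a hypothetical, pared-down operation item.
type OperationDocument struct {
	ExternalID  string // always set
	OperationID string // empty for implicit operations
}

// subscriptionIDFrom extracts the subscription ID from an ARM-style resource
// ID such as "/subscriptions/{id}/resourceGroups/...". It returns false if
// the ID does not start with a non-empty subscriptions segment.
func subscriptionIDFrom(resourceID string) (string, bool) {
	parts := strings.Split(strings.TrimPrefix(resourceID, "/"), "/")
	if len(parts) < 2 || !strings.EqualFold(parts[0], "subscriptions") || parts[1] == "" {
		return "", false
	}
	return parts[1], true
}

func main() {
	// An implicit operation: OperationID is unset, ExternalID still works.
	doc := OperationDocument{ExternalID: "/subscriptions/sub1/resourceGroups/rg1"}
	id, ok := subscriptionIDFrom(doc.ExternalID)
	fmt.Println(id, ok)
}
```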

@mociarain mociarain left a comment

LGTM, just a few small questions and a thought

@@ -94,13 +95,15 @@ type OperationDocument struct {
Error *arm.CloudErrorBody `json:"error,omitempty"`
}

-func NewOperationDocument(request OperationRequest) *OperationDocument {
+func NewOperationDocument(request OperationRequest, externalID *arm.ResourceID, internalID ocm.InternalID) *OperationDocument {
Collaborator

Would we consider splitting this functionality into NewImplicitOperationDocument and NewExplicitOperationDocument? This would help cement the concept and make it more visible. We could even extend the OperationDocument type with Implicit/Explicit, but that might be overkill and I don't have a good sense of whether it's worth it.

What do people think?

Collaborator Author

In terms of the operation document, what distinguishes explicit from implicit operations is certain fields being populated: namely TenantID, ClientID and most importantly OperationID. I'm all for making the code clearer, though I'm not sure if a separate document type is the way I'd go. The backend pod, for example, doesn't care about this explicit vs implicit distinction and treats all operations the same. So I wouldn't want the backend to have to handle two document types.

Maybe an OperationDocument.IsExplicit method, that just returns whether the OperationID field is non-empty, would be sufficient? (Or OperationDocument.IsExposed to align with Frontend.ExposeOperation... I don't know, I'm playing fast and loose with terminology here.)

In any case, I don't know that we have a use case for such a method at the moment but I'll keep this in the back of my mind.
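For reference, such a method could be as small as the sketch below. It assumes a string-typed OperationID for brevity; the field's real type and the rest of the struct may differ.

```go
package main

import "fmt"

// OperationDocument is a hypothetical, pared-down operation item.
type OperationDocument struct {
	TenantID    string
	ClientID    string
	OperationID string // empty for implicit operations
}

// IsExplicit reports whether the operation has been exposed, judged solely
// by whether OperationID is populated.
func (doc *OperationDocument) IsExplicit() bool {
	return doc.OperationID != ""
}

func main() {
	implicit := &OperationDocument{}
	explicit := &OperationDocument{TenantID: "t1", ClientID: "c1", OperationID: "op1"}
	fmt.Println(implicit.IsExplicit(), explicit.IsExplicit())
}
```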

Collaborator

That's fair. I don't have my head around it yet but hopefully the "right" answer appears at some stage :D

@@ -507,9 +507,18 @@ func (f *Frontend) ArmResourceCreateOrUpdate(writer http.ResponseWriter, request
}
}

-operationDoc, err := f.StartOperation(writer, request, doc, operationRequest)
+operationDoc := database.NewOperationDocument(operationRequest, doc.Key, doc.InternalID)
Collaborator

Side question: Any objection or issue if I make a PR to change Key -> ResourceId in the ResourceDocument Type just for readability?

Collaborator Author

Yeah, this has bugged me for a while. I've held off for the sake of backward-compatibility, but since I'm working on a much larger breaking change with our database code please feel free. (Or if you don't, I probably will.)

@mbarnes mbarnes merged commit 92f11fe into main Jan 15, 2025
10 checks passed
@mbarnes mbarnes deleted the subscription-deletion branch January 15, 2025 10:33