-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-32776: Fix IBM Public Cloud DNS Provider Update Logic #1133
OCPBUGS-32776: Fix IBM Public Cloud DNS Provider Update Logic #1133
Conversation
@gcs278: This pull request references Jira Issue OCPBUGS-32776, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
b387d07
to
b07d602
Compare
/jira refresh |
@gcs278: This pull request references Jira Issue OCPBUGS-32776, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@gcs278: This pull request references Jira Issue OCPBUGS-32776, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
I missed one detail of the bug, missing instructions in the |
a057f1a
to
0716651
Compare
@lihongan Added the missing scope change instructions for the PowerVS type. /unhold |
infra failures |
0716651
to
f51b1d3
Compare
Infra issues |
pre-merge tested on OpenStack and looks good now
will test on IBMCloud later |
pre-merge tested on IBMCloud and also looks good
|
/assign |
Progressing
Condition
f51b1d3
to
7a97b9a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thank you for the fix!
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: candita The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Releasing the hold due to no new reviews/comments. /hold cancel |
While maybe not absolutely critical, I consider this fix a major fix as users are prevented from recreating IngressController's service IBMCloud platforms (something we explicitly tell them to do when they change scope). I also have other commitments that prevent me from waiting another month to merge this. /label acknowledge-critical-fixes-only |
@@ -108,6 +108,23 @@ func Test_Delete(t *testing.T) { | |||
}, | |||
expectErrorContains: "error in ListAllDnsRecords", | |||
}, | |||
{ | |||
desc: "list success but missing ID", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this should be "list failure (fake timeout) and missing ID". However, it really doesn't matter what the status is if the id is missing or mismatched, so maybe the description should be "list failure (missing ID)"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, a bit of a copypasta. The response code doesn't get checked if the error is nil, but I meant to just have it be a success to be a less confusing test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see, the HTTP error is ignored because ListResourceRecords
returns a nil error value. I guess "list success" refers to the error value, not the HTTP code. It's kind of confusing to have a test case where the HTTP code indicates an error but ListResourceRecords
reports no error. I wouldn't have said anything if you had http.StatusOK
instead of http.StatusRequestTimeout
. I guess it doesn't really matter, and the conspicuous test input did prompt me to look more closely at the logic, which is a good thing. ¯\_(ツ)_/¯
{ | ||
desc: "list success but nil dns record list", | ||
DNSName: "testDelete", | ||
listAllDnsRecordsInputOutput: dnsclient.ListAllDnsRecordsInputOutput{ | ||
OutputError: nil, | ||
OutputResponse: &core.DetailedResponse{StatusCode: http.StatusOK}, | ||
OutputResult: &dnssvcsv1.ListResourceRecords{}, | ||
}, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The looks the same as "create happy path":
cluster-ingress-operator/pkg/dns/ibm/private/dnssvcs_provider_test.go
Lines 385 to 400 in c5d48ec
{ | |
desc: "create happy path", | |
DNSName: "testCreate", | |
target: "11.22.33.44", | |
recordedCall: "CREATE", | |
listAllDnsRecordsInputOutput: dnsclient.ListAllDnsRecordsInputOutput{ | |
OutputError: nil, | |
OutputResponse: &core.DetailedResponse{StatusCode: http.StatusOK}, | |
OutputResult: &dnssvcsv1.ListResourceRecords{}, | |
}, | |
createDnsRecordInputOutput: dnsclient.CreateDnsRecordInputOutput{ | |
InputId: "testCreate", | |
OutputError: nil, | |
OutputResponse: &core.DetailedResponse{StatusCode: http.StatusOK}, | |
}, | |
}, |
Ah, but "list success but nil dns record list" is missing target
. Should it rather be called "list success but no matching entry in dns record list"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I guess the "create happy path" is the same as "no matching entry in dns record list". In both cases, the list happens to be empty, but in this case, the target is an empty string.
It's weird that specifying an empty target doesn't cause an error, and it isn't clear that this test case was intended to test that scenario. Given that that's the current behavior, I don't mind having a test case for it. However, the test case should be named clearly (something like "create with an empty target"), and we might follow up with a fix to reject records with empty target and update the test case accordingly to expect an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you misplaced your comment on the delete unit test, but I think you meant to comment on the one for create (same named test cases).
Yea I actually think your first thought is correct, it feels redundant. The target does not impact anything in this test. Right, I was going to say, maybe it provides value by encoding/protecting a specific behavior, but that wasn't the intention.
I'll at least update the name.
Working through some comments |
if response != nil && response.StatusCode != http.StatusNotFound { | ||
return fmt.Errorf("createOrUpdateDNSRecord: failed to list the dns record: %w", err) | ||
} | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is really subtle, and it isn't clear to me how it relates to the description of the changes in the commit message. (Indeed, it would be easy not to notice this change at all, but for the change to the unit tests!)
Of course, you cannot keep the literal continue
statement since this logic is no longer inside a for
loop. However, you could have replaced continue
with return
. Presumably, the reason you instead simply drop the continue
is that the old logic was wrong: It was skipping the create/update in the case of a not-found error, but the correct behavior is to create in the case of a not-found error. Am I correctly understanding the old behavior and the intent of your change?
The commit message mentions two changes to this logic:
- "remove the filtering by target and rely solely on filtering by name" and
- "only create using the first target".
I think it makes sense to mention "fall through to the create logic in the case of a not-found error" as a distinct change as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably, the reason you instead simply drop the continue is that the old logic was wrong: It was skipping the create/update in the case of a not-found error, but the correct behavior is to create in the case of a not-found error. Am I correctly understanding the old behavior and the intent of your change?
Exactly, that was my thought. One other point:
- It was inconsistent with the createOrUpdateDNSRecord in dnsvcs_provider.go for no obvious reason. Based on the reason you explained, I dropped the continue/return, reference:
cluster-ingress-operator/pkg/dns/ibm/private/dnssvcs_provider.go
Lines 201 to 209 in b0b91bb
listResult, response, err := p.dnsService.ListResourceRecords(listOpt) if err != nil { // Avoid continuing with an invalid list response, as we can't determine // whether to create or update the DNS record, which may lead to further issues. if response == nil || response.StatusCode != http.StatusNotFound { return fmt.Errorf("createOrUpdateDNSRecord: failed to list the dns record: %w", err) } } if listResult == nil {
I'll update the commit message.
I've finished reviewing. Sorry for the last-minute comments! There's nothing absolutely critical, so |
The IBM Public Cloud DNS provider (`cis_provider.go`) had a bug in `createOrUpdateDNSRecord` where it checked for the existence of a DNS record by filtering both DNS name and target. If the target was updated (e.g., due to a load balancer recreation), the logic would not match the existing DNS record. As a result, the function would attempt to create a new record, but fail because a record with that name already existed, as multiple DNS records with the same name are not allowed in IBM Cloud DNS providers. The fix is to remove the filtering by target and rely solely on filtering by name, as the name is the only attribute that needs to be unique. This fix also includes a subtle fix in `createOrUpdateDNSRecord` of cis_provider.go to fall through to the create logic in the case of a not-found (404) error. Additionally, the IBM DNS logic doesn't work for multiple targets and this creates unexpected and problematic results. The logic has been refactored to only create using the first target and it warns the user when multiple targets are set. This change is low risk since the Ingress Operator will never create a DNSRecord with multiple targets in `desiredDNSRecord`. Lastly, modify TestScopeChange to check for `Available=True` after deleting the service to add test coverage for this issue.
b0b91bb
to
d2160e6
Compare
@gcs278: This pull request references Jira Issue OCPBUGS-32776, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@Miciah ready for review/lgtm. |
Thanks! |
Another case where acquiring an AWS lease took 30 minutes and the test timed out after 4 hours. |
I forgot to remove the hold /hold cancel Installation failures and unrelated failures (nothing related to ibmcloud) |
edf5e71
into
openshift:master
@gcs278: Jira Issue OCPBUGS-32776: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-32776 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@gcs278: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
[ART PR BUILD NOTIFIER] Distgit: ose-cluster-ingress-operator |
The IBM Public Cloud DNS provider (
cis_provider.go
) had a bug increateOrUpdateDNSRecord
where it checked for the existence of a DNS record by filtering both DNS name and target. If the target was updated (e.g., due to a load balancer recreation), the logic would not match the existing DNS record. As a result, the function would attempt to create a new record, but fail because a record with that name already existed, as multiple DNS records with the samename are not allowed in IBM Cloud DNS providers.
The fix is to remove the filtering by target and rely solely on filtering by name, as the name is the only attribute that needs to be unique.
This fix also includes a subtle fix in
createOrUpdateDNSRecord
of cis_provider.go to fall through to the create logic in the case of a not-found (404) error.Additionally, the IBM DNS logic doesn't work for multiple targets and this creates unexpected and problematic results. The logic has been refactored to only create using the first target and it warns the user when multiple targets are set. This change is low risk since the Ingress Operator will never create a DNSRecord with multiple targets in
desiredDNSRecord
.Additionally, modify TestScopeChange to check for
Available=True
after deleting the service to add test coverage for this issue.This PR also includes some unit test fix up and missing unit test coverage for the IBM CIS Provider.
This resolves the same DNS issues for public PowerVS cloud as it uses the same logic.