Set failure only on instance error when no nodeRef #1637
Hi @mquhuy. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/EasyCLA |
/ok-to-test |
Thank you, @mquhuy! Could you be so kind as to look into the status of your CLA again? |
He is on vacation this week. I will check it with him when he is back 🙂
We've long known we've got a bunch of cleanup work to do here, but until now I've thought the goal would be:

My thinking is that if the machine has not become Ready then it has no workload to worry about, so you might as well delete it. I'm prepared to be challenged on this thinking, though.

A concern I have is that this would mask some errors which probably shouldn't be retried, e.g. specifying an invalid resource. Some of these things we (sometimes) validate prior to server creation, but ultimately we rely on Nova to return us an error if, e.g., the user specified an invalid image ID. Not putting the machine into a failed state in this situation makes a configuration error much harder to find.

If we were going to ignore creation failure like this, at a minimum I'd like to work out precisely which errors are ignorable. We should also look at Nova's return value for server create to work out if it distinguishes between an invalid request and a transient error. It probably does, and if so we should distinguish between those two. Ideally we should ignore transient errors and retry, but we should not ignore invalid requests. If we can't do the work to distinguish between the two yet, my preference would probably be to leave it as-is and make all errors prior to machine Ready fatal. You can still just delete the machine.

Again, this is just my thinking, though. I'm very much prepared to have it challenged by real-world experience.
To me this sounds like exactly the desired behaviour. The issue detailed in #1582 does, AFAIUI, happen when the machine has been Ready before, but does not recover from an error state once the network has been unavailable for extended periods of time. I think this PR is motivated by the suspicion that the controller runs into the code path in question when there is network instability.
Exactly on point, @wwentland! 🎯
I'm taking this over since @mquhuy is gone for a bit longer (we are colleagues). |
Force-pushed from b4121ee to 407a95e
I would prefer to fix …
This all sounds sensible. My one thought would be whether we have a timeout that results in a transition to Failed rather than retrying infinitely. This would help those of us building UIs on top of CAPI, where an infinitely retrying entity is quite hard for users to grok, especially non-technical ones… |
Force-pushed from 407a95e to c374567
The whole point is that if we call … Regarding the … My suggestion is to only set conditions (not failure) for these two cases:

If we have specific cases that we know are permanent errors, then I'm fine with adding checks for them, but without checking the error I say we should treat it as temporary.
/test pull-cluster-api-provider-openstack-e2e-test |
Can we define this more clearly? E.g. in many cases we can say we can't create: there is temporarily no network connection, it's a parameter error, or something is wrong server-side (OpenStack returns a 500) but may be fixed soon, etc.
There are currently only 3 places where we call machine.SetFailure:

Presumably your issue is the result of one of these 3.

The first is this one. As noted above I think that's dead code and we should just delete it, but (assuming it's dead code) it's not your issue.

The last isn't your issue either because you're not deleting. AFAICT the only way that could happen is if OpenStack was returning us data we can't parse, which should never happen. It's probably not worth spending a lot of time thinking about it.

So I'm guessing your issue is the second one. At the end of a reconcile the machine is in the ERROR state. My take on this one is that we should only SetFailure() here if the machine has ever been ready.
Thanks @mdbooth, that makes sense! I will try to reproduce the issue to double check this is indeed the case. |
Alright, I now have some more information on this. It is not about the network (at least not between CAPO and OpenStack). The issue was that some part of OpenStack was down (I guess Nova?), which caused all servers to report error status. So to summarize:
Sounds to me like it was working as intended. From a user point of view I can also understand the frustration that the Machines and OpenStackMachines stayed in failed state, but I guess this is such a rare/unique situation that it may still be for the best. |
I still think we can do better here. I don't think we should be setting the failed condition on a Machine which has had the opportunity to run a workload. How about we look to see if the Machine has a nodeRef? If it doesn't we set failed. If it does we retry forever and... do something else to make it really obvious there's a problem which needs looking at. |
Sounds good, I will give it a go! |
Force-pushed from c374567 to 24f314e
Force-pushed from 24f314e to abd6644
Force-pushed from abd6644 to c889efe
Setting a failure message is permanent and requires user action to get rid of. We cannot know if the error is permanent, though. The idea with this commit is that if the machine has a nodeRef, we know that the instance was working at some point, so the error could well be temporary. Therefore we do not set failure in this case. On the other hand, if the machine does not have a nodeRef, we set failure since it is more likely that there is some configuration error that will not go away.

Remove dead code and improve logging slightly in similar cases.
Force-pushed from c889efe to 1347c00
/lgtm
/approve |
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lentzi90, mquhuy

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
What this PR does / why we need it:
Setting a failure message is permanent and requires user action to get
rid of. We cannot know if the error is permanent though. The idea with
this commit is that if the machine has a nodeRef, we know that the
instance was working at some point, so the error could well be
temporary. Therefore we do not set failure in this case.
On the other hand, if the machine does not have a nodeRef, we set
failure since it is more likely that there is some configuration error
that will not go away.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):

Fixes #1582. For more details please read my comment on that issue.
TODOs:
/hold