
Conversation

@mquhuy

@mquhuy mquhuy commented Aug 9, 2023

What this PR does / why we need it:

Setting a failure message is permanent and requires user action to
clear. We cannot know whether the error is permanent, though. The idea
with this commit is that if the machine has a nodeRef, we know that the
instance was working at some point, so the error could well be
temporary; therefore we do not set failure in this case.
On the other hand, if the machine does not have a nodeRef, we set
failure, since it is more likely that there is some configuration error
that will not go away.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1582. For more details, please read my comment on that issue.

TODOs:

  • squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 9, 2023
@linux-foundation-easycla

linux-foundation-easycla bot commented Aug 9, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: lentzi90 / name: Lennart Jern (1347c00)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Aug 9, 2023
@netlify

netlify bot commented Aug 9, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name | Link
🔨 Latest commit | 1347c00
🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/6502c46283d51b0008d8a133
😎 Deploy Preview | https://deploy-preview-1637--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot
Contributor

Welcome @mquhuy!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-openstack 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-openstack has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 9, 2023
@k8s-ci-robot
Contributor

Hi @mquhuy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 9, 2023
@mquhuy
Author

mquhuy commented Aug 9, 2023

/EasyCLA
/pull-cluster-api-provider-openstack-build
/pull-cluster-api-provider-openstack-test

@lentzi90
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 14, 2023
@wwentland
Contributor

Thank you, @mquhuy! Could you be so kind as to look into the status of your CLA again?

@lentzi90
Contributor

He is on vacation this week. I will check with him when he is back 🙂

@mdbooth
Contributor

mdbooth commented Aug 17, 2023

We've long known we've got a bunch of cleanup work to do here, but until now I've thought the goal would be:

  • Failure before machine is Ready -> Failed
  • Failure after machine is Ready -> Retry forever

My thinking is that if the machine has not become Ready then it has no workload to worry about, so you might as well delete it. I'm prepared to be challenged on this thinking, though.

A concern I have is that this would mask some errors which probably shouldn't be retried, e.g. specifying an invalid resource. Some of these things we (sometimes) validate prior to server creation, but ultimately we rely on Nova to return us an error if, e.g. the user specified an invalid image ID. Not putting the machine into a failed state in this situation makes a configuration error much harder to find.

If we were going to ignore creation failure like this, at a minimum I'd like to work out precisely which errors are ignorable. We should also look at Nova's return value for server create to work out if it distinguishes between an invalid request and a transient error. It probably does, and if so we should distinguish between those two. Ideally we should ignore transient errors and retry, but we should not ignore invalid requests. If we can't do the work to distinguish between the two yet, my preference would probably be to leave it as-is and make all errors prior to machine Ready fatal. You can still just delete the machine.

Again, this is just my thinking, though. I'm very much prepared to have it challenged by real-world experience.

Interested to know if @mkjpryor or @mnaser have an opinion.
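
A minimal sketch of the kind of error classification discussed above, assuming gophercloud v1 error types; the helper name and the exact set of status codes treated as fatal are placeholders for discussion, not something this PR adds:

import (
	"errors"

	"github.com/gophercloud/gophercloud"
)

// isInvalidRequest reports whether err looks like a request that Nova
// rejected outright (a 4xx response), which retrying will not fix.
// Anything else (timeouts, 5xx responses, connection failures) would be
// treated as potentially transient.
func isInvalidRequest(err error) bool {
	var badRequest gophercloud.ErrDefault400
	var notFound gophercloud.ErrDefault404
	return errors.As(err, &badRequest) || errors.As(err, &notFound)
}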

@wwentland
Contributor

We've long known we've got a bunch of cleanup work to do here, but until now I've thought the goal would be:

* Failure before machine is Ready -> Failed
* Failure after machine is Ready -> Retry forever

To me this sounds like exactly the desired behaviour. The issue detailed in #1582 happens, AFAIUI, when the machine has been Ready before but does not recover from the error state once the network has been unavailable for an extended period of time.

I think this PR is motivated by the suspicion that the controller runs into the following code path when there is network instability, as instanceStatus might be nil even if the instance actually exists:

// Set an error message if we couldn't find the instance.
if instanceStatus == nil {
	err = errors.New("OpenStack instance not found")
	openStackMachine.SetFailure(capierrors.UpdateMachineError, err)
	conditions.MarkFalse(openStackMachine, infrav1.InstanceReadyCondition, infrav1.InstanceNotFoundReason, clusterv1.ConditionSeverityError, "")
	return ctrl.Result{}, nil
}

@lentzi90
Contributor

Exactly on point, @wwentland! 🎯
The issue we saw was a machine that was ready, then the connection to Nova was lost, which put it into an error state where it then stayed forever.

@lentzi90
Contributor

I'm taking this over since @mquhuy is gone for a bit longer (we are colleagues).

@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch 2 times, most recently from b4121ee to 407a95e Compare August 28, 2023 12:08
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Aug 28, 2023
@lentzi90
Contributor

@mdbooth @mkjpryor @mnaser any more comments?

@mdbooth
Contributor

mdbooth commented Aug 30, 2023

I think this PR is motivated by the suspicion that the controller runs into the following code path when there is network instability, as instanceStatus might be nil even if the instance actually exists:

// Set an error message if we couldn't find the instance.
if instanceStatus == nil {
	err = errors.New("OpenStack instance not found")
	openStackMachine.SetFailure(capierrors.UpdateMachineError, err)
	conditions.MarkFalse(openStackMachine, infrav1.InstanceReadyCondition, infrav1.InstanceNotFoundReason, clusterv1.ConditionSeverityError, "")
	return ctrl.Result{}, nil
}

I would prefer to fix "instanceStatus might be nil, even if the instance actually exists". That is certainly not the intention. If the instance exists but its status can't be fetched for some reason (I think a cell being down would be the only cause of this other than a regular bug, @stephenfin?), it should be an error, so we should not continue. If we are returning a nil instanceStatus with no error in any other circumstance, I would prefer to address that instead.

@mkjpryor

@mdbooth

This all sounds sensible. My one thought would be whether we have a timeout that results in a transition to Failed rather than retrying infinitely. This would help those of us building UIs on top of CAPI, where an infinitely retrying entity is quite hard for users to grok, especially non-technical ones…

@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch from 407a95e to c374567 Compare August 31, 2023 06:36
@lentzi90
Contributor

The whole point is that if we call SetFailure, that is a permanent condition that the Machine at CAPI level will pick up and never update. We should not do that for a simple network issue that could be resolved on the next reconciliation. I'm saying we should retry, because that is the only way to resolve this kind of situation automatically. The alternative is to have the user delete/patch the Machine and this is definitely madness for non-technical users.

Regarding the instanceStatus being nil, I'm a bit lost. I don't think there is a situation where the instance exists and the status is still nil. What can happen is that we manage to get the instances fine but find that the one we are looking for has not yet been created, so we try to create it. At that point we can run into network issues and end up with an error because we were unable to create it.

My suggestion is to only set conditions (not failure) for these two cases:

  1. We are unable to get the OpenStack instance
  2. We are unable to create the OpenStack instance

If we have specific cases that we know are permanent errors, then I'm fine with adding checks for them, but without checking the error I say we should treat it as temporary.
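
A minimal sketch of what that could look like for the first case, reusing the identifiers from the snippet quoted above; the requeue interval is an arbitrary placeholder and the time package is assumed to be imported:

// Record a condition and requeue instead of marking the machine failed,
// so a later reconciliation can recover automatically.
if instanceStatus == nil {
	conditions.MarkFalse(openStackMachine, infrav1.InstanceReadyCondition,
		infrav1.InstanceNotFoundReason, clusterv1.ConditionSeverityError,
		"OpenStack instance not found")
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}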

@lentzi90
Contributor

/test pull-cluster-api-provider-openstack-e2e-test

@jichenjc
Contributor

jichenjc commented Sep 1, 2023

We are unable to get the OpenStack instance
We are unable to create the OpenStack instance

Can we define this more clearly? E.g. there are many cases where we can say we can't create: the network connection may be temporarily down, there may be a parameter error, or something may be wrong on the server side (OpenStack returns a 500) but can be fixed soon, etc.
So we need to define this and document the desired behavior clearly.

@mdbooth
Contributor

mdbooth commented Sep 1, 2023

There are currently only 3 places where we call machine.SetFailure:

Presumably your issue is the result of one of these 3.

The first is this one. As noted above I think that's dead code and we should just delete it, but (assuming it's dead code) it's not your issue.

The last isn't your issue either because you're not deleting. AFAICT the only way that could happen is if OpenStack was returning us data we can't parse, which should never happen. It's probably not worth spending a lot of time thinking about it.

So I'm guessing your issue is the second one. At the end of a reconcile the machine is in the ERROR state. My take on this one is that we should only SetFailure() here if the machine has ever been ready.

@lentzi90
Contributor

lentzi90 commented Sep 1, 2023

Thanks @mdbooth that makes sense! I will try to reproduce the issue to double check this is indeed the case.

@lentzi90
Contributor

Alright, I now have some more information on this. It is not about the network (at least not between CAPO and OpenStack). The issue was that some part of OpenStack was down (I guess Nova?), which caused all servers to report error status. So to summarize:

  • The VMs had been ready
  • There was a temporary issue with the OpenStack environment that caused all servers to report error status
  • When the situation was resolved, all OpenStackMachines were stuck permanently in the failed state

Sounds to me like it was working as intended. From a user's point of view I can also understand the frustration that the Machines and OpenStackMachines stayed in the failed state, but I guess this is such a rare/unique situation that it may still be for the best.

@mdbooth
Contributor

mdbooth commented Sep 11, 2023

Alright, I now have some more information on this. It is not about the network (at least not between CAPO and OpenStack). The issue was that some part of OpenStack was down (I guess Nova?), which caused all servers to report error status. So to summarize:

  • The VMs had been ready
  • There was a temporary issue with the OpenStack environment that caused all servers to report error status
  • When the situation was resolved, all OpenStackMachines were stuck permanently in the failed state

Sounds to me like it was working as intended. From a user's point of view I can also understand the frustration that the Machines and OpenStackMachines stayed in the failed state, but I guess this is such a rare/unique situation that it may still be for the best.

I still think we can do better here. I don't think we should be setting the failed condition on a Machine which has had the opportunity to run a workload. How about we look to see if the Machine has a nodeRef? If it doesn't we set failed. If it does we retry forever and... do something else to make it really obvious there's a problem which needs looking at.

@lentzi90
Contributor

I still think we can do better here. I don't think we should be setting the failed condition on a Machine which has had the opportunity to run a workload. How about we look to see if the Machine has a nodeRef? If it doesn't we set failed. If it does we retry forever and... do something else to make it really obvious there's a problem which needs looking at.

Sounds good, I will give it a go!
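
A rough sketch of what gating the failure on the nodeRef could look like, reusing names from the snippet quoted earlier and assuming the CAPI Machine is available as machine in the reconcile scope; the merged change may differ in detail:

if instanceStatus == nil {
	err := errors.New("OpenStack instance not found")
	conditions.MarkFalse(openStackMachine, infrav1.InstanceReadyCondition,
		infrav1.InstanceNotFoundReason, clusterv1.ConditionSeverityError, "")
	if machine.Status.NodeRef == nil {
		// No nodeRef: the instance never joined the cluster, so this is
		// most likely a configuration error that will not go away.
		openStackMachine.SetFailure(capierrors.UpdateMachineError, err)
	}
	// If there is a nodeRef, the instance was working at some point, so
	// the error could well be temporary: no failure is set and the next
	// reconciliation can recover.
	return ctrl.Result{}, nil
}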

@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch from c374567 to 24f314e Compare September 13, 2023 11:38
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 13, 2023
@lentzi90 lentzi90 changed the title Do not set failure when unable to get Openstack instance Set failure only on instance error when no nodeRef Sep 13, 2023
@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch from 24f314e to abd6644 Compare September 13, 2023 11:41
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2023
@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch from abd6644 to c889efe Compare September 13, 2023 12:27
Setting a failure message is permanent and requires user action to
clear. We cannot know whether the error is permanent, though. The idea
with this commit is that if the machine has a nodeRef, we know that the
instance was working at some point, so the error could well be
temporary; therefore we do not set failure in this case.
On the other hand, if the machine does not have a nodeRef, we set
failure, since it is more likely that there is some configuration error
that will not go away.

Remove dead code and improve logging slightly in similar cases.
@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch from c889efe to 1347c00 Compare September 14, 2023 08:29
Contributor

@mdbooth mdbooth left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 14, 2023
@lentzi90
Contributor

/approve
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lentzi90, mquhuy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 14, 2023
@k8s-ci-robot k8s-ci-robot merged commit 1aeacb7 into kubernetes-sigs:main Sep 14, 2023
@lentzi90 lentzi90 deleted the mquhuy/not-set-failure-when-openstack-not-found branch September 14, 2023 12:21
