
Conversation

@mquhuy

@mquhuy mquhuy commented Aug 9, 2023

What this PR does / why we need it:

Setting a failure message is permanent and requires user action to
clear. We cannot know whether the error is permanent, though. The idea
with this commit is that if the machine has a nodeRef, we know that the
instance was working at some point, so the error could well be
temporary; therefore we do not set failure in this case.
On the other hand, if the machine does not have a nodeRef, we set
failure, since it is more likely that there is some configuration error
that will not go away.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1582. For more details, please read my comment on that issue.

TODOs:

  • squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 9, 2023
@linux-foundation-easycla

linux-foundation-easycla bot commented Aug 9, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: lentzi90 / name: Lennart Jern (1347c00)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Aug 9, 2023
@netlify

netlify bot commented Aug 9, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name | Link
🔨 Latest commit | 1347c00
🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/6502c46283d51b0008d8a133
😎 Deploy Preview | https://deploy-preview-1637--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot
Contributor

Welcome @mquhuy!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-openstack 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-openstack has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 9, 2023
@k8s-ci-robot
Contributor

Hi @mquhuy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 9, 2023
@mquhuy
Author

mquhuy commented Aug 9, 2023

/EasyCLA
/pull-cluster-api-provider-openstack-build
/pull-cluster-api-provider-openstack-test

@lentzi90
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 14, 2023
@wwentland
Contributor

Thank you, @mquhuy! Could you be so kind as to look into the status of your CLA again?

@lentzi90
Contributor

He is on vacation this week. I will check with him when he is back 🙂

@mdbooth
Contributor

mdbooth commented Aug 17, 2023

We've long known we've got a bunch of cleanup work to do here, but until now I've thought the goal would be:

  • Failure before machine is Ready -> Failed
  • Failure after machine is Ready -> Retry forever

My thinking is that if the machine has not become Ready then it has no workload to worry about, so you might as well delete it. I'm prepared to be challenged on this thinking, though.

A concern I have is that this would mask some errors which probably shouldn't be retried, e.g. specifying an invalid resource. Some of these things we (sometimes) validate prior to server creation, but ultimately we rely on Nova to return us an error if, e.g. the user specified an invalid image ID. Not putting the machine into a failed state in this situation makes a configuration error much harder to find.

If we were going to ignore creation failure like this, at a minimum I'd like to work out precisely which errors are ignorable. We should also look at Nova's return value for server create to work out if it distinguishes between an invalid request and a transient error. It probably does, and if so we should distinguish between those two. Ideally we should ignore transient errors and retry, but we should not ignore invalid requests. If we can't do the work to distinguish between the two yet, my preference would probably be to leave it as-is and make all errors prior to machine Ready fatal. You can still just delete the machine.

Again, this is just my thinking, though. I'm very much prepared to have it challenged by real-world experience.

Interested to know if @mkjpryor or @mnaser have an opinion.
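
A minimal sketch of the kind of error classification discussed above, assuming gophercloud v1 error types; the helper name and the exact set of status codes treated as fatal are placeholders for discussion, not something this PR adds:

import (
	"errors"

	"github.com/gophercloud/gophercloud"
)

// isInvalidRequest reports whether err looks like a request that Nova
// rejected outright (a 4xx response), which retrying will not fix.
// Anything else (timeouts, 5xx responses, connection failures) would be
// treated as potentially transient.
func isInvalidRequest(err error) bool {
	var badRequest gophercloud.ErrDefault400
	var notFound gophercloud.ErrDefault404
	return errors.As(err, &badRequest) || errors.As(err, &notFound)
}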

@wwentland
Contributor

We've long known we've got a bunch of cleanup work to do here, but until now I've thought the goal would be:

* Failure before machine is Ready -> Failed
* Failure after machine is Ready -> Retry forever

To me this sounds like exactly the desired behaviour. The issue detailed in #1582 happens, AFAIUI, when the machine has been Ready before but does not recover from the error state once the network has been unavailable for an extended period of time.

I think this PR is motivated by the suspicion that the controller runs into the following code path when there is network instability, as instanceStatus might be nil even if the instance actually exists:

// Set an error message if we couldn't find the instance.
if instanceStatus == nil {
	err = errors.New("OpenStack instance not found")
	openStackMachine.SetFailure(capierrors.UpdateMachineError, err)
	conditions.MarkFalse(openStackMachine, infrav1.InstanceReadyCondition, infrav1.InstanceNotFoundReason, clusterv1.ConditionSeverityError, "")
	return ctrl.Result{}, nil
}

@lentzi90
Contributor

Exactly on point, @wwentland! 🎯
The issue we saw was a machine that was ready, then the connection to Nova was lost, which put it into an error state where it then stayed forever.

@lentzi90
Contributor

I'm taking this over since @mquhuy is gone for a bit longer (we are colleagues).

@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch 2 times, most recently from b4121ee to 407a95e Compare August 28, 2023 12:08
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Aug 28, 2023
@lentzi90
Contributor

@mdbooth @mkjpryor @mnaser any more comments?

@mdbooth
Contributor

mdbooth commented Aug 30, 2023

I think this PR is motivated by the suspicion that the controller runs into the following code path when there is network instability, as instanceStatus might be nil even if the instance actually exists:

// Set an error message if we couldn't find the instance.
if instanceStatus == nil {
	err = errors.New("OpenStack instance not found")
	openStackMachine.SetFailure(capierrors.UpdateMachineError, err)
	conditions.MarkFalse(openStackMachine, infrav1.InstanceReadyCondition, infrav1.InstanceNotFoundReason, clusterv1.ConditionSeverityError, "")
	return ctrl.Result{}, nil
}

I would prefer to fix "instanceStatus might be nil, even if the instance actually exists". That is certainly not the intention. If the instance exists but its status can't be fetched for some reason (I think a cell being down would be the only cause of this other than a regular bug, @stephenfin?), it should be an error, so we should not continue. If we are returning a nil instanceStatus with no error in any other circumstance, I would prefer to address that instead.

@mkjpryor

@mdbooth

This all sounds sensible. My one thought would be whether we have a timeout that results in a transition to Failed rather than retrying infinitely. This would help those of us building UIs on top of CAPI, where an infinitely retrying entity is quite hard for users to grok, especially non-technical ones…

@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch from 407a95e to c374567 Compare August 31, 2023 06:36
@lentzi90
Contributor

The whole point is that if we call SetFailure, that is a permanent condition that the Machine at CAPI level will pick up and never update. We should not do that for a simple network issue that could be resolved on the next reconciliation. I'm saying we should retry, because that is the only way to resolve this kind of situation automatically. The alternative is to have the user delete/patch the Machine and this is definitely madness for non-technical users.

Regarding the instanceStatus being nil, I'm a bit lost. I don't think there is a situation where the instance exists and the status is still nil. What can happen is that we manage to get the instances fine but find that the one we are looking for has not yet been created, so we try to create it. At that point we can run into network issues and end up with an error because we were unable to create it.

My suggestion is to only set conditions (not failure) for these two cases:

  1. We are unable to get the OpenStack instance
  2. We are unable to create the OpenStack instance

If we have specific cases that we know are permanent errors, then I'm fine with adding checks for them, but without checking the error I say we should treat it as temporary.
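
A minimal sketch of what that could look like for the first case, reusing the identifiers from the snippet quoted above; the requeue interval is an arbitrary placeholder and the time package is assumed to be imported:

// Record a condition and requeue instead of marking the machine failed,
// so a later reconciliation can recover automatically.
if instanceStatus == nil {
	conditions.MarkFalse(openStackMachine, infrav1.InstanceReadyCondition,
		infrav1.InstanceNotFoundReason, clusterv1.ConditionSeverityError,
		"OpenStack instance not found")
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}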

@lentzi90
Contributor

/test pull-cluster-api-provider-openstack-e2e-test

@jichenjc
Contributor

jichenjc commented Sep 1, 2023

We are unable to get the OpenStack instance
We are unable to create the OpenStack instance

Can we define this more clearly? E.g. there are many cases where we can say we can't create: the network connection may be temporarily down, there may be a parameter error, or something may be wrong on the server side (OpenStack returns a 500) but can be fixed soon, etc.
So we need to define this and document the desired behavior clearly.

@mdbooth
Contributor

mdbooth commented Sep 1, 2023

There are currently only 3 places where we call machine.SetFailure:

Presumably your issue is the result of one of these 3.

The first is this one. As noted above I think that's dead code and we should just delete it, but (assuming it's dead code) it's not your issue.

The last isn't your issue either because you're not deleting. AFAICT the only way that could happen is if OpenStack was returning us data we can't parse, which should never happen. It's probably not worth spending a lot of time thinking about it.

So I'm guessing your issue is the second one. At the end of a reconcile the machine is in the ERROR state. My take on this one is that we should only SetFailure() here if the machine has ever been ready.

@lentzi90
Contributor

lentzi90 commented Sep 1, 2023

Thanks @mdbooth that makes sense! I will try to reproduce the issue to double check this is indeed the case.

@lentzi90
Contributor

Alright, I now have some more information on this. It is not about the network (at least not between CAPO and OpenStack). The issue was that some part of OpenStack was down (I guess Nova?), which caused all servers to report error status. So to summarize:

  • The VMs had been ready
  • There was a temporary issue with the OpenStack environment that caused all servers to report error status
  • When the situation was resolved, all OpenStackMachines were stuck permanently in the failed state

Sounds to me like it was working as intended. From a user's point of view I can also understand the frustration that the Machines and OpenStackMachines stayed in the failed state, but I guess this is such a rare/unique situation that it may still be for the best.

@mdbooth
Contributor

mdbooth commented Sep 11, 2023

Alright, I now have some more information on this. It is not about the network (at least not between CAPO and OpenStack). The issue was that some part of OpenStack was down (I guess Nova?), which caused all servers to report error status. So to summarize:

  • The VMs had been ready
  • There was a temporary issue with the OpenStack environment that caused all servers to report error status
  • When the situation was resolved, all OpenStackMachines were stuck permanently in the failed state

Sounds to me like it was working as intended. From a user's point of view I can also understand the frustration that the Machines and OpenStackMachines stayed in the failed state, but I guess this is such a rare/unique situation that it may still be for the best.

I still think we can do better here. I don't think we should be setting the failed condition on a Machine which has had the opportunity to run a workload. How about we look to see if the Machine has a nodeRef? If it doesn't we set failed. If it does we retry forever and... do something else to make it really obvious there's a problem which needs looking at.

@lentzi90
Contributor

I still think we can do better here. I don't think we should be setting the failed condition on a Machine which has had the opportunity to run a workload. How about we look to see if the Machine has a nodeRef? If it doesn't we set failed. If it does we retry forever and... do something else to make it really obvious there's a problem which needs looking at.

Sounds good, I will give it a go!
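
A rough sketch of what gating the failure on the nodeRef could look like, reusing names from the snippet quoted earlier and assuming the CAPI Machine is available as machine in the reconcile scope; the merged change may differ in detail:

if instanceStatus == nil {
	err := errors.New("OpenStack instance not found")
	conditions.MarkFalse(openStackMachine, infrav1.InstanceReadyCondition,
		infrav1.InstanceNotFoundReason, clusterv1.ConditionSeverityError, "")
	if machine.Status.NodeRef == nil {
		// No nodeRef: the instance never joined the cluster, so this is
		// most likely a configuration error that will not go away.
		openStackMachine.SetFailure(capierrors.UpdateMachineError, err)
	}
	// If there is a nodeRef, the instance was working at some point, so
	// the error could well be temporary: no failure is set and the next
	// reconciliation can recover.
	return ctrl.Result{}, nil
}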

@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch from c374567 to 24f314e Compare September 13, 2023 11:38
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 13, 2023
@lentzi90 lentzi90 changed the title Do not set failure when unable to get Openstack instance Set failure only on instance error when no nodeRef Sep 13, 2023
@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch from 24f314e to abd6644 Compare September 13, 2023 11:41
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2023
@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch from abd6644 to c889efe Compare September 13, 2023 12:27
Setting a failure message is permanent and requires user action to
clear. We cannot know whether the error is permanent, though. The idea
with this commit is that if the machine has a nodeRef, we know that the
instance was working at some point, so the error could well be
temporary; therefore we do not set failure in this case.
On the other hand, if the machine does not have a nodeRef, we set
failure, since it is more likely that there is some configuration error
that will not go away.

Remove dead code and improve logging slightly in similar cases.
@lentzi90 lentzi90 force-pushed the mquhuy/not-set-failure-when-openstack-not-found branch from c889efe to 1347c00 Compare September 14, 2023 08:29
Contributor

@mdbooth mdbooth left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 14, 2023
@lentzi90
Contributor

/approve
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lentzi90, mquhuy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 14, 2023
@k8s-ci-robot k8s-ci-robot merged commit 1aeacb7 into kubernetes-sigs:main Sep 14, 2023
@lentzi90 lentzi90 deleted the mquhuy/not-set-failure-when-openstack-not-found branch September 14, 2023 12:21
