Skip to content

Conversation

huww98
Copy link

@huww98 huww98 commented Aug 17, 2025

  • One-line PR description: add error handling strict mode

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 17, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huww98
Once this PR has been reviewed and has the lgtm label, please assign saad-ali for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from saad-ali August 17, 2025 14:55
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Aug 17, 2025
@k8s-ci-robot k8s-ci-robot requested a review from xing-yang August 17, 2025 14:55
@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 17, 2025
* When there is an error, we will only retry with parameters of A or B, not allowing any other target.
User can only change the target after either the modify or the rollback succeeded.
* if the `currentVolumeAttributesClass` is nil, we don't allow rollback.
Instead, we allow change target if the error is infeasible in this case, to allow user to correct some typos easily.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure this one will not break the quota integrity. We will need further restriction in CSI spec:

- The SP MUST NOT have applied any modification to the volume as part of this specific call
+ The SP MUST NOT have applied any modification to the volume as part of this specific call or any consecutive previous calls with the same parameters.

But this also troubles me because it means whenever SP updates it logic, it will need to consider capability to any previous version, since there may be some volumes stuck at error state for years. I know this is a very edge case, but without that, we cannot make a 100% promise even in strict mode.

switch spec {
case target, "": return true
case cur: return false
default: return status != "Infeasible"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic may requires more discussion. When current is A, target is B, spec is C, where should we go? Going to C is not allowed because it may break quota.
My current proposal is going to A if ModifyVolume(B) returns infeasible, otherwise continue to retry B, so that it will be more likely to automatically converge to the user specified state.
But maybe we should always retry B to make the system more predictable.

One can verify this contains all states by arbitrarily changing the spec and verify it will still hit a listed state.

For [strict mode](#strict-mode), the implementation is different:
* We only leave the state `cur == A && target == B` when either modify to A or modify to B success.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The major difference to #5462 . I propose NOT to set target to A before rolling back, to keep tracking B if rollback still fails and maintain quota integrity.

@huww98
Copy link
Author

huww98 commented Aug 17, 2025

/cc @gnufied

@k8s-ci-robot k8s-ci-robot requested a review from gnufied August 17, 2025 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants