11# KEP-961: Implement maxUnavailable in StatefulSet
22
3-
43<!--
54This is the title of your KEP. Keep it short, simple, and descriptive. A good
65title can help communicate what the KEP is and should be considered as part of
@@ -19,23 +18,25 @@ tags, and then generate with `hack/update-toc.sh`.
1918
2019<!-- toc -->
2120- [Release Signoff Checklist](#release-signoff-checklist)
21+ - [Table of Contents](#table-of-contents)
2222- [Summary](#summary)
2323- [Motivation](#motivation)
2424 - [Goals](#goals)
2525 - [Non-Goals](#non-goals)
2626- [Proposal](#proposal)
27- - [User Stories (Optional)](#user-stories-optional)
27+ - [User Stories](#user-stories)
2828 - [Story 1](#story-1)
29- - [Story 2](#story-2)
3029 - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
3130 - [Risks and Mitigations](#risks-and-mitigations)
3231- [Design Details](#design-details)
32+ - [Implementation Details](#implementation-details)
33+ - [API Changes](#api-changes)
34+ - [Implementation](#implementation)
3335 - [Test Plan](#test-plan)
3436 - [Prerequisite testing updates](#prerequisite-testing-updates)
35- - [Unit tests](#unit-tests)
36- - [Integration tests](#integration-tests)
37- - [e2e tests](#e2e-tests)
38- - [Graduation Criteria](#graduation-criteria)
37+ - [Tests](#tests)
38+ - [Test Plan](#test-plan-1)
39+ - [Graduation Criteria](#graduation-criteria)
3940 - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
4041 - [Version Skew Strategy](#version-skew-strategy)
4142- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -206,7 +207,7 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion
206207and make progress.
207208-->
208209
209- N/A
210+ None.
210211
211212## Proposal
212213
458459- maxUnavailable greater than 1 with partition and maxUnavailable greater than replicas
459460
460461#### Test Plan
461- For `Alpha`, unit tests and e2e tests will be added to test functionality at both
462+
463+ For `Alpha`, unit tests and integration tests will be added to test functionality at both
462464with feature flag enabled and disabled. Defaults will be verified so that users
463- who donot set this flag are not surprised at all.
465+ who do not set this flag are not surprised at all.
466+
467+ For `Beta`, e2e tests will be added.
464468
465469## Graduation Criteria
466470
@@ -604,11 +608,16 @@ maxUnavailable to a number greater than 1, but the invariants and the logic wil
604608maxUnavailable pods with the same identity and never more than maxUnavailable being deleted.
605609
606610###### What specific metrics should inform a rollback?
607- TODO when we reach Beta
611+
612+ When the feature is enabled but a rolling update behaves unexpectedly, for example the number of pods updated at a time is not equal to the
613+ `maxUnavailable` value, or the rolling update proceeds in an unexpected order.
614+
615+ Alternatively, observe the `rolling-update-duration` metric: if it did not decrease after setting `maxUnavailable`
616+ greater than 1, or if the duration increased abnormally, then we should roll back.
608617
609618###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
610- Will be tested when graduating to Beta.
611619
620+ No, but it will be tested manually before merging the PR.
612621
613622###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
614623No
@@ -623,32 +632,43 @@ The below command should show maxUnavailable value:
623632kubectl get statefulsets -o yaml | grep maxUnavailable
624633```
625634
635+ Alternatively, check that the new metric `rolling-update-duration` exists.
636+
626637###### How can someone using this feature know that it is working for their instance?
627- TODO when we reach Beta
638+
639+ With the feature enabled, set `maxUnavailable` greater than 1 and observe the number of pods updated at a time during a rolling update;
640+ it should equal `maxUnavailable`.
641+ Alternatively, when `maxUnavailable` is set greater than 1, the `rolling-update-duration` should decrease.
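
The first check above can be sketched as follows. This is a minimal illustration, not controller code: `checkRollout` and `peak` are hypothetical helpers, and the sample values are assumed to be counts of unavailable pods that an operator collected while the rolling update was in progress.

```go
package main

import "fmt"

// checkRollout verifies that no sampled count of unavailable pods
// exceeds maxUnavailable. The samples are assumed to have been gathered
// by the operator during a rolling update (hypothetical input).
func checkRollout(samples []int, maxUnavailable int) error {
	if maxUnavailable < 1 {
		return fmt.Errorf("maxUnavailable must be at least 1, got %d", maxUnavailable)
	}
	for i, s := range samples {
		if s > maxUnavailable {
			return fmt.Errorf("sample %d: %d pods unavailable, limit is %d", i, s, maxUnavailable)
		}
	}
	return nil
}

// peak returns the largest sampled count, which should reach
// maxUnavailable if the feature is actually taking effect.
func peak(samples []int) int {
	p := 0
	for _, s := range samples {
		if s > p {
			p = s
		}
	}
	return p
}

func main() {
	samples := []int{0, 2, 2, 1, 2, 0}
	if err := checkRollout(samples, 2); err != nil {
		fmt.Println("violation:", err)
		return
	}
	fmt.Println("peak unavailable:", peak(samples)) // peak unavailable: 2
}
```

A peak that stays at 1 while `maxUnavailable` is greater than 1 would suggest the feature gate is not enabled.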
628642
629643###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
630644
645+ This feature has little relevance to SLOs, although a rolling update that proceeds at a very low speed can impact the running services.
646+
631647###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
632648
649+ None.
650+
633651###### Are there any missing metrics that would be useful to have to improve observability of this feature?
634652
653+ None.
654+
635655### Dependencies
636656
637657###### Does this feature depend on any specific services running in the cluster?
638- NA
658+ No.
639659
640660### Scalability
641661
642662###### Will enabling / using this feature result in any new API calls?
643- It doesnt make any extra API calls.
663+
664+ It doesn't make any extra API calls.
644665
645666###### Will enabling / using this feature result in introducing new API types?
646667No
647668
648669###### Will enabling / using this feature result in any new calls to the cloud provider?
649670No
650671
651-
652672###### Will enabling / using this feature result in increasing size or count of the existing API objects?
653673A struct gets added to every StatefulSet object which has three fields: a type discriminator, one 32-bit integer, and one field of type string.
654674The struct in question is IntOrString.
@@ -661,25 +681,49 @@ The controller-manager will see very negligible and almost un-notoceable increas
661681
662682###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
663683
684+ No.
685+
664686### Troubleshooting
665687
666688###### How does this feature react if the API server and/or etcd is unavailable?
667689The RollingUpdate will fail or will not be able to proceed if etcd or apiserver is unavailable and
669691hence this feature will also not be usable.
669691
670692###### What are other known failure modes?
671- NA
693+
694+ <!--
695+ For each of them, fill in the following information by copying the below template:
696+ - [Failure mode brief description]
697+ - Detection: How can it be detected via metrics? Stated another way:
698+ how can an operator troubleshoot without logging into a master or worker node?
699+ - Mitigations: What can be done to stop the bleeding, especially for already
700+ running user workloads?
701+ - Diagnostics: What are the useful log messages and their required logging
702+ levels that could help debug the issue?
703+ Not required until feature graduated to beta.
704+ - Testing: Are there any tests for failure mode? If not, describe why.
705+ -->
706+
707+ In a multi-master setup, when the CCM versions in the cluster are skewed, the behavior may differ.
708+
709+ - [Failure mode brief description]
710+ - Detection: the `rolling-update-duration` did not decrease after setting `maxUnavailable` greater than 1, or it increased abnormally.
711+ - Mitigations: Disable the feature.
712+ - Diagnostics: Set the logger level greater than 4.
713+ - Testing: No testing, because the rolling update duration is hard to measure; it can be impacted by many things,
714+ such as the master performance.
672715
673716###### What steps should be taken if SLOs are not being met to determine the problem?
674717
675718## Implementation History
676719
677720- KEP Started on 1/1/2019
678721- Implementation PR and UT by 8/30
722+ - Bump to beta at 2023-05-11
679723
680724## Drawbacks
681725
682- NA
726+ None.
683727
684728## Alternatives
685729
@@ -689,4 +733,6 @@ section.
689733- Another alternative would be to use OnDelete and deploy your own Custom Controller on top of StatefulSet Pods. There you can implement
690734your own logic for deleting more than one pod in a specific order. This requires more work from the user but gives them ultimate flexibility.
691735
692- ## Infrastructure Needed (Optional)
736+ ## Infrastructure Needed (Optional)
737+
738+ No.