Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [x] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [x] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [x] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
#### Story 1

As a user of Kubernetes, I should be able to update my StatefulSet more than one Pod at a time, in a
RollingUpdate manner, if my stateful application can tolerate more than one Pod being down, thus allowing my
update to finish much faster.
With the `maxUnavailable` feature enabled, we will bring down more than one Pod at a time during a rolling update. If your
application cannot tolerate this behavior, you should disable the feature or leave the field unconfigured, which keeps the
default behavior of updating one Pod at a time.
We'll add a new metric named `rolling-update-duration-seconds` that tracks how long a StatefulSet takes to finish a rolling
update. The general logic is:

When [performing an update](https://github.com/kubernetes/kubernetes/blob/c984d53b31655924b87a57bfd4d8ff90aaeab9f8/pkg/controller/statefulset/stateful_set_control.go#L97-L138),
if `currentRevision` != `updateRevision`, we treat the StatefulSet as having started a rolling update. Because a rolling update
cannot finish within a single reconcile loop, we have to track it across loops. There are two approaches:

- One is to add a new field to `defaultStatefulSetControl`, as below:

```golang
type defaultStatefulSetControl struct {
	podControl        *StatefulPodControl
	statusUpdater     StatefulSetStatusUpdaterInterface
	controllerHistory history.Interface
	recorder          record.EventRecorder

	// <Newly Added>
	// rollingUpdateStartTimes records when each StatefulSet started its rolling update.
	// The key is the StatefulSet name, the value is the start time.
	rollingUpdateStartTimes map[string]time.Time
}
```

Then, when `currentRevision` != `updateRevision`, we check whether the StatefulSet name already exists in
`rollingUpdateStartTimes`; if it does, we skip it, otherwise we record the start time.

Once we find that the rolling update has completed in
[updateStatefulSetStatus](https://github.com/kubernetes/kubernetes/blob/c984d53b31655924b87a57bfd4d8ff90aaeab9f8/pkg/controller/statefulset/stateful_set_control.go#L682-L701),
we report the metric there and remove the StatefulSet from `rollingUpdateStartTimes`.

- Another approach is to add a new field to `StatefulSetStatus`, as below:

```golang
type StatefulSetStatus struct {
	// rollingUpdateStartTime records when the StatefulSet started its rolling update.
	rollingUpdateStartTime *time.Time
}
```

The general logic looks the same, but we update the start time together with the regular status update, so there are no
extra API calls.

We prefer option 2, since we already record rolling-update information in the status.
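
To make the observation point concrete, here is a minimal sketch of how the histogram could be registered and observed.
The label set, bucket layout, and helper names below are illustrative assumptions, not part of the KEP; the in-tree
controller would register through `k8s.io/component-base/metrics` rather than the raw Prometheus client, and the exposed
Prometheus name would use underscores (`rolling_update_duration_seconds`), since hyphens are not valid in metric names.

```golang
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative only: the label set and buckets below are assumptions.
var rollingUpdateDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Subsystem: "statefulset",
		Name:      "rolling_update_duration_seconds",
		Help:      "Seconds taken for a StatefulSet to finish a rolling update.",
		Buckets:   prometheus.ExponentialBuckets(15, 2, 12), // ~15s up to a few hours
	},
	[]string{"namespace", "name"},
)

func init() {
	prometheus.MustRegister(rollingUpdateDuration)
}

// observeRollingUpdate sketches the call made from updateStatefulSetStatus once the
// controller sees the rolling update complete (option 2 above): the start time is read
// from status, the duration is observed, and the status field is cleared in the same
// status update.
func observeRollingUpdate(namespace, name string, startTime *time.Time) {
	if startTime == nil {
		return // no rolling update in progress
	}
	rollingUpdateDuration.WithLabelValues(namespace, name).Observe(time.Since(*startTime).Seconds())
}
```
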
### Test Plan
Testcases:

- maxUnavailable greater than 1 with partition and staged pods greater than maxUnavailable
- maxUnavailable greater than 1 with partition and maxUnavailable greater than replicas

New testcases being added:

- New metric `rolling-update-duration-seconds` should calculate the duration correctly (a rough unit-test sketch follows below).
- Feature enablement/disablement test
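
A minimal sketch of the first new testcase, assuming the illustrative `observeRollingUpdate` helper and
`rollingUpdateDuration` histogram from the design sketch above; the real test would live next to the StatefulSet
controller metrics and use the project's metrics test utilities.

```golang
package metrics

import (
	"testing"
	"time"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestRollingUpdateDurationSeconds(t *testing.T) {
	// Pretend the rolling update started 30 seconds ago.
	start := time.Now().Add(-30 * time.Second)
	observeRollingUpdate("default", "web", &start)

	// Exactly one series should have been recorded for the (namespace, name) pair.
	if got := testutil.CollectAndCount(rollingUpdateDuration, "statefulset_rolling_update_duration_seconds"); got != 1 {
		t.Fatalf("expected 1 recorded series, got %d", got)
	}
}
```
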
Coverage:

- `pkg/apis/apps/v1`: `2023-05-26` - `71.7%`
There are unit tests which make sure the field is correctly dropped
on feature enablement and disablement; see the [strategy tests](https://github.com/kubernetes/kubernetes/blob/23698d3e9f4f3b9738ba5a6fcefd17894a00624f/pkg/registry/apps/statefulset/strategy_test.go#L391-L417).

A feature enablement/disablement test will also be added when graduating to Beta, similar to [TestStatefulSetStartOrdinalEnablement](https://github.com/kubernetes/kubernetes/blob/23698d3e9f4f3b9738ba5a6fcefd17894a00624f/pkg/registry/apps/statefulset/strategy_test.go#L473).
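
For illustration, a rough sketch of what such an enablement/disablement test could look like, modeled loosely on the
linked strategy tests. The feature-gate name is assumed to be `features.MaxUnavailableStatefulSet`, and
`newStatefulSetWithMaxUnavailable` is a hypothetical helper; the real wiring should follow the existing tests in
`pkg/registry/apps/statefulset`.

```golang
package statefulset

import (
	"fmt"
	"testing"

	genericapirequest "k8s.io/apiserver/pkg/endpoints/request"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	featuregatetesting "k8s.io/component-base/featuregate/testing"
	"k8s.io/kubernetes/pkg/features"
)

func TestMaxUnavailableEnablement(t *testing.T) {
	for _, enabled := range []bool{true, false} {
		t.Run(fmt.Sprintf("gate enabled=%v", enabled), func(t *testing.T) {
			featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.MaxUnavailableStatefulSet, enabled)

			// newStatefulSetWithMaxUnavailable is a hypothetical helper that builds a
			// StatefulSet whose rollingUpdate strategy sets maxUnavailable to 2.
			sts := newStatefulSetWithMaxUnavailable(2)
			Strategy.PrepareForCreate(genericapirequest.NewDefaultContext(), sts)

			// With the gate disabled, the field should be dropped on create.
			kept := sts.Spec.UpdateStrategy.RollingUpdate != nil &&
				sts.Spec.UpdateStrategy.RollingUpdate.MaxUnavailable != nil
			if kept != enabled {
				t.Errorf("maxUnavailable retained=%v, want %v", kept, enabled)
			}
		})
	}
}
```
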
### Rollout, Upgrade and Rollback Planning
###### How can a rollout or rollback fail? Can it impact already running workloads?
It depends:

- In an HA cluster, if some of the API servers have this feature gate enabled and others do not, already running workloads
  will not be impacted during a rolling update, because the process is controlled by the StatefulSet controller. For newly
  created StatefulSets, however, the `maxUnavailable` field will be dropped if the request happens to hit an API server
  with the feature gate disabled.
- If the feature is enabled and a StatefulSet is in the middle of a rolling update when we disable the feature gate, that
  StatefulSet will be impacted: it falls back to the default behavior of updating one Pod at a time.
###### What specific metrics should inform a rollback?
- As a normal user, with `maxUnavailable` enabled, when a StatefulSet is being rolling-updated, the number of Pods brought
  down at a time should equal the `maxUnavailable` value; if it does not, we should roll back.
- As an administrator (and also as a normal user), I can check the new metric `rolling-update-duration-seconds`: if we
  enabled the feature with `maxUnavailable` greater than 1 for our StatefulSets but the duration does not drop, we may have
  a problem. This is not a precise signal, because the rolling-update duration is affected by several factors (e.g. pulling
  a large new image), but it can serve as an indicator.
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
No, but it will be tested manually before merging the PR.
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
### Monitoring Requirements
###### How can an operator determine if the feature is in use by workloads?
If their StatefulSet's rollingUpdate section has the field `maxUnavailable` specified with
a value different than 1.
The command below should show the maxUnavailable value:

```
kubectl get statefulsets -o yaml | grep maxUnavailable
```

Or refer to the new metric `rolling-update-duration-seconds`; it should exist.
###### How can someone using this feature know that it is working for their instance?
With the feature enabled, set `maxUnavailable` greater than 1 and pay attention to how many Pods are updated at a time;
it should equal the `maxUnavailable` value.
Or refer to `rolling-update-duration-seconds`, which can give a general indication: when `maxUnavailable` is set
greater than 1, the duration should go down.
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
Rolling-update duration is affected by many factors, such as container hooks, image size, network, and the containers'
own logic, so we can only reason about each individual StatefulSet. Roughly speaking, if we set `maxUnavailable` to X,
the rolling-update duration should shrink by about a factor of X; for example, a rollout that takes 10 minutes one Pod
at a time should take roughly 5 minutes with `maxUnavailable: 2`. Keep in mind that this is not precise.
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- [x] Metrics
  - Component exposing the metric: kube-controller-manager