Commit e9aa5cc

Support for per-zone PDB (#253)
Support for per-zone PDB. Includes:

* validating webhook configuration to handle pod evictions
* validating webhook to validate ZPDB configurations
* custom resource definition for ZPDB configurations
* eviction handler for enforcing a zone aware and partition aware PDB

Co-authored-by: Andy Asp <[email protected]>
1 parent cc9aaff commit e9aa5cc

File tree: 128 files changed, +40051 −43 lines


CHANGELOG.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -16,6 +16,7 @@
 * `k8s.io/apimachinery` from `v0.33.1` to `v0.33.3`
 * `k8s.io/client-go` from `v0.33.1` to `v0.33.3`
 * [ENHANCEMENT] Automatically patch new validating and mutating rollout-operator webhooks with the self-signed CA if they are created after rollout-operator starts. #262
+* [ENHANCEMENT] Support for zone and partition aware pod disruption budgets, enabling finer control over pod eviction policies. #253
 * [BUGFIX] Always configure HTTP client with a timeout. #240
 * [BUGFIX] Use a StatefulSet's `.spec.serviceName` when constructing the delayed downscale endpoint for a pod. #258
```

README.md

Lines changed: 109 additions & 0 deletions
```diff
@@ -132,6 +132,15 @@ Prometheus metrics endpoint.
 
 Offers a `ValidatingAdmissionWebhook` that rejects the requests that decrease the number of replicas in objects labeled as `grafana.com/no-downscale: true`. See [Webhooks](#webhooks) section below.
 
+#### `/pods/eviction`
+
+Offers a `ValidatingAdmissionWebhook` which can apply a `ZoneAwarePodDisruptionBudget` to govern voluntary pod evictions. See the [ZoneAwarePodDisruptionBudget](#zoneawarepoddisruptionbudget-zpdb) section below.
+
+#### `/admission/zpdb-validation`
+
+Offers a `ValidatingAdmissionWebhook` that validates `ZoneAwarePodDisruptionBudget` configurations and rejects any misconfigured resource.
+
 ### RBAC
 
 When running the `rollout-operator` as a pod, it needs a Role with at least the following privileges:
```
````diff
@@ -163,6 +172,14 @@ rules:
   - statefulsets/status
   verbs:
   - update
+- apiGroups:
+  - rollout-operator.grafana.com
+  resources:
+  - zoneawarepoddisruptionbudgets
+  verbs:
+  - get
+  - list
+  - watch
 ```
 
 (Please see [Webhooks](#webhooks) section below for extra roles required when using the HTTPS server for webhooks.)
````
````diff
@@ -415,3 +432,95 @@ subjects:
 
 Whenever the certificate expires, the `rollout-operator` will detect it and will restart, which will trigger the self-signed certificate generation again if it's configured.
 The default expiration for the self-signed certificate is 1 year and it can be changed by setting the flag `-server-tls.self-signed-cert.expiration`.
+
+# ZoneAwarePodDisruptionBudget (ZPDB)
+
+A custom `PodDisruptionBudget` is available for use with the `rollout-operator`.
+
+It is intended for `StatefulSets` which span multiple logical zones, and allows the budget to be evaluated against the pods in other zones.
+
+Unlike a regular `PodDisruptionBudget`, which evaluates across all pods, the `ZoneAwarePodDisruptionBudget` evaluates the unavailable pod count within a zone, and only allows an eviction if no other zone is disrupted.
+
+This allows an operator to perform maintenance on a single zone whilst ensuring sufficient pod availability in the other zones.
+
+Consider the following topology where the `ZPDB` has `maxUnavailable` set to 1:
+
+* StatefulSet `ingester-zone-a` manages pods `ingester-zone-a-0` and `ingester-zone-a-1`
+* StatefulSet `ingester-zone-b` manages pods `ingester-zone-b-0` and `ingester-zone-b-1`
+* StatefulSet `ingester-zone-c` manages pods `ingester-zone-c-0` and `ingester-zone-c-1`
+
+When a pod eviction request is received, the availability of the pods in the other zones is considered, as well as the availability in the zone of the pod being evicted.
+
+If `ingester-zone-a-0` is to be evicted, the eviction is allowed if there are no disruptions in either zone `b` or zone `c`.
+
+If `ingester-zone-a-1` has failed and `ingester-zone-a-0` is to be evicted, the eviction is denied, since it would leave 2 pods unavailable in zone `a`, exceeding the `maxUnavailable` of 1.
+
+If `maxUnavailable` is 2, the eviction of `ingester-zone-a-0` is granted, since zone `a` may have 2 unavailable pods and there are no disruptions in zone `b` or zone `c`.
+
+If `ingester-zone-a-0` is to be evicted and `ingester-zone-b-0` has failed, the eviction request is denied regardless of the value of `maxUnavailable`, because another zone is already disrupted.
+
+*A pod eviction is only allowed if the number of unavailable pods is within the maximum unavailability threshold for the zone and no other zone has a disruption.*
+
+## Partition awareness
+
+The `ZPDB` can be configured for partition awareness. This is intended for workloads like Mimir's ingesters running with ingest storage, where some unavailability in different zones can be tolerated, provided each partition has sufficient availability.
+
+In this configuration, the `ZPDB` determines the partition for the pod being evicted, and evaluates the eviction against the unavailable counts of ALL pods which serve this partition.
+
+*A pod eviction is only allowed if the number of unavailable pods serving a specific partition is less than the `maxUnavailable` value.*
+
+Using the same topology as the previous section, where the `ZPDB` has `maxUnavailable=1`:
+
+If `ingester-zone-b-0` has failed and `ingester-zone-a-1` is to be evicted, the eviction is allowed, as there are no disruptions in either zone `b` or zone `c` for partition `1`.
+
+If `ingester-zone-b-0` has failed and `ingester-zone-a-0` is to be evicted, the eviction is denied, as partition `0` in zone `b` is disrupted.
+
+## Operations
+
+### Setup
+
+The `ZoneAwarePodDisruptionBudget` is provided as a custom resource.
+
+A pod eviction webhook is registered for approving voluntary pod eviction requests, and a validating webhook is registered for validating `ZPDB` objects.
+
+The following is required to enable the `ZoneAwarePodDisruptionBudget`:
+
+* a custom resource definition for the `ZoneAwarePodDisruptionBudget` kind - a sample is provided in [development](./development/zone-aware-pod-disruption-budget-custom-resource-definition.yaml)
+* a `ValidatingWebhookConfiguration` registering the `rollout-operator` for pod evictions - a sample is provided in [development](./development/eviction-webhook.yaml)
+* a `ZoneAwarePodDisruptionBudget` resource for each set of `StatefulSets` - see below
+
+Example `ZoneAwarePodDisruptionBudget`:
+
+```yaml
+apiVersion: rollout-operator.grafana.com/v1
+kind: ZoneAwarePodDisruptionBudget
+metadata:
+  name: ingester-rollout
+  namespace: namespace
+  labels:
+    name: ingester-rollout
+spec:
+  maxUnavailable: 1
+  selector:
+    matchLabels:
+      rollout-group: ingester
+  # podNamePartitionRegex: "[a-z\\-]+-zone-[a-z]-([0-9]+)"
+  # podNameRegexGroup: 1
+```
+
+### Configuration options
+
+The exact resource attributes should be referenced via the provided custom resource definition file (see above).
+
+Functionality includes the ability to:
+
+* set a fixed maximum unavailable pod threshold
+* set the unavailable pod threshold as a percentage. This can only be used in classic zones and cannot be used with partition awareness. The percentage is calculated against the StatefulSet's `spec.replicas` count.
+* set the selector to match the applicable Pods and StatefulSets
+* set the regular expression used to determine a partition from a pod name (if using partition awareness)
+
+Note - `maxUnavailable` can be set to 0. In this case no voluntary evictions in any zone will be allowed.
+
+Note - a validating webhook configuration is provided in [development](./development/zone-aware-pod-disruption-budget-validating-webhook.yaml) which allows the `rollout-operator` to verify a `ZoneAwarePodDisruptionBudget` configuration being created or updated. This ensures that no invalid configuration can be applied.
+
+Note - `podNameRegexGroup` allows the capture group index to be set. This is required if the partition regex has more than one capture group `(...)` in the expression. 1-based indexing is used, such that 1 matches the first parenthesized capture group.
````
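
The eviction rules documented above can be sketched as a small, self-contained Go program. The helper names (`evictionAllowed`, `partitionFromPodName`) are hypothetical; the real implementation lives in `pkg/zpdb` and works against live cluster state rather than plain maps:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// evictionAllowed applies the two zone-aware rules described above:
// an eviction is denied if any *other* zone already has an unavailable
// pod, or if evicting would push the pod's own zone past maxUnavailable.
func evictionAllowed(unavailableByZone map[string]int, evictZone string, maxUnavailable int) bool {
	for zone, unavailable := range unavailableByZone {
		if zone != evictZone && unavailable > 0 {
			return false // another zone is already disrupted
		}
	}
	// The pod being evicted becomes unavailable too, hence the +1.
	return unavailableByZone[evictZone]+1 <= maxUnavailable
}

// partitionFromPodName extracts the partition number from a pod name
// using a podNamePartitionRegex-style pattern and a 1-based capture
// group index (podNameRegexGroup).
func partitionFromPodName(pattern, name string, regexGroup int) (int, error) {
	m := regexp.MustCompile(pattern).FindStringSubmatch(name)
	if m == nil || regexGroup >= len(m) {
		return 0, fmt.Errorf("cannot determine partition for pod %q", name)
	}
	return strconv.Atoi(m[regexGroup])
}

func main() {
	// All zones healthy: evicting in zone a is allowed.
	fmt.Println(evictionAllowed(map[string]int{"a": 0, "b": 0, "c": 0}, "a", 1)) // true

	// Zone b is disrupted: eviction in zone a is denied regardless of maxUnavailable.
	fmt.Println(evictionAllowed(map[string]int{"a": 0, "b": 1, "c": 0}, "a", 2)) // false

	// maxUnavailable=0 blocks all voluntary evictions.
	fmt.Println(evictionAllowed(map[string]int{"a": 0, "b": 0, "c": 0}, "a", 0)) // false

	p, _ := partitionFromPodName(`[a-z\-]+-zone-[a-z]-([0-9]+)`, "ingester-zone-b-1", 1)
	fmt.Println(p) // 1
}
```

Note how the `+1` encodes that the pod under eviction counts toward its own zone's unavailability, which is why `maxUnavailable: 0` forbids every voluntary eviction.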

cmd/rollout-operator/main.go

Lines changed: 23 additions & 6 deletions
```diff
@@ -37,6 +37,7 @@ import (
 	"github.com/grafana/rollout-operator/pkg/controller"
 	"github.com/grafana/rollout-operator/pkg/instrumentation"
 	"github.com/grafana/rollout-operator/pkg/tlscert"
+	"github.com/grafana/rollout-operator/pkg/zpdb"
 )
 
 const defaultServerSelfSignedCertExpiration = model.Duration(365 * 24 * time.Hour)
```

```diff
@@ -206,19 +207,23 @@ func main() {
 	dynamicClient, err := dynamic.NewForConfigAndClient(kubeConfig, httpClient)
 	check(errors.Wrap(err, "failed to init dynamicClient"))
 
+	// watches for validating webhooks being added - this is only started if the TLS server is started
 	webhookObserver := tlscert.NewWebhookObserver(kubeClient, cfg.kubeNamespace, logger)
 
-	// Start TLS server if enabled.
-	maybeStartTLSServer(cfg, httpRT, logger, kubeClient, restart, metrics, webhookObserver)
+	// controller for pod eviction - this is only started if the TLS server is started
+	evictionController := zpdb.NewEvictionController(kubeClient, dynamicClient, cfg.kubeNamespace, logger)
 
-	// Init the controller.
+	maybeStartTLSServer(cfg, httpRT, logger, kubeClient, restart, metrics, evictionController, webhookObserver)
+
+	// Init the controller
 	c := controller.NewRolloutController(kubeClient, restMapper, scaleClient, dynamicClient, cfg.kubeNamespace, httpClient, cfg.reconcileInterval, reg, logger)
 	check(errors.Wrap(c.Init(), "failed to init controller"))
 
 	// Listen to sigterm, as well as for restart (like for certificate renewal).
 	go func() {
 		waitForSignalOrRestart(logger, restart)
 		c.Stop()
+		evictionController.Stop()
 		webhookObserver.Stop()
 	}()
```

```diff
@@ -240,7 +245,7 @@ func waitForSignalOrRestart(logger log.Logger, restart chan string) {
 	}
 }
 
-func maybeStartTLSServer(cfg config, rt http.RoundTripper, logger log.Logger, kubeClient *kubernetes.Clientset, restart chan string, metrics *metrics, vwo *tlscert.WebhookObserver) {
+func maybeStartTLSServer(cfg config, rt http.RoundTripper, logger log.Logger, kubeClient *kubernetes.Clientset, restart chan string, metrics *metrics, evictionController *zpdb.EvictionController, webhookObserver *tlscert.WebhookObserver) {
 	if !cfg.serverTLSEnabled {
 		level.Info(logger).Log("msg", "tls server is not enabled")
 		return
```

```diff
@@ -281,18 +286,30 @@ func maybeStartTLSServer(cfg config, rt http.RoundTripper, logger log.Logger, ku
 	}
 
 	// Start monitoring for validating webhook configurations and patch if required
-	check(vwo.Init(webHookListener))
-
+	check(webhookObserver.Init(webHookListener))
 	}
 
+	// Start monitoring for zpdb configurations and pods
+	check(evictionController.Start())
+
 	prepDownscaleAdmitFunc := func(ctx context.Context, logger log.Logger, ar v1.AdmissionReview, api *kubernetes.Clientset) *v1.AdmissionResponse {
 		return admission.PrepareDownscale(ctx, rt, logger, ar, api, cfg.useZoneTracker, cfg.zoneTrackerConfigMapName)
 	}
 
+	podEvictionFunc := func(ctx context.Context, _ log.Logger, ar v1.AdmissionReview, _ *kubernetes.Clientset) *v1.AdmissionResponse {
+		return evictionController.HandlePodEvictionRequest(ctx, ar)
+	}
+
+	zpdbValidationFunc := func(ctx context.Context, l log.Logger, ar v1.AdmissionReview, _ *kubernetes.Clientset) *v1.AdmissionResponse {
+		return admission.ZoneAwarePdbValidatingWebhookHandler(ctx, l, ar)
+	}
+
 	tlsSrv, err := newTLSServer(cfg, logger, cert, metrics)
 	check(errors.Wrap(err, "failed to create tls server"))
 	tlsSrv.Handle(admission.NoDownscaleWebhookPath, admission.Serve(admission.NoDownscale, logger, kubeClient))
 	tlsSrv.Handle(admission.PrepareDownscaleWebhookPath, admission.Serve(prepDownscaleAdmitFunc, logger, kubeClient))
+	tlsSrv.Handle(zpdb.PodEvictionWebhookPath, admission.Serve(podEvictionFunc, logger, kubeClient))
+	tlsSrv.Handle(admission.ZpdbValidatorWebhookPath, admission.Serve(zpdbValidationFunc, logger, kubeClient))
 	check(errors.Wrap(tlsSrv.Start(), "failed to start tls server"))
 }
```

development/README.md

Lines changed: 118 additions & 5 deletions
````diff
@@ -1,13 +1,126 @@
-This directory contains Kubernetes manifests to start an instance of the rollout-operator locally.
+# Quick Start
+
+This directory contains Kubernetes manifests to start an instance of the `rollout-operator` locally.
 
 To use it:
 
-* Build the rollout-operator image: `make build-image`
+* Build the `rollout-operator` image: `make build-image`
 * Make the image available to your Kubernetes cluster (not required for use with Docker Desktop)
 * Apply the Kubernetes manifests: `./apply.sh`
-* Port forward to the operator service: `kubectl --namespace=rollout-operator-development port-forward svc/rollout-operator 8080:80`
+* Port forward to the operator service:
+  ```
+  kubectl --namespace=rollout-operator-development port-forward svc/rollout-operator 8080:80
+  kubectl --namespace=rollout-operator-development port-forward svc/rollout-operator 8443:443
+  ```
 * Port forward to the Jaeger UI: `kubectl --namespace=rollout-operator-development port-forward svc/jaeger 16686:16686`
 
-You'll then be able to access the rollout operator at `http://localhost:8080`, and the Jaeger tracing UI at `http://localhost:16686`.
+You'll then be able to access the rollout operator at `http://localhost:8080`, the rollout operator webhooks at `https://localhost:8443`, and the Jaeger tracing UI at `http://localhost:16686`.
+
+You can use the StatefulSets to exercise the operator across a multi-zone `test-app` environment.
+
+# ZoneAwarePodDisruptionBudget (ZPDB)
+
+Included is a `ZoneAwarePodDisruptionBudget` which can be used to enforce a multi-zone pod disruption budget.
+
+By default, this is applied to the `test-app` Pods and StatefulSets.
+
+To remove this functionality from the `test-app`:
+
+```text
+kubectl delete -f rollout-operator-zone-aware-pod-disruption-budget.yaml
+```
+
+# Minikube & Docker Desktop
+
+Note - if you are using local `Docker Desktop` and `minikube` and you intend to use locally built images, ensure that you are using Minikube's Docker daemon so any images you build are available inside the Minikube cluster.
+
+Additionally, ensure that you set the container image to `imagePullPolicy: Never`.
+
+```
+cd ~/rollout-operator
+minikube start
+eval $(minikube docker-env)
+make build-image
+
+(
+  cd development
+  ./apply.sh
+)
+```
+
+# Useful commands
+
+The following are useful commands when running tests with the `rollout-operator` and the `ZoneAwarePodDisruptionBudget`.
+
+List custom resource definitions:
+```
+kubectl get crds -n rollout-operator-development
+```
+
+List custom resources:
+```
+kubectl get zoneawarepoddisruptionbudgets -n rollout-operator-development
+```
+
+Show a custom resource by name:
+```
+kubectl get zoneawarepoddisruptionbudget test-app -n rollout-operator-development -o yaml
+```
+
+Port forward to the rollout-operator:
+```
+kubectl --namespace=rollout-operator-development port-forward svc/rollout-operator 8443:443
+```
+
+Tail logs for the rollout-operator:
+```
+kubectl logs -f `kubectl get pods -n rollout-operator-development | grep rollout-operator | awk '{print $1}'` -n rollout-operator-development
+```
+
+Test the pod eviction and `ZPDB`:
+```
+# watch the pod status
+while true; do kubectl get pods -n rollout-operator-development; sleep 1; clear; done
+```
+
+```
+# in another shell - issue a drain for all pods
+kubectl drain --pod-selector rollout-group=test-app minikube & sleep 5; kubectl uncordon minikube
+```
+
+Note - in the above example the `uncordon` is important; without it the drained pods in the first zone will not be re-deployed, and until those pods are running again the next zone's pods will not be evicted.
+
+This is a limitation of running multiple zones within a single Kubernetes node. In a usual deployment each zone would have its pods distributed across different nodes.
+
+Apply an updated `ZPDB`:
+```
+# make changes in rollout-operator-zone-aware-pod-disruption-budget.yaml
+kubectl apply -f rollout-operator-zone-aware-pod-disruption-budget.yaml
+```
 
-You can use the `test-app` StatefulSet to exercise the operator.
+Test the pod eviction webhook manually:
+```
+curl --insecure -X POST "https://127.0.0.1:8443/admission/pod-eviction" \
+  -H "Content-Type: application/json" \
+  --data '{
+    "apiVersion": "admission.k8s.io/v1",
+    "kind": "AdmissionReview",
+    "request": {
+      "uid": "test-eviction-123",
+      "kind": {"group": "policy", "version": "v1", "kind": "Eviction"},
+      "resource": {"group": "policy", "version": "v1", "resource": "evictions"},
+      "name": "test-app-zone-a-0",
+      "namespace": "rollout-operator-development",
+      "operation": "CREATE",
+      "subResource": "eviction",
+      "userInfo": {"username": "test-user"},
+      "dryRun": false,
+      "object": {
+        "apiVersion": "policy/v1",
+        "kind": "Eviction",
+        "metadata": {"name": "test-app-zone-a-0", "namespace": "rollout-operator-development"},
+        "deleteOptions": {"gracePeriodSeconds": 30}
+      }
+    }
+  }'
+```
````

development/apply.sh

Lines changed: 2 additions & 1 deletion
```diff
@@ -19,4 +19,5 @@ select yn in "Yes" "No"; do
 done
 
 kubectl apply --wait -f "$SCRIPT_DIR/namespace.yaml"
-find "$SCRIPT_DIR" -type f -name '*.yaml' -not -name 'namespace.yaml' -exec kubectl apply --namespace=rollout-operator-development --wait -f {} \;
+kubectl apply --wait -f "$SCRIPT_DIR/zone-aware-pod-disruption-budget-custom-resource-definition.yaml"
+find "$SCRIPT_DIR" -type f -name '*.yaml' -not -name 'namespace.yaml' -not -name 'zone-aware-pod-disruption-budget-custom-resource-definition.yaml' -exec kubectl apply --namespace=rollout-operator-development --wait -f {} \;
```

development/eviction-webhook.yaml

Lines changed: 33 additions & 0 deletions
```diff
@@ -0,0 +1,33 @@
+apiVersion: admissionregistration.k8s.io/v1
+kind: ValidatingWebhookConfiguration
+metadata:
+  name: pod-eviction-rollout-operator-development
+  labels:
+    grafana.com/inject-rollout-operator-ca: "true"
+    grafana.com/namespace: rollout-operator-development
+webhooks:
+  - name: pod-eviction-rollout-operator-development.grafana.com
+    clientConfig:
+      service:
+        namespace: rollout-operator-development
+        name: rollout-operator
+        path: /admission/pod-eviction
+        port: 443
+    rules:
+      - operations:
+          - CREATE
+        apiGroups:
+          - ""
+        apiVersions:
+          - v1
+        resources:
+          - pods/eviction
+        scope: Namespaced
+
+    admissionReviewVersions:
+      - v1
+    namespaceSelector:
+      matchLabels:
+        kubernetes.io/metadata.name: rollout-operator-development
+    failurePolicy: Fail
+    sideEffects: None
```

development/rollout-operator-role.yaml

Lines changed: 9 additions & 0 deletions
```diff
@@ -28,3 +28,12 @@ rules:
   - statefulsets/status
   verbs:
   - update
+- apiGroups:
+  - rollout-operator.grafana.com
+  resources:
+  - zoneawarepoddisruptionbudgets
+  verbs:
+  - get
+  - list
+  - watch
+
```
39+

0 commit comments
