Commit e5b0b7b

committed
Address feedback
Signed-off-by: Heba Elayoty <[email protected]>
1 parent cd9d4a3 commit e5b0b7b

File tree

3 files changed

+142
-76
lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
kep-number: 5471
2+
alpha:
3+
approver: "@soltysh"

keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md

Lines changed: 137 additions & 74 deletions
@@ -1,53 +1,80 @@
11
# KEP-5471: Extended Toleration Operators for Threshold-Based Placement
22

33
<!-- toc -->
4-
- [Release Signoff Checklist](#release-signoff-checklist)
5-
- [Summary](#summary)
6-
- [Motivation](#motivation)
7-
- [Why not NodeAffinity alone?](#why-not-nodeaffinity-alone)
8-
- [Goals](#goals)
9-
- [Non-Goals](#non-goals)
10-
- [Benefits for implementing this feature for DRA and AI Workloads](#benefits-for-implementing-this-feature-for-dra-and-ai-workloads)
11-
- [Proposal](#proposal)
12-
- [User Stories (Optional)](#user-stories-optional)
13-
- [Story 1 — Cluster operator using mixed on-demand and spot nodes](#story-1--cluster-operator-using-mixed-on-demand-and-spot-nodes)
14-
- [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos)
15-
- [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability)
16-
- [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management)
17-
- [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management)
18-
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
19-
- [Risks and Mitigations](#risks-and-mitigations)
20-
- [Scheduler Performance Regression](#scheduler-performance-regression)
21-
- [API Compatibility and Version Skew](#api-compatibility-and-version-skew)
22-
- [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing)
23-
- [Cross-SIG Impact](#cross-sig-impact)
24-
- [Design Details](#design-details)
25-
- [API Changes](#api-changes)
26-
- [Semantics](#semantics)
27-
- [Implementation](#implementation)
28-
- [Feature Gate Definition](#feature-gate-definition)
29-
- [Test Plan](#test-plan)
30-
- [Prerequisite testing updates](#prerequisite-testing-updates)
31-
- [Unit tests](#unit-tests)
32-
- [Integration tests](#integration-tests)
33-
- [e2e tests](#e2e-tests)
34-
- [Graduation Criteria](#graduation-criteria)
35-
- [Alpha](#alpha)
36-
- [Beta](#beta)
37-
- [GA](#ga)
38-
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
39-
- [Version Skew Strategy](#version-skew-strategy)
40-
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
41-
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
42-
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
43-
- [Monitoring Requirements](#monitoring-requirements)
44-
- [Dependencies](#dependencies)
45-
- [Scalability](#scalability)
46-
- [Troubleshooting](#troubleshooting)
47-
- [Implementation History](#implementation-history)
48-
- [Drawbacks](#drawbacks)
49-
- [Alternatives](#alternatives)
50-
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
4+
- [KEP-5471: Extended Toleration Operators for Threshold-Based Placement](#kep-5471-extended-toleration-operators-for-threshold-based-placement)
5+
- [Release Signoff Checklist](#release-signoff-checklist)
6+
- [Summary](#summary)
7+
- [Motivation](#motivation)
8+
- [Why not NodeAffinity alone?](#why-not-nodeaffinity-alone)
9+
- [Goals](#goals)
10+
- [Non-Goals](#non-goals)
11+
- [Benefits for implementing this feature for DRA and AI Workloads](#benefits-for-implementing-this-feature-for-dra-and-ai-workloads)
12+
- [Proposal](#proposal)
13+
- [User Stories (Optional)](#user-stories-optional)
14+
- [Story 1 — Cluster operator using mixed on-demand and spot nodes](#story-1--cluster-operator-using-mixed-on-demand-and-spot-nodes)
15+
- [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos)
16+
- [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability)
17+
- [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management)
18+
- [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management)
19+
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
20+
- [Risks and Mitigations](#risks-and-mitigations)
21+
- [Scheduler Performance Regression](#scheduler-performance-regression)
22+
- [API Compatibility and Version Skew](#api-compatibility-and-version-skew)
23+
- [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing)
24+
- [Cross-SIG Impact](#cross-sig-impact)
25+
- [Design Details](#design-details)
26+
- [API Changes](#api-changes)
27+
- [Semantics](#semantics)
28+
- [Implementation](#implementation)
29+
- [Feature Gate Definition](#feature-gate-definition)
30+
- [Test Plan](#test-plan)
31+
- [Prerequisite testing updates](#prerequisite-testing-updates)
32+
- [Unit tests](#unit-tests)
33+
- [Performance tests](#performance-tests)
34+
- [Integration tests](#integration-tests)
35+
- [e2e tests](#e2e-tests)
36+
- [Graduation Criteria](#graduation-criteria)
37+
- [Alpha](#alpha)
38+
- [Beta](#beta)
39+
- [GA](#ga)
40+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
41+
- [Version Skew Strategy](#version-skew-strategy)
42+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
43+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
44+
- [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
45+
- [Does enabling the feature change any default behavior?](#does-enabling-the-feature-change-any-default-behavior)
46+
- [Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?](#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement)
47+
- [What happens if we reenable the feature if it was previously rolled back?](#what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back)
48+
- [Are there any tests for feature enablement/disablement?](#are-there-any-tests-for-feature-enablementdisablement)
49+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
50+
- [How can a rollout or rollback fail? Can it impact already running workloads?](#how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads)
51+
- [What specific metrics should inform a rollback?](#what-specific-metrics-should-inform-a-rollback)
52+
- [Were upgrade and rollback tested? Was the upgrade-\>downgrade-\>upgrade path tested?](#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested)
53+
- [Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?](#is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc)
54+
- [Monitoring Requirements](#monitoring-requirements)
55+
- [How can an operator determine if the feature is in use by workloads?](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads)
56+
- [How can someone using this feature know that it is working for their instance?](#how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance)
57+
- [What are the reasonable SLOs (Service Level Objectives) for the enhancement?](#what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement)
58+
- [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service)
59+
- [Are there any missing metrics that would be useful to have to improve observability of this feature?](#are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature)
60+
- [Dependencies](#dependencies)
61+
- [Does this feature depend on any specific services running in the cluster?](#does-this-feature-depend-on-any-specific-services-running-in-the-cluster)
62+
- [Scalability](#scalability)
63+
- [Will enabling / using this feature result in any new API calls?](#will-enabling--using-this-feature-result-in-any-new-api-calls)
64+
- [Will enabling / using this feature result in introducing new API types?](#will-enabling--using-this-feature-result-in-introducing-new-api-types)
65+
- [Will enabling / using this feature result in any new calls to the cloud provider?](#will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider)
66+
- [Will enabling / using this feature result in increasing size or count of the existing API objects?](#will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects)
67+
- [Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?](#will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos)
68+
- [Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?](#will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components)
69+
- [Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?](#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc)
70+
- [Troubleshooting](#troubleshooting)
71+
- [How does this feature react if the API server and/or etcd is unavailable?](#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable)
72+
- [What are other known failure modes?](#what-are-other-known-failure-modes)
73+
- [What steps should be taken if SLOs are not being met to determine the problem?](#what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem)
74+
- [Implementation History](#implementation-history)
75+
- [Drawbacks](#drawbacks)
76+
- [Alternatives](#alternatives)
77+
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
5178
<!-- /toc -->
5279

5380
## Release Signoff Checklist
@@ -99,6 +126,7 @@ From a scheduling perspective, adding numeric operators to tolerations only adju
99126
- Add comparison operators to tolerations so pods can match taints like `node.kubernetes.io/sla=<int>` using thresholds.
100127
- Keep behavior consistent with existing effects (`NoSchedule`, `PreferNoSchedule`, `NoExecute`).
101128
- Backward compatible and opt‑in via a feature gate.
129+
- Zero operational performance impact on existing pod scheduling using `Equal` and `Exists` operators.
102130

103131
### Non-Goals
104132

@@ -110,13 +138,13 @@ From a scheduling perspective, adding numeric operators to tolerations only adju
110138

111139
In addition to general scheduling improvements, SLA‑aware opt‑in via tolerations has specific advantages for `Dynamic Resource Allocation (DRA)` and `AI/ML`:
112140

113-
- DRA steers GPUs/accelerators resource claims by node reliability: critical workloads get high‑SLA capacity while batch workloads use cheaper pools. Taints block risky pools and evict when capacity degrades.
141+
- DRA steers GPUs/accelerators resource claims by node reliability: critical workloads get high‑SLA capacity while interruptible batch workloads use cheaper pools. Taints block risky pools and evict when capacity degrades.
114142

115-
- AI/ML pipelines can place latency‑sensitive inference on high‑SLA nodes while directing batch to run on spot nodes. When spot nodes are reclaimed, taints trigger graceful drain and controlled failover.
143+
- AI/ML pipelines can place latency‑sensitive inference on high‑SLA nodes while directing checkpoint-able batch workloads to run on spot nodes. When spot nodes are reclaimed, taints trigger graceful drain and controlled failover.
116144

117145
| Benefit | Impact on DRA | Impact on AI/ML workloads |
118146
| ------------------------------ | --------------------------------------------------------- | ------------------------------------------------------- |
119-
| **Cost–reliability trade-off** | Critical workloads stay on premium nodes; batch uses spot | Inference on reliable nodes; training on cheaper pools |
147+
| **Cost–reliability trade-off** | Critical workloads stay on premium nodes; interruptible batch uses spot | Inference on reliable nodes; checkpoint-able training on cheaper pools |
120148
| **Workload-aware placement** | Different claim types target appropriate node tiers | Pipeline stages match their reliability requirements |
121149
| **Graceful preemption** | `NoExecute` provides controlled eviction timing | Predictable failover for training and serving workloads |
122150
| **Resource fairness** | Prevents monopolization of premium capacity | Teams share reliable accelerators fairly |
@@ -156,6 +184,18 @@ spec:
156184
operator: Gt
157185
value: "750"
158186
effect: NoSchedule
187+
---
188+
# Critical workload will not be scheduled until a suitable high reliability node has capacity
189+
apiVersion: v1
190+
kind: Pod
191+
metadata:
192+
name: critical-workload
193+
spec:
194+
tolerations:
195+
- key: node.kubernetes.io/sla
196+
operator: Gt
197+
value: "950"
198+
effect: NoSchedule
159199
```
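
As a rough illustration of how the two tolerations above would be evaluated, the following Go sketch compares a toleration threshold against a node's SLA taint value. It is not the KEP's implementation: the `tolerates` helper and the plain string operator names are hypothetical, and it assumes that `Gt` matches when the taint value is strictly greater than the toleration value (with `Lt` the reverse) and that any non-integer value simply fails to match.

```go
package main

import (
	"fmt"
	"strconv"
)

// Operator names mirror the toleration operators used in the YAML above.
// They are plain strings here; the real API type is not reproduced.
const (
	opLt = "Lt"
	opGt = "Gt"
)

// tolerates reports whether a toleration (operator, value) matches a taint value.
// Assumption: Gt means "taint value > toleration value" and Lt means
// "taint value < toleration value". Non-integer values never match.
func tolerates(op, tolerationVal, taintVal string) bool {
	tol, err := strconv.ParseInt(tolerationVal, 10, 64)
	if err != nil {
		return false // invalid toleration value
	}
	taint, err := strconv.ParseInt(taintVal, 10, 64)
	if err != nil {
		return false // invalid taint value (assumed to be treated the same way)
	}
	switch op {
	case opGt:
		return taint > tol
	case opLt:
		return taint < tol
	default:
		return false
	}
}

func main() {
	// Evaluate both Story 1 tolerations against a range of node SLA taints.
	for _, sla := range []string{"700", "800", "950", "980", "high"} {
		fmt.Printf("sla=%s batch(Gt 750)=%t critical(Gt 950)=%t\n",
			sla, tolerates(opGt, "750", sla), tolerates(opGt, "950", sla))
	}
}
```

Under these assumptions, a node tainted with `sla=800` is tolerated by the `Gt 750` pod but not by `critical-workload`, which stays pending until a node with an SLA value above 950 has capacity.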
160200
161201
#### Story 2 — AI inference service with strict SLOs
@@ -247,7 +287,31 @@ This ensures DRA allocations are both resource-correct and reliability-compliant
247287
**Example Configuration:**
248288
249289
```yaml
250-
# DRA claim with SLA constraints
290+
# High-SLA GPU device published by DRA driver
291+
apiVersion: resource.k8s.io/v1alpha4
292+
kind: ResourceSlice
293+
metadata:
294+
name: gpu-node-01-slice
295+
spec:
296+
driver: nvidia.com/gpu
297+
pool:
298+
name: gpu-node-01
299+
generation: 1
300+
devices:
301+
- name: gpu-node-01-device-0
302+
basic:
303+
attributes:
304+
memory: "32Gi"
305+
compute-capability: "8.6"
306+
capacity:
307+
count: 1
308+
# Driver applies SLA taint based on node reliability metrics
309+
taints:
310+
- key: node.kubernetes.io/sla
311+
value: "980" # 98% SLA
312+
effect: NoSchedule
313+
---
314+
# DRA claim with SLA constraints
251315
apiVersion: resource.k8s.io/v1alpha4
252316
kind: ResourceClaim
253317
metadata:
@@ -257,6 +321,12 @@ spec:
257321
requests:
258322
- name: gpu
259323
deviceClassName: nvidia-a100
324+
tolerations:
325+
# Only accept GPUs with SLA > 950 (above 95%)
326+
- key: node.kubernetes.io/sla
327+
operator: Gt
328+
value: "950"
329+
effect: NoSchedule
260330
---
261331
# Pod using DRA claim with SLA requirements
262332
apiVersion: v1
@@ -318,7 +388,7 @@ spec:
318388
value: "24"
319389
effect: NoSchedule
320390
---
321-
# Batch training workload tolerates degraded devices
391+
# Short-lived batch training workload tolerates degraded devices
322392
kind: ResourceClaim
323393
metadata:
324394
name: training-gpu-claim
@@ -358,7 +428,8 @@ spec:
358428

359429
**Mitigation**:
360430

361-
- Parse integers only when new operators are used (no impact on existing workloads)
431+
- Parse integers only when new operators are used.
432+
- Existing `Equal`/`Exists` operators execute identical code paths with no additional overhead.
362433
- Consider caching parsed values in scheduler data structures if performance issues arise
363434
- Feature gate allows disabling if performance problems occur
364435

@@ -482,19 +553,21 @@ func validateTolerations(tolerations []core.Toleration, fldPath *field.Path) fie
482553
```go
483554
// ToleratesTaint checks if the toleration tolerates the taint.
484555
func (t *Toleration) ToleratesTaint(taint *Taint) bool {
556+
switch t.Operator {
485557
// Existing key and effect matching logic...
486558
487-
switch t.Operator {
488-
// ...
559+
// Handle existing operators first. This ensures
560+
// zero performance impact for existing Equal/Exists scenarios.
489561
case TolerationOpLt, TolerationOpGt:
490562
// Feature gate check is not needed here as validation already handles it
491-
return compareNumericValues(t.Value, taint.Value, t.Operator)
563+
// Only parse values when comparison operators are actually used
564+
return compareValues(t.Value, taint.Value, t.Operator)
492565
default:
493566
return false
494567
}
495568
}
496569
497-
func compareNumericValues(tolerationVal, taintVal string, op TolerationOperator) bool {
570+
func compareValues(tolerationVal, taintVal string, op TolerationOperator) bool {
498571
tVal, tErr := strconv.ParseInt(tolerationVal, 10, 64)
499572
if tErr != nil {
500573
return false // Invalid toleration value
@@ -558,6 +631,11 @@ All core changes must be covered by unit tests, in both Taint API, validation, a
558631
- **Validation Tests:** (pkg/apis/core/validation/validation_test.go)
559632
- `<package>`: `<date>` - `<test coverage>`
560633

634+
##### Performance tests
635+
636+
- Establish baseline scheduling latency for workloads using only `Equal`/`Exists` operators.
637+
- Verify that enabling the feature gate with no comparison operators used shows no measurable performance difference.
638+
561639
##### Integration tests
562640

563641
<!--
@@ -698,17 +776,12 @@ in back-to-back releases.
698776
- Feedback collected from early adopters in SIG-Scheduling
699777
- Performance testing shows that there is no significant scheduler latency increase nor memory usage increase.
700778
- Implement feature for DRA APIs
701-
- Stress testing with:
702-
- 1000+ nodes with numeric taints
703-
- 10,000+ pods with numeric tolerations
704-
- Mixed numeric/string operator usage
779+
- Stress testing.
705780

706781
#### GA
707782

708783
- Evidence of real-world adoption.
709-
- Complete scalability validation:
710-
- 5000-node clusters with mixed taint/toleration workloads
711-
- No performance regressions under sustained load
784+
- Complete scalability validation.
712785

713786
### Upgrade / Downgrade Strategy
714787

@@ -1155,13 +1228,3 @@ information to express the idea and why it was not acceptable.
11551228
-->
11561229

11571230
## Infrastructure Needed (Optional)
1158-
1159-
<!--
1160-
Use this section if you need things from the project/SIG. Examples include a
1161-
new subproject, repos requested, or GitHub details. Listing these here allows a
1162-
SIG to get the process for these resources started right away.
1163-
-->
1164-
1165-
[kubernetes.io]: https://kubernetes.io/
1166-
[kubernetes/enhancements]: https://git.k8s.io/enhancements
1167-
[kubernetes/website]: https://git.k8s.io/website

keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@ authors:
44
- "@jane.doe"
55
owning-sig: sig-scheduling
66
participating-sigs:
7-
- sig-node
8-
status: provisional
7+
- sig-apps
8+
status: implementable
99
creation-date: 2025-08-08
1010
reviewers:
1111
- "@SergeyKanzhelev"
