Address feedback

helayoty · helayoty · commit e5b0b7b4e85c · 2025-08-26T03:03:29.000-07:00
Signed-off-by: Heba Elayoty <heelayot@microsoft.com>
diff --git a/keps/prod-readiness/sig-scheduling/5471.yaml b/keps/prod-readiness/sig-scheduling/5471.yaml
@@ -0,0 +1,3 @@
+kep-number: 5471
+alpha:
+  approver: "@soltysh"
diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md
@@ -1,53 +1,80 @@
 # KEP-5471: Extended Toleration Operators for Threshold-Based Placement
 
 <!-- toc -->
-- [Release Signoff Checklist](#release-signoff-checklist)
-- [Summary](#summary)
-- [Motivation](#motivation)
-  - [Why not NodeAffinity alone?](#why-not-nodeaffinity-alone)
-  - [Goals](#goals)
-  - [Non-Goals](#non-goals)
-  - [Benefits for implementing this feature for DRA and AI Workloads](#benefits-for-implementing-this-feature-for-dra-and-ai-workloads)
-- [Proposal](#proposal)
-  - [User Stories (Optional)](#user-stories-optional)
-    - [Story 1 — Cluster operator using mixed on-demand and spot nodes](#story-1--cluster-operator-using-mixed-on-demand-and-spot-nodes)
-    - [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos)
-    - [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability)
-    - [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management)
-    - [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management)
-  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
-  - [Risks and Mitigations](#risks-and-mitigations)
-    - [Scheduler Performance Regression](#scheduler-performance-regression)
-    - [API Compatibility and Version Skew](#api-compatibility-and-version-skew)
-    - [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing)
-    - [Cross-SIG Impact](#cross-sig-impact)
-- [Design Details](#design-details)
-  - [API Changes](#api-changes)
-  - [Semantics](#semantics)
-  - [Implementation](#implementation)
-    - [Feature Gate Definition](#feature-gate-definition)
-  - [Test Plan](#test-plan)
-      - [Prerequisite testing updates](#prerequisite-testing-updates)
-      - [Unit tests](#unit-tests)
-      - [Integration tests](#integration-tests)
-      - [e2e tests](#e2e-tests)
-  - [Graduation Criteria](#graduation-criteria)
-    - [Alpha](#alpha)
-    - [Beta](#beta)
-    - [GA](#ga)
-  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
-  - [Version Skew Strategy](#version-skew-strategy)
-- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
-  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
-  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
-  - [Monitoring Requirements](#monitoring-requirements)
-  - [Dependencies](#dependencies)
-  - [Scalability](#scalability)
-  - [Troubleshooting](#troubleshooting)
-- [Implementation History](#implementation-history)
-- [Drawbacks](#drawbacks)
-- [Alternatives](#alternatives)
-- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+- [KEP-5471: Extended Toleration Operators for Threshold-Based Placement](#kep-5471-extended-toleration-operators-for-threshold-based-placement)
+  - [Release Signoff Checklist](#release-signoff-checklist)
+  - [Summary](#summary)
+  - [Motivation](#motivation)
+    - [Why not NodeAffinity alone?](#why-not-nodeaffinity-alone)
+    - [Goals](#goals)
+    - [Non-Goals](#non-goals)
+    - [Benefits for implementing this feature for DRA and AI Workloads](#benefits-for-implementing-this-feature-for-dra-and-ai-workloads)
+  - [Proposal](#proposal)
+    - [User Stories (Optional)](#user-stories-optional)
+      - [Story 1 — Cluster operator using mixed on-demand and spot nodes](#story-1--cluster-operator-using-mixed-on-demand-and-spot-nodes)
+      - [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos)
+      - [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability)
+      - [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management)
+      - [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management)
+    - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+    - [Risks and Mitigations](#risks-and-mitigations)
+      - [Scheduler Performance Regression](#scheduler-performance-regression)
+      - [API Compatibility and Version Skew](#api-compatibility-and-version-skew)
+      - [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing)
+      - [Cross-SIG Impact](#cross-sig-impact)
+  - [Design Details](#design-details)
+    - [API Changes](#api-changes)
+    - [Semantics](#semantics)
+    - [Implementation](#implementation)
+      - [Feature Gate Definition](#feature-gate-definition)
+    - [Test Plan](#test-plan)
+        - [Prerequisite testing updates](#prerequisite-testing-updates)
+        - [Unit tests](#unit-tests)
+        - [Performance tests](#performance-tests)
+        - [Integration tests](#integration-tests)
+        - [e2e tests](#e2e-tests)
+    - [Graduation Criteria](#graduation-criteria)
+      - [Alpha](#alpha)
+      - [Beta](#beta)
+      - [GA](#ga)
+    - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+    - [Version Skew Strategy](#version-skew-strategy)
+  - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+    - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+          - [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
+          - [Does enabling the feature change any default behavior?](#does-enabling-the-feature-change-any-default-behavior)
+          - [Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?](#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement)
+          - [What happens if we reenable the feature if it was previously rolled back?](#what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back)
+          - [Are there any tests for feature enablement/disablement?](#are-there-any-tests-for-feature-enablementdisablement)
+    - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+          - [How can a rollout or rollback fail? Can it impact already running workloads?](#how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads)
+          - [What specific metrics should inform a rollback?](#what-specific-metrics-should-inform-a-rollback)
+          - [Were upgrade and rollback tested? Was the upgrade-\>downgrade-\>upgrade path tested?](#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested)
+          - [Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?](#is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc)
+    - [Monitoring Requirements](#monitoring-requirements)
+          - [How can an operator determine if the feature is in use by workloads?](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads)
+          - [How can someone using this feature know that it is working for their instance?](#how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance)
+          - [What are the reasonable SLOs (Service Level Objectives) for the enhancement?](#what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement)
+          - [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service)
+          - [Are there any missing metrics that would be useful to have to improve observability of this feature?](#are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature)
+    - [Dependencies](#dependencies)
+          - [Does this feature depend on any specific services running in the cluster?](#does-this-feature-depend-on-any-specific-services-running-in-the-cluster)
+    - [Scalability](#scalability)
+          - [Will enabling / using this feature result in any new API calls?](#will-enabling--using-this-feature-result-in-any-new-api-calls)
+          - [Will enabling / using this feature result in introducing new API types?](#will-enabling--using-this-feature-result-in-introducing-new-api-types)
+          - [Will enabling / using this feature result in any new calls to the cloud provider?](#will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider)
+          - [Will enabling / using this feature result in increasing size or count of the existing API objects?](#will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects)
+          - [Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?](#will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos)
+          - [Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?](#will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components)
+          - [Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?](#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc)
+    - [Troubleshooting](#troubleshooting)
+          - [How does this feature react if the API server and/or etcd is unavailable?](#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable)
+          - [What are other known failure modes?](#what-are-other-known-failure-modes)
+          - [What steps should be taken if SLOs are not being met to determine the problem?](#what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem)
+  - [Implementation History](#implementation-history)
+  - [Drawbacks](#drawbacks)
+  - [Alternatives](#alternatives)
+  - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->
 
 ## Release Signoff Checklist
@@ -99,6 +126,7 @@ From a scheduling perspective, adding numeric operators to tolerations only adju
 - Add comparison operators to tolerations so pods can match taints like `node.kubernetes.io/sla=<int>` using thresholds.
 - Keep behavior consistent with existing effects (`NoSchedule`, `PreferNoSchedule`, `NoExecute`).
 - Backward compatible and opt‑in via a feature gate.
+- No performance impact on scheduling of existing pods that use only the `Equal` and `Exists` operators.
 
 ### Non-Goals
 
@@ -110,13 +138,13 @@ From a scheduling perspective, adding numeric operators to tolerations only adju
 
 In addition to general scheduling improvements, SLA‑aware opt‑in via tolerations has specific advantages for `Dynamic Resource Allocation (DRA)` and `AI/ML`:
 
-- DRA steers GPUs/accelerators resource claims by node reliability: critical workloads get high‑SLA capacity while batch workloads use cheaper pools. Taints block risky pools and evict when capacity degrades.
+- DRA steers GPUs/accelerators resource claims by node reliability: critical workloads get high‑SLA capacity while interruptible batch workloads use cheaper pools. Taints block risky pools and evict when capacity degrades.
 
-- AI/ML pipelines can place latency‑sensitive inference on high‑SLA nodes while directing batch to run on spot nodes. When spot nodes are reclaimed, taints trigger graceful drain and controlled failover.
+- AI/ML pipelines can place latency‑sensitive inference on high‑SLA nodes while directing checkpoint-able batch workloads to run on spot nodes. When spot nodes are reclaimed, taints trigger graceful drain and controlled failover.
 
 | Benefit                        | Impact on DRA                                             | Impact on AI/ML workloads                               |
 | ------------------------------ | --------------------------------------------------------- | ------------------------------------------------------- |
-| **Cost–reliability trade-off** | Critical workloads stay on premium nodes; batch uses spot | Inference on reliable nodes; training on cheaper pools  |
+| **Cost–reliability trade-off** | Critical workloads stay on premium nodes; interruptible batch uses spot | Inference on reliable nodes; checkpoint-able training on cheaper pools  |
 | **Workload-aware placement**   | Different claim types target appropriate node tiers       | Pipeline stages match their reliability requirements    |
 | **Graceful preemption**        | `NoExecute` provides controlled eviction timing           | Predictable failover for training and serving workloads |
 | **Resource fairness**          | Prevents monopolization of premium capacity               | Teams share reliable accelerators fairly                |
@@ -156,6 +184,18 @@ spec:
     operator: Gt
     value: "750"
     effect: NoSchedule
+---
+# Critical workload will not be scheduled until a suitable high-reliability node has capacity
+apiVersion: v1
+kind: Pod
+metadata:
+  name: critical-workload
+spec:
+  tolerations:
+  - key: node.kubernetes.io/sla
+    operator: Gt
+    value: "950"
+    effect: NoSchedule
 ```
 
 #### Story 2 — AI inference service with strict SLOs
@@ -247,7 +287,31 @@ This ensures DRA allocations are both resource-correct and reliability-compliant
 **Example Configuration:**
 
 ```yaml
-# DRA claim with SLA constraints
+# High-SLA GPU device published by DRA driver
+apiVersion: resource.k8s.io/v1alpha4
+kind: ResourceSlice
+metadata:
+  name: gpu-node-01-slice
+spec:
+  driver: nvidia.com/gpu
+  pool:
+    name: gpu-node-01
+    generation: 1
+  devices:
+  - name: gpu-node-01-device-0
+    basic:
+      attributes:
+        memory: "32Gi"
+        compute-capability: "8.6"
+      capacity:
+        count: 1
+    # Driver applies SLA taint based on node reliability metrics
+    taints:
+    - key: node.kubernetes.io/sla
+      value: "980"  # 98% SLA
+      effect: NoSchedule
+---
+# DRA claim with SLA constraints
 apiVersion: resource.k8s.io/v1alpha4
 kind: ResourceClaim
 metadata:
@@ -257,6 +321,12 @@ spec:
     requests:
     - name: gpu
       deviceClassName: nvidia-a100
+      tolerations:
+      # Only accept GPUs with SLA > 950 (95%); Gt is a strict comparison
+      - key: node.kubernetes.io/sla
+        operator: Gt
+        value: "950"
+        effect: NoSchedule
 ---
 # Pod using DRA claim with SLA requirements
 apiVersion: v1
@@ -318,7 +388,7 @@ spec:
       value: "24"
       effect: NoSchedule
 ---
-# Batch training workload tolerates degraded devices
+# Short-lived batch training workload tolerates degraded devices
 kind: ResourceClaim
 metadata:
   name: training-gpu-claim
@@ -358,7 +428,8 @@ spec:
 
 **Mitigation**:
 
-- Parse integers only when new operators are used (no impact on existing workloads)
+- Parse integers only when new operators are used.
+- Existing `Equal`/`Exists` operators execute identical code paths with no additional overhead.
 - Consider caching parsed values in scheduler data structures if performance issues arise
 - Feature gate allows disabling if performance problems occur
 
@@ -482,19 +553,21 @@ func validateTolerations(tolerations []core.Toleration, fldPath *field.Path) fie
 ```go
 // ToleratesTaint checks if the toleration tolerates the taint.
 func (t *Toleration) ToleratesTaint(taint *Taint) bool {
+    switch t.Operator {
     // Existing key and effect matching logic...
     
-    switch t.Operator {
-    // ...
+    // Handle existing operators first. This ensures
+    // zero performance impact for existing Equal/Exists scenarios.
     case TolerationOpLt, TolerationOpGt:
         // Feature gate check is not needed here as validation already handles it
-        return compareNumericValues(t.Value, taint.Value, t.Operator)
+        // Only parse values when comparison operators are actually used
+        return compareValues(t.Value, taint.Value, t.Operator)
     default:
         return false
     }
 }
 
-func compareNumericValues(tolerationVal, taintVal string, op TolerationOperator) bool {
+func compareValues(tolerationVal, taintVal string, op TolerationOperator) bool {
     tVal, tErr := strconv.ParseInt(tolerationVal, 10, 64)
     if tErr != nil {
         return false // Invalid toleration value
@@ -558,6 +631,11 @@ All core changes must be covered by unit tests, in both Taint API, validation, a
 - **Validation Tests:** ( pkg/apis/core/validation/validation_test.go)
 - `<package>`: `<date>` - `<test coverage>`
 
+##### Performance tests
+
+- Establish a baseline scheduling latency for workloads using only `Equal`/`Exists` operators.
+- Verify that enabling the feature gate, with no comparison operators in use, shows no measurable difference from that baseline.
+
 ##### Integration tests
 
 <!--
@@ -698,17 +776,12 @@ in back-to-back releases.
 - Feedback collected from early adopters in SIG-Scheduling
 - Performance testing shows that there is no significant scheduler latency increase nor memory usage increase.
 - Implement feature for DRA APIs
-- Stress testing with:
-  - 1000+ nodes with numeric taints
-  - 10,000+ pods with numeric tolerations  
-  - Mixed numeric/string operator usage
+- Stress testing.
 
 #### GA
 
 - Evidence of real-world adoption.
-- Complete scalability validation:
-  - 5000-node clusters with mixed taint/toleration workloads
-  - No performance regressions under sustained load
+- Complete scalability validation.
 
 ### Upgrade / Downgrade Strategy
 
@@ -1155,13 +1228,3 @@ information to express the idea and why it was not acceptable.
 -->
 
 ## Infrastructure Needed (Optional)
-
-<!--
-Use this section if you need things from the project/SIG. Examples include a
-new subproject, repos requested, or GitHub details. Listing these here allows a
-SIG to get the process for these resources started right away.
--->
-
-[kubernetes.io]: https://kubernetes.io/
-[kubernetes/enhancements]: https://git.k8s.io/enhancements
-[kubernetes/website]: https://git.k8s.io/website
diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml
@@ -4,8 +4,8 @@ authors:
   - "@jane.doe"
 owning-sig: sig-scheduling
 participating-sigs:
-  - sig-node
-status: provisional
+  - sig-apps
+status: implementable
 creation-date: 2025-08-08
 reviewers:
   - "@SergeyKanzhelev"
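
The `ToleratesTaint` hunk in the README diff is truncated, so the following is a minimal, self-contained Go sketch of how the `Lt`/`Gt` comparison could work. It is not the KEP's implementation: the `Taint` and `Toleration` structs are simplified stand-ins for the core API types, the lowercase `toleratesTaint` helper is hypothetical, and the comparison direction (`Gt` matches when the taint value is strictly greater than the toleration value) is an assumption inferred from the YAML comments in the diff.

```go
package main

import (
	"fmt"
	"strconv"
)

// TolerationOperator mirrors the operator names that appear in the diff.
type TolerationOperator string

const (
	TolerationOpExists TolerationOperator = "Exists"
	TolerationOpEqual  TolerationOperator = "Equal"
	TolerationOpLt     TolerationOperator = "Lt"
	TolerationOpGt     TolerationOperator = "Gt"
)

// Taint and Toleration are simplified stand-ins (assumption), not the core API types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key      string
	Operator TolerationOperator
	Value    string
	Effect   string
}

// toleratesTaint is a hypothetical helper: key and effect are matched first,
// and values are parsed as integers only in the Lt/Gt branch.
func toleratesTaint(t Toleration, taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "", TolerationOpEqual: // an empty operator is treated as Equal
		return t.Value == taint.Value
	case TolerationOpExists:
		return true
	case TolerationOpLt, TolerationOpGt:
		return compareValues(t.Value, taint.Value, t.Operator)
	default:
		return false
	}
}

// compareValues parses both values as base-10 int64 and compares them.
// Assumed direction: Gt means taint value > toleration value.
func compareValues(tolerationVal, taintVal string, op TolerationOperator) bool {
	tolNum, err := strconv.ParseInt(tolerationVal, 10, 64)
	if err != nil {
		return false // invalid toleration value never matches
	}
	taintNum, err := strconv.ParseInt(taintVal, 10, 64)
	if err != nil {
		return false // invalid taint value never matches
	}
	switch op {
	case TolerationOpLt:
		return taintNum < tolNum
	case TolerationOpGt:
		return taintNum > tolNum
	default:
		return false
	}
}

func main() {
	taint := Taint{Key: "node.kubernetes.io/sla", Value: "980", Effect: "NoSchedule"}
	tol := Toleration{Key: "node.kubernetes.io/sla", Operator: TolerationOpGt, Value: "950", Effect: "NoSchedule"}
	fmt.Println(toleratesTaint(tol, taint)) // prints "true"
}
```

Because parsing happens only in the `Lt`/`Gt` branch, `Equal` and `Exists` tolerations never reach `strconv.ParseInt` in this sketch, which is the property the performance tests added in the diff aim to verify.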
