Skip to content

Conversation

helayoty
Copy link
Member

@helayoty helayoty commented Aug 11, 2025

  • One-line PR description: Add numeric comparison operators (Lt, Gt) to Tolerations for SLA-based scheduling with threshold-based taint matching.
  • Other comments: cc @kubernetes/sig-scheduling-misc @kubernetes/sig-apps-misc

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Aug 11, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 11, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Aug 11, 2025
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Aug 11, 2025
@k8s-ci-robot k8s-ci-robot requested a review from dom4ha August 11, 2025 21:48
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Aug 11, 2025
@k8s-ci-robot k8s-ci-robot requested a review from macsko August 11, 2025 21:48
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 11, 2025
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we include the scenario of adding tolerations on the semantic version comparison? It will likely require either a new operator or some way to express that a taint's string needs to be parsed as a semver.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current scope for this KEP is integer-only support for SLA/failure-probability because it’s the simplest and safest to implement (avoids floating-point parsing and complex type semantics).

I agree that semver comparisons are a valid and important future use case (e.g., kubelet version taints, device firmware versions). To keep this KEP narrowly scoped and implementable, I propose to document the semantic version comparison in a new Future Work section. And also consider if we would add this to the node affinity as well. wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One consideration here is what is the delta here. It is a lot of red tape to run thru scenarios and API reviews. Making two changes together may be much easier than splitting into separate KEPs.

I am not objecting to this scope though. This is not blocking proposal, but I believe it will be very useful.

BTW, we may also want to support semver in CEL as well =). So potentially another scope increase

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the scope to include both semver and CEL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanposhiho fyi for semver addition.

Copy link
Member Author

@helayoty helayoty Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi, after discussing it with @sanposhiho @macsko, we will have a separate KEP for how the semver can be implemented onto nodeaffinity, taint/toleration, and CEL, along with evaluating use case for each.

Created issue to track this feature ->#5500

@helayoty helayoty force-pushed the helayoty/enable-sla-based-schedule branch from 2a36559 to c9e75ba Compare August 15, 2025 23:18
@helayoty helayoty moved this from Needs Triage to In Progress in SIG Scheduling Aug 15, 2025
Signed-off-by: Heba Elayoty <[email protected]>
@macsko
Copy link
Member

macsko commented Aug 22, 2025

/cc @dom4ha @sanposhiho

@helayoty helayoty requested a review from everpeace August 22, 2025 15:39
- Enhance CEL semver library with additional comparison functions for consistent version handling.
- Keep behavior consistent with existing effects (`NoSchedule`, `PreferNoSchedule`, `NoExecute`).
- Provide unified semantic version handling across scheduling and admission control.
- Backward compatible and opt‑in via feature gates.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fair to add a goal to ensure that the addition of new operators has zero operational effect on existing pod scheduling outcomes using the Equality or Exists operators? I'm thinking of the various switch toleration.Operator bits in source and assuming that adding more operators should have zero cost to existing scenarios.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, totally fair. Added a goal in addition to performance tests.


As a cluster operator, I want a default repel from spot (low-SLA) nodes so that only workloads that explicitly tolerate them can land there.

I also want to set numeric SLA thresholds in tolerations (e.g., `Gt 950`) so pods can opt-in to reliable nodes or specific SLA bands without having to hardcode every SLA class in NodeAffinity rules.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to consider: for every scenario that is (strictly speaking) possible using existing semantics (e.g., NodeAffinity) it might be helpful to include that scenario here. A before/after type picture. Not wanting to add too much information to the KEP, but that would help folks to assess the ergonomic improvements.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Differed to the separate KEP we will create for semver/CEL

- key: node.kubernetes.io/sla
operator: Gt
value: "750"
effect: NoSchedule

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can also inclue an example pod spec that will explicitly not get scheduled onto a spot node, e.g.,

---
# Critical workload will not be scheduled until a suitable high reliability node has capacity
apiVersion: v1
kind: Pod
metadata:
  name: critical-workload
spec:
  tolerations:
  - key: node.kubernetes.io/sla
    operator: Gt
    value: "950"
    effect: NoSchedule

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the example.


This ensures DRA allocations are both resource-correct and reliability-compliant.

**Example Configuration:**

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the ResourceSlice example that includes the taint that ResourceClaim "gpu-claim-high-sla" will tolerate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

value: "24"
effect: NoSchedule
---
# Batch training workload tolerates degraded devices

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Let's call this a "Short-lived batch training workload" to distinguish from other (very normal) batch workloads that would not want to be interrupted in an hour.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.


#### Story 6 — Kubernetes version compatibility for critical workloads

As a cluster operator managing a mixed-version Kubernetes cluster during rolling upgrades, I want to ensure critical workloads only run on nodes with Kubernetes version >= 1.20.0 due to specific API features they require, while allowing development workloads to tolerate older versions.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: to make this more current, let's add 1.33/1.34 as our two example versions here

Copy link
Member Author

@helayoty helayoty Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Differed to the future KEP for semver/CEL

spec:
tolerations:
- key: node.kubernetes.io/version
operator: SemverGt

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SemverGe for this and below in this example?

Copy link
Member Author

@helayoty helayoty Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Differed to the future KEP for semver/CEL

- Values must conform to [Semantic Versioning 2.0.0](https://semver.org/) specification requiring exactly 3 components (major.minor.patch)
- Supports both prefixed (`"v1.20.1"`) and non-prefixed (`"1.20.1"`) formats following Kubernetes conventions
- Invalid semver strings (e.g., `"1.20"`, `"1.20.1.1"`, `"1"`) cause validation errors
- Pre-release versions (e.g., `1.20.0-alpha.1`) have lower precedence than release versions (`1.20.0`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the pre-release and build metadata points are part of semver, so we could drop them. Or if being super explicit is helpful, let's add the "per semver specification" suffix to the Pre-release versions line item as well.

Copy link
Member Author

@helayoty helayoty Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Differed to the future KEP for semver/CEL

- Pre-release versions (e.g., `1.20.0-alpha.1`) have lower precedence than release versions (`1.20.0`)
- Build metadata (`+build.123`) is ignored in comparisons per semver specification

- **Type Mismatch Handling**: If toleration and taint values cannot be parsed as the same type (integer vs semver), the toleration does not match. This prevents unexpected behavior and ensures type safety.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies I could have two nodes on my cluster like this:

apiVersion: v1
kind: Node
metadata:
  name: node-1
spec:
  taints:
  - key: node.kubernetes.io/foo
    value: "100"
    effect: NoExecute
---
apiVersion: v1
kind: Node
metadata:
  name: node-2
spec:
  taints:
  - key: node.kubernetes.io/foo
    value: "v1.34.0"
    effect: NoExecute

And that the scheduler would simply schedule workloads with proper tolerations using either of the above semantics (int comparison or semver comparison) to the right node.

Also, apiserver has no preference between either of the below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/sla
        operator: Gt
        value: "950"
        effect: NoExecute
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/sla
        operator: Gt
        value: "v1.33.0"
        effect: NoExecute

During scheduling, if we encounter a matching node taint whose value is of a different type than the workload toleration (int vs semver) then we simply continue and rule out that node as a candidate for scheduling.

Just thinking out loud here, did I get that right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Differed to the future KEP for semver/CEL


- **Parsing Overhead**: Each taint/toleration match with comparison operators requires parsing attempts (integer first, then semver), adding computational cost.

> Note: A taint like `foo=95.5:NoSchedule` or `foo=v1.20.1:NoSchedule` is valid since taint values follow label values syntax. The type detection and parsing occurs during toleration matching, not validation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this observation worth including as a Risk? E.g.:

**Risk**: Invalid taints meant to be used with the new comparison operators (e.g., `node.kubernetes.io/sla=95.5` and `node.kubernetes.io/version=1`) are not detected at admission time

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Added.

@helayoty helayoty force-pushed the helayoty/enable-sla-based-schedule branch from 5df1161 to cd9d4a3 Compare August 26, 2025 08:40
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: helayoty
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric, sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Heba Elayoty <[email protected]>
@helayoty helayoty force-pushed the helayoty/enable-sla-based-schedule branch from e5b0b7b to 4e1a1cb Compare August 26, 2025 10:14
@helayoty helayoty moved this from Needs Triage to Backlog in SIG Apps Aug 26, 2025
@helayoty helayoty moved this from Backlog to Needs Review in SIG Apps Aug 26, 2025

nVal, nErr := strconv.ParseInt(taintVal, 10, 64)
if nErr != nil {
return false // Invalid taint value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't returning just false here too simple? How would the user know that the node was rejected because of non-numeric taint?

The following scenarios need to be covered in integration tests:

- Feature gate's enabling/disabling
- **Scheduler Integration Tests:** will be extended to cover the new taints cases introduced in this KEP:(pkg/scheduler/framework/plugins/tainttoleration/taint_toleration_test.go)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You linked the unit tests. Integration tests should be somewhere in the test/integration/scheduler package or test/integration/scheduler_perf for performance tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: Needs Review
Status: In Progress
Development

Successfully merging this pull request may close these issues.

8 participants