Skip to content

Conversation

@andrewsykim
Copy link
Member

@andrewsykim andrewsykim commented Feb 6, 2025

  • One-line PR description: support node topologies in downward API

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Feb 6, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 6, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 6, 2025
@andrewsykim andrewsykim force-pushed the kep-4742 branch 2 times, most recently from c7bbaea to ac50d21 Compare February 6, 2025 21:37
@thockin
Copy link
Member

thockin commented Feb 6, 2025

Some notes from discussion this afternoon:

We don't really want per-node config because a pod might behave differently on different nodes (and because back-rev nodes are a thing).

We don't really want async copying from node->pod because env vars can only be set at pod start.

So we need a synchronous mechanism, which is why @munnerz went with Binding.

We don't want every scheduler to be responsible for setting topology labels, so we @munnerz added admission control.
That's the last synchronous hook-point before Kubelet could start running the pod.

I have heard now from folks who would like to expose OTHER node labels at the same time, at the discretion of the cluster admin, not the pod.

So why don't we just merge ALL of the Binding.metadata.labels into Pod.metadata.labels (without overwriting if they exist)? In fact, we could consider annotations, too, though that is speculative, so probably not.

We can offer a built-in admission e.g. "PodTopologyLabels" which copies the specific ones listed in this KEP.

If you are a cluster admin or operator and you want to expose labels to pods, add admission control on Binding.

It's not a terrible mechanism, IMO.

No magic virtual labels in kubelet.

No special copying labels to status.

Just a plain old merge of plain old data.

@thockin
Copy link
Member

thockin commented Feb 6, 2025

I have a one more question: this KEP proposes that ALL OF *.topology.k8s.io is for extensions. Is that too much? Should we make it *.x.topology.k8s.io (the x giving us room to add more prefixes if we need) ?

Copy link

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should do this. If we want to make it easy, let's publish a container image that people can use to make the Node topology information as data within an emptyDir (or similar). That approach has zero in-tree code to maintain, and works.

We could also consider kubernetes/kubernetes#116993 as a mechanism to make the output from the init container available as the environment variables of a container (avoids customizing the app container image).


## Implementation History

- the `Summary`, `Motivation`, `Design Details` sections being merged, signaling SIG acceptance
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no backticks needed

This isn't implementation history.

@andrewsykim andrewsykim force-pushed the kep-4742 branch 8 times, most recently from c83eba4 to f62b038 Compare February 12, 2025 18:06
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve for API
/lgtm

Needs sig-node and PRR


### Goals

* Values from Node labels `topology.k8s.io/zone`, `topology.k8s.io/region` and `kubernetes.io/hostname` are made
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you dropping the *.topology.k8s.io affordance from this KEP?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On one hand, less scope is more gooder.

On the other hand, I thought it was expressly part of the original KEP that this be more extensible? Yes, users can install a mutating admission - is that oo heavyweight for this? How common will extensibility be?

I guess we can wait and see.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I decided to only include the standard labels for now as it's safer and doing *.topology.k8s.io/* doesn't seem that much more extensible than what we doing now. Happy to revisit this after alpha once we get some user feedback.

* A feature gate, `PodTopologyLabelsAdmission` will be introduced in v1.33. Alpha and disabled by default.
The `PodTopologyLabels` admission plugin can only be set when this feature gate is enabled.
* The Binding REST implementation will be updated to copy all labels from `pods/binding` subresource into Pods.
At this point we do not remove labels, it is assumed that any labels in Binding are safe to expose via Downward API.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any label key in the Binding already exists in the pod, it will be dropped and will NOT overwrite the pod's value.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
@sftim
Copy link

sftim commented Feb 13, 2025

I've suggested recording the alternatives we've discounted. We should do that unless the suggestions - eg #5146 (comment) - are really off target.

@andrewsykim would you be willing to revise the KEP to record those discarded alternative approaches?

Copy link
Member

@johnbelamaric johnbelamaric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see if there's a scheduling metric we can watch, in case some bugs in this code cause failures during binding.


###### How can a rollout or rollback fail? Can it impact already running workloads?

Failure on rollout is unlikely since we do not override any existing topolgoy labels on Pods.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/topolgoy/topology/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


###### How can an operator determine if the feature is in use by workloads?

Look for topology labels in Pod labels.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is actually no user control over the feature so this question is N/A.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


## Implementation History

- the Summary, Motivation, Design Details sections being merged, signaling SIG acceptance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other KEPs for the relevant type of information to put in this section

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@andrewsykim andrewsykim force-pushed the kep-4742 branch 2 times, most recently from 152260c to b774f6d Compare February 13, 2025 17:57
@thockin
Copy link
Member

thockin commented Feb 13, 2025

Given there is no API surface, really, I am not sure my vote counts, but I am LGTM :)


* A built-in Kubernetes admission plugin, `PodTopologyLabels` will be introduced in kube-apiserver
* The `PodTopologyLabels` admission plugin is responsible for mutating `pods/binding` subresource, adding topology labels matching the target Node.
* `PodTopologyLabels` admission will overwrite `topology.k8s.io/*` labels on Pods.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of them? Or just those it plans to copy? I think we need to only overwrite what we plan to set - this way customer's mutating webhook is not affected and can copy additional labels

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goals section details topology.k8s.io/zone, topology.k8s.io/region and kubernetes.io/hostname are the only label values that will be copied in the admission plugin (these will be copied onto the Binding).

The registry is what is responsible for copying labels from the (mutated) binding onto the Pod resource (at the same time as it persists the assigned spec.nodeName).

Based on The Binding REST implementation will be updated to copy all labels from pods/binding subresource into Pods. (below), the registry will be enhanced to always copy across all labels from the Binding onto the Pod (regardless whether topology.k8s.io or example.com), however will only overwrite the 3 known zone, region and hostname labels (as per At this point we will overwrite Pod labels in Binding that are allowed to be exposed via Downward API. below).

I do agree we need to ensure we are water-tight on defining what and where we expose and copy, as service providers will build upon these behaviours quickly.

* The `PodTopologyLabels` admission plugin is responsible for mutating `pods/binding` subresource, adding topology labels matching the target Node.
* `PodTopologyLabels` admission will overwrite `topology.k8s.io/*` labels on Pods.
* A feature gate, `PodTopologyLabelsAdmission` will be introduced in v1.33. Alpha and disabled by default.
The `PodTopologyLabels` admission plugin can only be set when this feature gate is enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The `PodTopologyLabels` admission plugin can only be set when this feature gate is enabled.
The `PodTopologyLabels` admission plugin will be enabled by default when this feature gate is enabled.

Copy link
Member

@johnbelamaric johnbelamaric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits but generally LGTM for PRR.

#4742 needs lead-opted-in and v1.33 milestone so release team tracks this


## Implementation History

- - `v1.33`: initial KEP is accepeted and alpha implementation is complete
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: formatting here

items:
- path: "zone"
fieldRef:
fieldPath: metadata.labels['topology.kubernetes.io/zone']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good suggestion, thanks Sergey

@andrewsykim andrewsykim force-pushed the kep-4742 branch 3 times, most recently from 8cfe60a to d105e2d Compare February 13, 2025 19:04
Signed-off-by: Andrew Sy Kim <[email protected]>
Co-authored-by: docandrew <[email protected]>
@dchen1107 dchen1107 added this to the v1.33 milestone Feb 14, 2025
@dchen1107
Copy link
Member

/approve based on Tim and Sergey's review.

@SergeyKanzhelev
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2025
Copy link
Member

@johnbelamaric johnbelamaric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve for PRR

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, dchen1107, johnbelamaric, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2025
@k8s-ci-robot k8s-ci-robot merged commit e7ff6a2 into kubernetes:master Feb 14, 2025
3 of 4 checks passed
## Proposal

Retreival of topology information will be achieved through the existing mechanism of passing down Pod labels using Downward API.
A new admission plugin, `PodTopologyLabels`, will be added to kube-apiserver. When enabled, it will mutate the `pods/binding` subresource,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like there are two separate (yet related) questions:

  • Whether to enable PodTopologyLabels the admission by default
  • Whether to allow certain pods to opt in or out this feature.

One potential issue I see is that if we do this by default for all pods, it might cause problems for users/pods that do not want this feature for whatever reason. And we're unlikely to get feedback from them until this is made the default.

##### e2e tests

E2E tests will be added to test the following scenarios:
* Pods contain topology labels when `PodTopologyLabels` admission plugin is enabled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and not contain any additional labels

E2E tests will be added to test the following scenarios:
* Pods contain topology labels when `PodTopologyLabels` admission plugin is enabled.
* A Pod using downward API to retrieve the underlying topology information about the Node
* A Pod attempting to use downward API to retrieve Node labels that are not the standard topology labels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding a test case of consuming topology labels via downward API when the some labels are not set

@sftim
Copy link

sftim commented Mar 3, 2025

I wonder if we'd like to suggest a curated ValidatingAdmissionPolicy that rejects problem Pods. Eg pod is labelled as topology.k8s.io/region: mars but has a node selector that requires topology.k8s.io/region to be venus.

@munnerz
Copy link
Member

munnerz commented Mar 3, 2025

So why don't we just merge ALL of the Binding.metadata.labels into Pod.metadata.labels (without overwriting if they exist)? In fact, we could consider annotations, too, though that is speculative, so probably not.

Funny enough, this already is done for annotations here: https://github.com/kubernetes/kubernetes/blob/63ef2dd03de1560ace32e145c531e1148a54f827/pkg/registry/core/pod/storage/storage.go#L242-L247

So technically speaking it should be possible today to implement just the admission plugin, with the one caveat that the labels must be mapped to annotations, creating a confusing user experience.

That said, with label selectors often being used to configure things like traffic routing (which in some service meshes also encompassing service identity), it does feel slightly less scary to copy the scheduler and/or admission set values as annotations instead...

@munnerz
Copy link
Member

munnerz commented Mar 6, 2025

@andrewsykim can you update the title of this PR to reflect the correct KEP number? (4742, not 4724!)

@andrewsykim andrewsykim changed the title KEP-4724: support node topologies in downward API KEP-4742: support node topologies in downward API Mar 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants