HirazawaUi
Contributor

@HirazawaUi HirazawaUi commented Aug 23, 2025

  • One-line PR description: This KEP aims to ensure that restarting the kubelet for a short period does not affect the status of pods on the node.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 23, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: HirazawaUi
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 23, 2025
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Aug 23, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp August 23, 2025 12:07
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 23, 2025
@HirazawaUi HirazawaUi force-pushed the kep-4781 branch 2 times, most recently from eca940b to acbdb7e Compare August 25, 2025 15:34
@HirazawaUi HirazawaUi changed the title [WIP] KEP-4781 restarting kubelet does not change pod status KEP-4781 restarting kubelet does not change pod status Aug 25, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 25, 2025
By preserving the old state without immediate health checks, there is a delay in recognizing containers that have become unhealthy during or after kubelet's downtime. Services relying on Pod readiness for service discovery might continue directing traffic to Pods with containers that are no longer healthy but are still reported as Ready.
To reduce the risk caused by such delays, we plan to trigger a probe immediately afterward.
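
As a rough illustration of this risk and the proposed mitigation, here is a toy model in Python. It is not kubelet code: `restore_readiness`, its parameters, and the container names are hypothetical and chosen only for this sketch. It assumes readiness can be modeled as a per-container boolean and a probe as a function that returns the container's current health.

```python
from typing import Callable, Dict


def restore_readiness(
    previous: Dict[str, bool],
    probe: Callable[[str], bool],
    probe_immediately: bool,
) -> Dict[str, bool]:
    """Toy model of readiness handling after a kubelet restart (hypothetical).

    previous: readiness reported for each container before the restart.
    probe: returns the container's actual health right now.
    probe_immediately: whether to probe as soon as the old state is restored.
    """
    # Preserve the old state instead of resetting every container to not-ready.
    restored = dict(previous)
    if probe_immediately:
        # Refresh each entry right away so stale "Ready" values do not linger.
        for name in restored:
            restored[name] = probe(name)
    return restored


# "web" became unhealthy while the kubelet was down.
previous = {"web": True, "sidecar": True}
actual_health = {"web": False, "sidecar": True}

stale = restore_readiness(previous, actual_health.__getitem__, probe_immediately=False)
fresh = restore_readiness(previous, actual_health.__getitem__, probe_immediately=True)
```

Without the immediate probe, `web` stays reported as ready until the next scheduled probe runs, which is the delay described above. With it, the stale value is corrected as soon as the old state is restored.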

## Design Details
Contributor Author

@HirazawaUi HirazawaUi Aug 25, 2025

I did not follow the implementation approach of the previous KEP. After reviewing the POC PR for that KEP, I found the implementation somewhat cumbersome, and it also had some potential edge-case issues.

After tracing the pod status transition process, I adopted a new implementation approach: consistently relying on the detection results of the probeManager. This simplifies the implementation and helps us avoid certain edge cases. This section also analyzes how kubelet's behavior differs across several scenarios. Could you please take a look?

My POC PR: kubernetes/kubernetes#133676

@SergeyKanzhelev @thockin
