Skip to content

Conversation

pololowww
Copy link

@pololowww pololowww commented Aug 9, 2024

  • One-line PR description: Adding new KEP to fix inconsistent container start and ready state after kubelet restart

@k8s-ci-robot k8s-ci-robot requested a review from thockin August 9, 2024 09:14
@k8s-ci-robot
Copy link
Contributor

@pololowww: GitHub didn't allow me to request PR reviews from the following users: chenk008.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

  • One-line PR description: Adding new KEP to fix inconsistent container ready state after kubelet restart

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link

linux-foundation-easycla bot commented Aug 9, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Aug 9, 2024
@k8s-ci-robot
Copy link
Contributor

Welcome @pololowww!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 9, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @pololowww. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 9, 2024
@pacoxu
Copy link
Member

pacoxu commented Aug 9, 2024

/cc @bart0sh @SergeyKanzhelev
/assign @mrunalp

Copy link
Member

@pacoxu pacoxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @matthyx

@k8s-ci-robot k8s-ci-robot requested a review from matthyx August 9, 2024 10:23
@bart0sh
Copy link
Contributor

bart0sh commented Aug 9, 2024

Thank you for your PR. Please sign the CLA to proceed further, thanks.

```
##### Changes in probe

We will update the UpdatePodStatus function to:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can also properly set started for a container (ref kubernetes/kubernetes#115553)

We will update the UpdatePodStatus function to:

1. Track containers with readiness probes.
2. Preserve the previous ready state for containers with readiness probes that haven't run yet after kubelet restart.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how you plan to persist ready states

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from API server - read the state of Pods as an initial state for probe managers

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the ready state will be achieved from pod.Status.Conditions.

@pololowww
Copy link
Author

I have modified the details of my KEP to support inheritance of the Start and Ready states, and reconsidered the scenarios where the kubelet startup time exceeds the period. Updated in the new commit already.
Generally, when containerStartTime is before the Kubelet, we default to inheriting the previous container Start status, and the Ready status is achieved from API Server. If there are any unreasonable aspects to consider, please let me know :-)

@pololowww pololowww changed the title KEP-4781: Fix inconsistent container ready state after kubelet restart KEP-4781: Fix inconsistent container start and ready state after kubelet restart Aug 14, 2024
## Drawbacks

<!--
Why should this KEP _not_ be implemented?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hypothetically, if an edge case occurs where a pod that was ready transitions to an unready state when kubelet restarts, and we choose to treat it as still ready, the service's endpoints would still include this pod. This could lead to network traffic being directed to a pod that has encountered an issue.

This KEP is not without risks. We should articulate its drawbacks here and explain why, despite these drawbacks, we are still choosing to proceed with it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestions:-). Risks or drawbacks are already mentioned in my new commit. We also plan to trigger the probe immediately to reduce the risks.
BTW this is my first to write a KEP and i am not sure how detailed it should be.


##### Changes in kubelet_pods.go

We will be updating the `convertToAPIContainerStatuses` function, which is responsible for calculating the API-recognizable v1.ContainerStatus based on the old and current container states. Our modification will involve preserving the Started status from the oldStatus when the container's start time is earlier than the kubelet's start time. This will facilitate the subsequent `UpdatePodStatus` to inherit the Started state.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we obtain the startup time of kubelet, could you briefly explain?

Copy link
Author

@pololowww pololowww Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the startup time of Kubelet probeManager is more accurate(worker.probeManager.start). Modified already in the new commit:)

extending the production code to implement this enhancement.
-->

- `<package>`: `<date>` - `<test coverage>`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add your test plan. I assume you've already implemented a POC, so this should be easy for you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still applies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this feature will definitely need strong test coverage and may not be straightforward to test. we should start planning that early.

@ffromani
Copy link
Contributor

/cc

@k8s-ci-robot k8s-ci-robot requested a review from ffromani August 17, 2024 07:55
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pololowww
Once this PR has been reviewed and has the lgtm label, please ask for approval from mrunalp and additionally assign soltysh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pololowww pololowww closed this Oct 4, 2024
@SergeyKanzhelev
Copy link
Member

@pololowww did you mean to close this PR?

@pololowww
Copy link
Author

@pololowww did you mean to close this PR?

@SergeyKanzhelev No, i just wanted to close a comment and maybe made some mistakes... Can we reopen it? I need to merge this PR actually.

@pololowww pololowww reopened this Oct 5, 2024
@HirazawaUi
Copy link
Contributor

@pololowww Could you please fix the failing test? You just need to run make update-toc locally and submit the updated content.

Copy link
Member

@BenTheElder BenTheElder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[PRR Shadow]

extending the production code to implement this enhancement.
-->

- `<package>`: `<date>` - `<test coverage>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this feature will definitely need strong test coverage and may not be straightforward to test. we should start planning that early.

- [ ] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: `PodUnreadyOnKubeletRestart`
- Components depending on the feature gate: `kubelet`
- [ ] Other
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete?


Due to the use of a feature gate, the feature can be disabled by setting the gate to false.

###### What happens if we reenable the feature if it was previously rolled back?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answer this one?

### Rollout, Upgrade and Rollback Planning

<!--
This section must be completed when targeting beta to a release.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it must be completed only by Beta, I'd encourage thinking about beta/GA requirements sooner than later.

[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
-->

- [ ] Feature gate (also fill in values in `kep.yaml`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- [ ] Feature gate (also fill in values in `kep.yaml`)
- [x] Feature gate (also fill in values in `kep.yaml`)

feature gate after having objects written with the new field) are also critical.
You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add something here. At the very minimum, for this feature, we should include tests that verify the behavior with the feature turned off.

@jpbetz
Copy link
Contributor

jpbetz commented Oct 10, 2024

Quick reminder that enhancement freeze is today and there is outstanding Production Readiness Review feedback that would need to be addressed for this to be approved.

@intUnderflow
Copy link
Member

Hey @pololowww, this is a really excellent first time KEP and a lot of folks are really excited about it. I saw you haven't been around for a few months, but we'd love to help you drive this through to merge. Is this still active on your side / would you like some support in closing this out?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 18, 2025
@intUnderflow
Copy link
Member

Planning to commandeer this into a new PR (with credit) once back from holiday unless anyone objects?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 17, 2025
@thockin
Copy link
Member

thockin commented May 18, 2025

@intUnderflow did you open a new KEP?

@thockin thockin removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 18, 2025
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2025
@HirazawaUi
Copy link
Contributor

@intUnderflow Are you still interested in this work? Or if you lack bandwidth, I'd be happy to continue this KEP in the v1.35 release cycle and implement the code for it.

@HirazawaUi
Copy link
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 17, 2025
@HirazawaUi
Copy link
Contributor

A new PR #5493 has been opened to supersede this one.

/close

@k8s-ci-robot
Copy link
Contributor

@HirazawaUi: Closed this PR.

In response to this:

A new PR #5493 has been opened to supersede this one.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Closed
Development

Successfully merging this pull request may close these issues.