[KEP-4680] Update README to include configurable HealthCheckTimeout #5476

ArangoGutierrez · 2025-08-13T11:20:51Z

One-line PR description:

Update the KEP-4680 README to add a configurable health check timeout to the DRA API

Issue link:

Other comments:

ArangoGutierrez · 2025-08-13T11:21:06Z

keps/sig-node/4680-add-resource-health-to-pod-status/README.md

… message
- Added health_check_timeout_seconds field to DeviceHealth message
- Updated documentation to reflect that timeout is now configurable per device
- Changed Beta graduation criteria from 'implement' to 'verify' since feature is now included in initial design
- Addresses PR feedback about DRA API for timeout configuration

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

SergeyKanzhelev · 2025-08-14T21:53:31Z

keps/sig-node/4680-add-resource-health-to-pod-status/README.md

+  // Health check timeout duration in seconds for this device.
+  // If not specified or zero, Kubelet will use a default timeout.
+  // Optional.
+  int64 health_check_timeout_seconds = 5;


What would a negative value or 0 mean? Let's specify that in the field description.

SergeyKanzhelev · 2025-08-14T21:53:59Z

keps/sig-node/4680-add-resource-health-to-pod-status/README.md

+  // Health check timeout duration in seconds for this device.
+  // If not specified or zero, Kubelet will use a default timeout.
+  // Optional.
+  int64 health_check_timeout_seconds = 5;


Quick question for PR review, I think: is a Duration field OK to use in our APIs, or is it not recommended?

SergeyKanzhelev

/lgtm

We need to specify the behavior when a negative value is set. That can be done during implementation.

k8s-ci-robot · 2025-08-14T21:55:25Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ArangoGutierrez, SergeyKanzhelev
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/sig-node/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ArangoGutierrez · 2025-08-15T06:12:31Z

/assign @mrunalp

Implements device-specific health check timeouts in the DRA health monitoring system as defined in KEP-4680. This allows DRA drivers to specify custom timeout values for individual devices through the gRPC health API.

Changes:
- Add HealthCheckTimeout field to state.DeviceHealth struct to store device-specific timeout durations
- Add health_check_timeout_seconds field to DeviceHealth proto message in the DRA health gRPC API (v1alpha1)
- Update manager.go to extract timeout from gRPC responses and apply DefaultHealthTimeout (30s) when not specified
- Handle negative timeout values defensively by logging a warning and falling back to the default timeout
- Simplify healthinfo.go by removing redundant fallback logic since timeouts are now always set at creation time
- Update tests to include HealthCheckTimeout in test fixtures

The timeout behavior is:
- Positive values: Use the specified timeout in seconds
- Zero or unspecified: Use DefaultHealthTimeout (30 seconds)
- Negative values: Log warning and use DefaultHealthTimeout

This implementation provides flexibility for DRA drivers to define appropriate health check intervals for different device types while maintaining backward compatibility through sensible defaults.

Ref: KEP-4680 (Add Resource Health to Pod Status)
Ref: kubernetes/enhancements#5476

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

k8s-ci-robot added the "cncf-cla: yes" label (indicates the PR's author has signed the CNCF CLA) Aug 13, 2025

k8s-ci-robot requested review from dchen1107 and mrunalp August 13, 2025 11:20

k8s-ci-robot added the "kind/kep" (categorizes KEP tracking issues and PRs modifying the KEP directory) and "sig/node" (categorizes an issue or PR as relevant to SIG Node) labels Aug 13, 2025

k8s-ci-robot added the "size/S" label (denotes a PR that changes 10-29 lines, ignoring generated files) Aug 13, 2025

SergeyKanzhelev reviewed Aug 13, 2025

keps/sig-node/4680-add-resource-health-to-pod-status/README.md (outdated)

ArangoGutierrez force-pushed the i133118 branch from a751de7 to 8279c43 Compare August 14, 2025 08:36

ArangoGutierrez requested a review from SergeyKanzhelev August 14, 2025 08:37

ArangoGutierrez mentioned this pull request Aug 14, 2025

[KEP-4680] DRA: Make device health check timeout configurable kubernetes/kubernetes#133118

Open

SergeyKanzhelev reviewed Aug 14, 2025

SergeyKanzhelev approved these changes Aug 14, 2025

k8s-ci-robot assigned SergeyKanzhelev Aug 14, 2025

k8s-ci-robot added the "lgtm" label ("Looks good to me", indicates that a PR is ready to be merged) Aug 14, 2025

k8s-ci-robot assigned mrunalp Aug 15, 2025

ArangoGutierrez mentioned this pull request Aug 28, 2025

DRA: Add configurable health check timeout per device kubernetes/kubernetes#133752

Open
