Conversation

ArangoGutierrez
Contributor

  • One-line PR description:

Update the KEP-4680 README to add a configurable health check timeout to the DRA API

  • Issue link:
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 13, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Aug 13, 2025
@ArangoGutierrez
Contributor Author

cc @SergeyKanzhelev @Jpsassine

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Aug 13, 2025
… message

- Added health_check_timeout_seconds field to DeviceHealth message
- Updated documentation to reflect that timeout is now configurable per device
- Changed Beta graduation criteria from 'implement' to 'verify' since feature is now included in initial design
- Addresses PR feedback about DRA API for timeout configuration

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
// Health check timeout duration in seconds for this device.
// If not specified or zero, Kubelet will use a default timeout.
// Optional.
int64 health_check_timeout_seconds = 5;
Member

What would a negative value or 0 mean? Let's specify that in the field description.

// Health check timeout duration in seconds for this device.
// If not specified or zero, Kubelet will use a default timeout.
// Optional.
int64 health_check_timeout_seconds = 5;
Member

Quick question for the PR review, I think: is a Duration field OK to use in our APIs, or is it not recommended?

Member

@SergeyKanzhelev SergeyKanzhelev left a comment

/lgtm

We need to specify the behavior when a negative value is set. That can be done during implementation.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 14, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ArangoGutierrez, SergeyKanzhelev
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ArangoGutierrez
Contributor Author

/assign @mrunalp

ArangoGutierrez added a commit to ArangoGutierrez/kubernetes that referenced this pull request Aug 28, 2025
Implements device-specific health check timeouts in the DRA health monitoring
system as defined in KEP-4680. This allows DRA drivers to specify custom
timeout values for individual devices through the gRPC health API.

Changes:
- Add HealthCheckTimeout field to state.DeviceHealth struct to store
  device-specific timeout durations
- Add health_check_timeout_seconds field to DeviceHealth proto message
  in the DRA health gRPC API (v1alpha1)
- Update manager.go to extract timeout from gRPC responses and apply
  DefaultHealthTimeout (30s) when not specified
- Handle negative timeout values defensively by logging a warning and
  falling back to the default timeout
- Simplify healthinfo.go by removing redundant fallback logic since
  timeouts are now always set at creation time
- Update tests to include HealthCheckTimeout in test fixtures

The timeout behavior is:
- Positive values: Use the specified timeout in seconds
- Zero or unspecified: Use DefaultHealthTimeout (30 seconds)
- Negative values: Log warning and use DefaultHealthTimeout

This implementation provides flexibility for DRA drivers to define
appropriate health check intervals for different device types while
maintaining backward compatibility through sensible defaults.

Ref: KEP-4680 (Add Resource Health to Pod Status)
Ref: kubernetes/enhancements#5476

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
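
The timeout rules listed in the commit message above (positive, zero or unspecified, negative) can be sketched in Go roughly as follows. This is a minimal illustration, not the code from the referenced commit: the helper name `resolveHealthCheckTimeout` and its boolean "warn" return value are assumptions made for the example, and only the 30-second `DefaultHealthTimeout` value and the three cases come from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// DefaultHealthTimeout mirrors the 30-second default named in the commit message.
const DefaultHealthTimeout = 30 * time.Second

// resolveHealthCheckTimeout is a hypothetical helper (not the kubelet's actual
// function). It maps the raw health_check_timeout_seconds value reported by a
// driver to an effective timeout. The second return value reports whether the
// caller should log a warning because the input was negative.
func resolveHealthCheckTimeout(seconds int64) (time.Duration, bool) {
	switch {
	case seconds > 0:
		// Positive values: use the specified timeout in seconds.
		return time.Duration(seconds) * time.Second, false
	case seconds == 0:
		// Zero or unspecified (the proto3 default for int64): use the default.
		return DefaultHealthTimeout, false
	default:
		// Negative values: fall back to the default and ask the caller to warn.
		return DefaultHealthTimeout, true
	}
}

func main() {
	for _, s := range []int64{45, 0, -5} {
		timeout, warn := resolveHealthCheckTimeout(s)
		fmt.Printf("seconds=%d timeout=%s warn=%t\n", s, timeout, warn)
	}
}
```

Returning a flag instead of logging inside the helper keeps the sketch free of any logging dependency; an implementation inside the kubelet would presumably log through its own logger at this point.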