Commit a751de7

[KEP-4680] Update README to include configurable HealhCheckTimeout
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
1 parent 9177d4a commit a751de7

keps/sig-node/4680-add-resource-health-to-pod-status/README.md

Lines changed: 15 additions & 6 deletions
@@ -259,12 +259,13 @@ We may consider this as a future improvement.
 
 ### Notes/Constraints/Caveats (Optional)
 
-<!--
-What are the caveats to the proposal?
-What are some important details that didn't come across above?
-Go in to as much detail as necessary here.
-This might be a good place to talk about core concepts and how they relate.
--->
+- **DRA Device Health Timeout Configuration:** Currently, the timeout for marking a DRA device's health as "Unknown"
+when no updates are received is hardcoded to 30 seconds. This is not ideal as different hardware types
+(e.g., GPUs, FPGAs, TPUs, storage devices) may have significantly different health-reporting characteristics
+and require different timeout values. A per-plugin configurable timeout setting will be implemented before
+Beta graduation to allow each vendor to specify an appropriate timeout for their hardware. See
+[Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118) and the discussion in
+[PR #130606](https://github.com/kubernetes/kubernetes/pull/130606/files#r2221829511) for more details.
 
 ### Risks and Mitigations
 
@@ -310,6 +311,13 @@ optional, proactive health reporting mechanism from DRA plugins.
 will be responsible for reconciling the state reported by the plugin, handling
 timeouts for stale data (marking devices as "Unknown" if not updated
 within a certain period), and persisting this information across Kubelet restarts.
+
+**Note:** The timeout for marking a device's health as "Unknown" is currently
+hardcoded to 30 seconds. As tracked in [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118),
+this timeout should be made configurable per plugin, as different hardware types
+(e.g., GPUs, FPGAs, TPUs, storage) may have very different health-reporting
+characteristics and require different timeout values. Making this configurable
+is a prerequisite for graduating this feature to Beta.
 
 3. **Kubelet Integration:** The DRA Manager in Kubelet will act as the gRPC client.
 Upon plugin registration, it will attempt to initiate the health monitoring
@@ -448,6 +456,7 @@ Planned tests will cover the user-visible behavior of the feature:
 #### Beta
 
 - Complete e2e tests coverage
+- Make device health check timeout configurable per plugin (see [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118))
 
 #### GA
 
