keps/sig-node/4680-add-resource-health-to-pod-status/README.md
15 additions & 6 deletions
@@ -259,12 +259,13 @@ We may consider this as a future improvement.

 ### Notes/Constraints/Caveats (Optional)

-<!--
-What are the caveats to the proposal?
-What are some important details that didn't come across above?
-Go in to as much detail as necessary here.
-This might be a good place to talk about core concepts and how they relate.
--->
+- **DRA Device Health Timeout Configuration:** Currently, the timeout for marking a DRA device's health as "Unknown"
+when no updates are received is hardcoded to 30 seconds. This is not ideal as different hardware types
+(e.g., GPUs, FPGAs, TPUs, storage devices) may have significantly different health-reporting characteristics
+and require different timeout values. A per-plugin configurable timeout setting will be implemented before
+Beta graduation to allow each vendor to specify an appropriate timeout for their hardware. See
+[Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118) and the discussion in
+[PR #130606](https://github.com/kubernetes/kubernetes/pull/130606/files#r2221829511) for more details.

 ### Risks and Mitigations

@@ -310,6 +311,13 @@ optional, proactive health reporting mechanism from DRA plugins.
 will be responsible for reconciling the state reported by the plugin, handling
 timeouts for stale data (marking devices as "Unknown" if not updated
 within a certain period), and persisting this information across Kubelet restarts.
+
+**Note:** The timeout for marking a device's health as "Unknown" is currently
+hardcoded to 30 seconds. As tracked in [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118),
+this timeout should be made configurable per plugin, as different hardware types
+(e.g., GPUs, FPGAs, TPUs, storage) may have very different health-reporting
+characteristics and require different timeout values. Making this configurable
+is a prerequisite for graduating this feature to Beta.

 3. **Kubelet Integration:** The DRA Manager in Kubelet will act as the gRPC client.
 Upon plugin registration, it will attempt to initiate the health monitoring
@@ -448,6 +456,7 @@ Planned tests will cover the user-visible behavior of the feature:
 #### Beta

 - Complete e2e tests coverage
+- Make device health check timeout configurable per plugin (see [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118))