[KEP-4680] Update README to include configurable HealthCheckTimeout

ArangoGutierrez · ArangoGutierrez · commit a751de7b2c21 · 2025-08-13T13:19:27.000+02:00
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md
@@ -259,12 +259,13 @@ We may consider this as a future improvement.
 
 ### Notes/Constraints/Caveats (Optional)
 
-<!--
-What are the caveats to the proposal?
-What are some important details that didn't come across above?
-Go in to as much detail as necessary here.
-This might be a good place to talk about core concepts and how they relate.
--->
+- **DRA Device Health Timeout Configuration:** Currently, the timeout for marking a DRA device's health as "Unknown" 
+  when no updates are received is hardcoded to 30 seconds. This is not ideal as different hardware types 
+  (e.g., GPUs, FPGAs, TPUs, storage devices) may have significantly different health-reporting characteristics 
+  and require different timeout values. A per-plugin configurable timeout setting will be implemented before 
+  Beta graduation to allow each vendor to specify an appropriate timeout for their hardware. See 
+  [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118) and the discussion in 
+  [PR #130606](https://github.com/kubernetes/kubernetes/pull/130606/files#r2221829511) for more details.
 
 ### Risks and Mitigations
 
@@ -310,6 +311,13 @@ optional, proactive health reporting mechanism from DRA plugins.
     will be responsible for reconciling the state reported by the plugin, handling
     timeouts for stale data (marking devices as "Unknown" if not updated
     within a certain period), and persisting this information across Kubelet restarts.
+    
+    **Note:** The timeout for marking a device's health as "Unknown" is currently
+    hardcoded to 30 seconds. As tracked in [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118),
+    this timeout should be made configurable per plugin, as different hardware types
+    (e.g., GPUs, FPGAs, TPUs, storage) may have very different health-reporting
+    characteristics and require different timeout values. Making this configurable
+    is a prerequisite for graduating this feature to Beta.
 
 3.  **Kubelet Integration:** The DRA Manager in Kubelet will act as the gRPC client.
     Upon plugin registration, it will attempt to initiate the health monitoring
@@ -448,6 +456,7 @@ Planned tests will cover the user-visible behavior of the feature:
 #### Beta
 
 - Complete e2e tests coverage
+- Make device health check timeout configurable per plugin (see [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118))
 
 #### GA