Skip to content

Conversation

VietND96
Copy link
Member

@VietND96 VietND96 commented Aug 9, 2025

User description

🔗 Related Issues

Part of SeleniumHQ/docker-selenium#1849
Restructuring classes to get ready for the next step in the implementation of the external datastore support

💥 What does this PR do?

No new functions added or removed, just restructuring.

Separation of Complex Logic

  • Breaking down functions in a single LocalDistributor class
    • Delegating node management logic to LocalNodeRegistry
    • Session distribution logic remains under LocalDistributor
    • GridModel is updated to LocalGridModel

Clean Separation of Concerns
By restructuring these classes to bring all stateful functions under distributor/local, try to create a clear distinction between:

  • Interface definitions that define behavior (GridModel, NodeRegistry)
  • Local implementations that store state in-memory (LocalGridModel, LocalNodeRegistry, LocalDistributor).

Get ready for external data store support

  • Implementation of new packages (e.g., distributor/redis or distributor/jdbc) without modifying core components
  • Common interface definitions shared across all implementations
  • Plug-and-play storage backends through consistent interfaces

Get ready for high availability support
Once the external datastore can be implemented, it would enable multiple distributor replicas to run simultaneously:

  • Externalize all state to shared storage systems (Redis, JDBC, etc.)
  • Implement distributed locking for coordinated operations between replicas

Architectural Improvements

  • Dependency Inversion: Higher-level modules depend on abstractions, not concrete implementations
  • Extension Points: Clear extension points for new storage backends
    Testability: Easier mocking of components for testing
    Scalability: Path to horizontal scaling of distributor component

🔧 Implementation Notes

💡 Additional Considerations

🔄 Types of changes

  • Cleanup (formatting, renaming)
  • Bug fix (backwards compatible)
  • New feature (non-breaking change which adds functionality and tests!)
  • Breaking change (fix or feature that would cause existing functionality to change)

PR Type

Enhancement


Description

  • Restructure LocalDistributor to separate node management from session distribution

  • Extract node registry logic into LocalNodeRegistry interface

  • Convert GridModel to abstract class with LocalGridModel implementation

  • Prepare architecture for external datastore support


Diagram Walkthrough

flowchart LR
  A["LocalDistributor"] --> B["NodeRegistry Interface"]
  B --> C["LocalNodeRegistry"]
  C --> D["LocalGridModel"]
  A --> E["Session Distribution Logic"]
  F["GridModel Abstract"] --> D
Loading

File Walkthrough

Relevant files
Enhancement
GridModel.java
Convert GridModel to abstract interface                                   

java/src/org/openqa/selenium/grid/distributor/GridModel.java

  • Convert concrete class to abstract class
  • Define abstract methods for node management operations
  • Remove all implementation details and state management
  • Add comprehensive method documentation
+70/-489
NodeRegistry.java
Add NodeRegistry interface definition                                       

java/src/org/openqa/selenium/grid/distributor/NodeRegistry.java

  • Create new interface for node registry operations
  • Define methods for node lifecycle management
  • Include health checking and availability management
  • Add session and slot management methods
+148/-0 
LocalDistributor.java
Delegate node management to NodeRegistry                                 

java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java

  • Replace direct node management with NodeRegistry delegation
  • Remove GridModel instantiation and node storage
  • Simplify session creation and slot reservation logic
  • Clean up health check and node lifecycle code
+38/-293
LocalGridModel.java
Implement LocalGridModel with original logic                         

java/src/org/openqa/selenium/grid/distributor/local/LocalGridModel.java

  • Move GridModel implementation from abstract to concrete class
  • Implement all abstract methods with original logic
  • Maintain thread-safe node state management
  • Handle session and slot operations
+544/-0 
LocalNodeRegistry.java
Implement LocalNodeRegistry with node management                 

java/src/org/openqa/selenium/grid/distributor/local/LocalNodeRegistry.java

  • Implement NodeRegistry interface with local storage
  • Manage node lifecycle and health checking
  • Handle event bus integration and scheduling
  • Provide thread-safe node operations
+536/-0 

@VietND96 VietND96 requested review from shs96c and diemol August 9, 2025 17:45
@selenium-ci selenium-ci added B-grid Everything grid and server related C-java Java Bindings labels Aug 9, 2025
Copy link
Contributor

qodo-merge-pro bot commented Aug 9, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ❌

1849 - Not compliant

Non-compliant requirements:

  • Return a clear, catchable error when proxy server refuses connections
  • Provide a programmatic detection mechanism for proxy refusal without DOM scraping
  • Python-specific handling or cross-language consistency
⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The method getNodeFromURI now relies on nodeRegistry.getAvailableNodes() which filters out DOWN/DRAINING nodes. If a session is created and the node transitions to a non-available state quickly, looking up by URI may fail to find the node for cleanup, leading to leaked sessions.

protected Node getNodeFromURI(URI uri) {
  Lock readLock = this.lock.readLock();
  readLock.lock();
  try {
    Set<NodeStatus> nodes = nodeRegistry.getAvailableNodes();
    Optional<NodeStatus> nodeStatus =
        nodes.stream().filter(node -> node.getExternalUri().equals(uri)).findFirst();
    return nodeStatus.map(status -> nodeRegistry.getNode(status.getNodeId())).orElse(null);
  } finally {
Concurrency

Health checks are scheduled both via a global fixed-rate task and additionally per-node scheduling stored in scheduledHealthChecks. Ensure no duplicate execution or race with runHealthChecks() that iterates allChecks concurrently; verify visibility and cancellation semantics to avoid double-running checks.

  nodes.put(node.getId(), node);
  model.add(initialNodeStatus);
  ScheduledFuture<?> future =
      nodeHealthCheckService.scheduleAtFixedRate(
          GuardedRunnable.guard(healthCheck),
          healthcheckInterval.toMillis(),
          healthcheckInterval.toMillis(),
          TimeUnit.MILLISECONDS);
  allChecks.put(node.getId(), healthCheck);
  scheduledHealthChecks.put(node.getId(), future);
} finally {
Behavior Change

isNotSupported no longer checks node availability explicitly. It depends on getAvailableNodes() to filter. Confirm that getAvailableNodes() semantics match previous UP-only behavior in all cases to avoid offering capacity from nodes that are not actually available.

private boolean isNotSupported(Capabilities caps) {
  return getAvailableNodes().stream().noneMatch(node -> node.hasCapability(caps, slotMatcher));
}

Copy link
Contributor

qodo-merge-pro bot commented Aug 9, 2025

PR Code Suggestions ✨

Latest suggestions up to f604877

CategorySuggestion                                                                                                                                    Impact
Incremental [*]
Process all health check batches
Suggestion Impact:The commit replaced the limited parallelStream().limit(maxConcurrentBatches) approach with a custom executor-based processing that submits all batches, ensuring every batch is processed while controlling concurrency. This addresses the issue of batches being dropped.

code diff:

@@ -402,6 +412,42 @@
     } catch (RuntimeException e) {
       return false;
     }
+  }
+
+  private void processBatchesInParallel(List<List<Runnable>> batches) {
+    if (batches.isEmpty()) {
+      return;
+    }
+
+    // Process all batches with controlled parallelism
+    batches.forEach(
+        batch ->
+            nodeHealthCheckExecutor.submit(
+                () ->
+                    batch.parallelStream()
+                        .forEach(
+                            r -> {
+                              try {
+                                r.run();
+                              } catch (Throwable t) {
+                                LOG.log(
+                                    getDebugLogLevel(),
+                                    "Health check execution failed in batch",
+                                    t);
+                              }
+                            })));
+  }
+
+  private static List<List<Runnable>> partition(List<Runnable> list, int size) {
+    List<List<Runnable>> batches = new ArrayList<>();
+    if (list.isEmpty() || size <= 0) {
+      return batches;
+    }
+    for (int i = 0; i < list.size(); i += size) {
+      int end = Math.min(i + size, list.size());
+      batches.add(new ArrayList<>(list.subList(i, end)));
+    }
+    return batches;
   }

Ensure all batches are processed by removing the limit(maxConcurrentBatches)
which drops excess batches, or implement a proper throttling mechanism that
still visits every batch

java/src/org/openqa/selenium/grid/distributor/local/LocalNodeRegistry.java [402-422]

 private void processBatchesInParallel(List<List<Runnable>> batches, int maxConcurrentBatches) {
   if (batches.isEmpty()) {
     return;
   }
 
-  // Process batches with controlled parallelism
   batches.parallelStream()
-      .limit(maxConcurrentBatches)
       .forEach(
           batch ->
               batch.parallelStream()
                   .forEach(
                       r -> {
                         try {
                           r.run();
                         } catch (Throwable t) {
-                          LOG.log(
-                              getDebugLogLevel(), "Health check execution failed in batch", t);
+                          LOG.log(getDebugLogLevel(), "Health check execution failed in batch", t);
                         }
                       }));
 }

[Suggestion processed]

Suggestion importance[1-10]: 9

__

Why: This suggestion identifies a critical bug where limit(maxConcurrentBatches) would cause health checks for some nodes to be skipped entirely, and the proposed fix correctly ensures all batches are processed.

High
Learned
best practice
Simplify heartbeat timeout math
Suggestion Impact:The commit implemented the inverse change of the suggestion: it replaced the simplified single multiplication by half the multiplier with the chained multipliedBy(...).dividedBy(2) form. This shows the code area and logic were directly affected, albeit in the opposite direction of the suggestion.

code diff:

         Instant lostTime =
-            lastTouched.plus(node.getHeartbeatPeriod().multipliedBy(PURGE_TIMEOUT_MULTIPLIER / 2));
+            lastTouched.plus(
+                node.getHeartbeatPeriod().multipliedBy(PURGE_TIMEOUT_MULTIPLIER).dividedBy(2));

Use consistent and clearer time calculations; avoid chained multiply/divide on
Duration which can be error-prone. Replace with a single multiplication by half
the multiplier to match intent and readability.

java/src/org/openqa/selenium/grid/distributor/local/LocalGridModel.java [245-247]

 Instant lostTime =
-    lastTouched.plus(
-        node.getHeartbeatPeriod().multipliedBy(PURGE_TIMEOUT_MULTIPLIER).dividedBy(2));
+    lastTouched.plus(node.getHeartbeatPeriod().multipliedBy(PURGE_TIMEOUT_MULTIPLIER / 2));

[Suggestion processed]

Suggestion importance[1-10]: 6

__

Why:
Relevant best practice - Fix syntax errors, typos, and naming inconsistencies to maintain code quality.

Low
Guard updates for unknown nodes

Validate that the node exists before updating its availability to avoid
operating on unknown nodes. Add a guard that checks nodes.containsKey(id) and
log/return early if not found.

java/src/org/openqa/selenium/grid/distributor/local/LocalNodeRegistry.java [274-289]

 public void updateNodeAvailability(URI nodeUri, NodeId id, Availability availability) {
   Require.nonNull("Node URI", nodeUri);
   Require.nonNull("Node ID", id);
   Require.nonNull("Availability", availability);
   Lock writeLock = lock.writeLock();
   writeLock.lock();
   try {
+    if (!nodes.containsKey(id)) {
+      LOG.log(Level.FINE, String.format("Ignoring availability update for unknown node %s", id));
+      return;
+    }
     LOG.log(
         getDebugLogLevel(),
         String.format("Health check result for %s was %s", nodeUri, availability));
     model.setAvailability(id, availability);
     model.updateHealthCheckCount(id, availability);
   } finally {
     writeLock.unlock();
   }
 }
  • Apply / Chat
Suggestion importance[1-10]: 5

__

Why:
Relevant best practice - Add null checks and validate parameters before use to prevent NullPointerException and similar runtime errors.

Low
  • Update

Previous suggestions

✅ Suggestions up to commit 7ef687a
CategorySuggestion                                                                                                                                    Impact
Incremental [*]
Cancel tasks and free resources
Suggestion Impact:The commit modified close() to acquire the write lock, clear maps, and close RemoteNodes under the lock. While it did not cancel ScheduledFutures (it removed the scheduling and the scheduledHealthChecks map entirely), it addressed the resource/leak concern by clearing allChecks and nodes and properly closing RemoteNodes during shutdown.

code diff:

@@ -535,5 +571,25 @@
   @Override
   public void close() {
     LOG.info("Shutting down LocalNodeRegistry");
+    Lock writeLock = lock.writeLock();
+    writeLock.lock();
+    try {
+      allChecks.clear();
+      nodes
+          .values()
+          .forEach(
+              n -> {
+                if (n instanceof RemoteNode) {
+                  try {
+                    ((RemoteNode) n).close();
+                  } catch (Exception e) {
+                    LOG.log(Level.WARNING, "Unable to close node properly: " + e.getMessage());
+                  }
+                }
+              });
+      nodes.clear();
+    } finally {
+      writeLock.unlock();
+    }
   }

Ensure scheduled tasks are cancelled and resources are released on close; cancel
per-node futures and clear maps to prevent thread and memory leaks

java/src/org/openqa/selenium/grid/distributor/local/LocalNodeRegistry.java [535-539]

 @Override
 public void close() {
   LOG.info("Shutting down LocalNodeRegistry");
+  Lock writeLock = lock.writeLock();
+  writeLock.lock();
+  try {
+    scheduledHealthChecks.values().forEach(f -> {
+      if (f != null) {
+        f.cancel(false);
+      }
+    });
+    scheduledHealthChecks.clear();
+    allChecks.clear();
+    nodes.values().forEach(n -> {
+      if (n instanceof RemoteNode) {
+        try {
+          ((RemoteNode) n).close();
+        } catch (Exception e) {
+          LOG.log(Level.WARNING, "Unable to close node properly: " + e.getMessage());
+        }
+      }
+    });
+    nodes.clear();
+  } finally {
+    writeLock.unlock();
+  }
 }

[Suggestion processed]

Suggestion importance[1-10]: 9

__

Why: The suggestion correctly points out a critical resource leak where scheduled health checks are not cancelled on close, which would prevent proper shutdown and cause memory leaks.

High
Eliminate duplicate healthcheck scheduling

Prevent running both global periodic health checks and per-node scheduled checks
simultaneously; choose one mechanism to avoid double healthcheck execution
causing noisy state flaps

java/src/org/openqa/selenium/grid/distributor/local/LocalNodeRegistry.java [132-136]

-this.nodeHealthCheckService.scheduleAtFixedRate(
-    GuardedRunnable.guard(this::runHealthChecks),
-    healthcheckInterval.toMillis(),
-    healthcheckInterval.toMillis(),
-    TimeUnit.MILLISECONDS);
+// Use only per-node scheduled health checks; remove global scheduler
+// this.nodeHealthCheckService.scheduleAtFixedRate(
+//     GuardedRunnable.guard(this::runHealthChecks),
+//     healthcheckInterval.toMillis(),
+//     healthcheckInterval.toMillis(),
+//     TimeUnit.MILLISECONDS);
Suggestion importance[1-10]: 8

__

Why: This suggestion correctly identifies that health checks are scheduled twice, which could lead to race conditions and performance issues, and proposes a valid fix.

Medium
Learned
best practice
Add parameter null check

Validate that the provided id is not null before accessing the map to avoid
potential NullPointerException. Add a guard clause using Require.nonNull or an
explicit null check.

java/src/org/openqa/selenium/grid/distributor/local/LocalNodeRegistry.java [367-369]

 public Node getNode(NodeId id) {
+  Require.nonNull("Node ID", id);
   return nodes.get(id);
 }
Suggestion importance[1-10]: 6

__

Why:
Relevant best practice - Add null checks and validate parameters before use to prevent NullPointerException.

Low
General
Clarify purge timeout calculations

Avoid the chained arithmetic for half the timeout; it is easy to misread and
risks integer rounding surprises if the multiplier changes. Compute the
half-timeout explicitly to preserve intent and clarity. This also prevents
accidental overflow if the multiplier is adjusted.

java/src/org/openqa/selenium/grid/distributor/local/LocalGridModel.java [245-249]

-Instant lostTime =
-    lastTouched.plus(
-        node.getHeartbeatPeriod().multipliedBy(PURGE_TIMEOUT_MULTIPLIER).dividedBy(2));
-Instant deadTime =
-    lastTouched.plus(node.getHeartbeatPeriod().multipliedBy(PURGE_TIMEOUT_MULTIPLIER));
+Duration heartbeat = node.getHeartbeatPeriod();
+Duration halfTimeout = heartbeat.multipliedBy(PURGE_TIMEOUT_MULTIPLIER).dividedBy(2);
+Duration fullTimeout = heartbeat.multipliedBy(PURGE_TIMEOUT_MULTIPLIER);
+Instant lostTime = lastTouched.plus(halfTimeout);
+Instant deadTime = lastTouched.plus(fullTimeout);
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly proposes extracting complex Duration calculations into named variables, which improves code readability and maintainability.

Low
✅ Suggestions up to commit c7aeee4
CategorySuggestion                                                                                                                                    Impact
High-level
Duplicated health-check scheduling

Node health checks are scheduled both inside LocalNodeRegistry (per-node
ScheduledFuture) and also via a fixed-rate executor created in LocalDistributor
previously; this restructuring removed the latter but retained periodic
scheduling plus a global runHealthChecks, leading to potential double execution
if extended later or misused. Consolidate to a single mechanism (either per-node
ScheduledFuture or a single periodic sweep) to avoid duplicate checks, race
conditions, and resource leaks across different registry implementations.

Examples:

java/src/org/openqa/selenium/grid/distributor/local/LocalNodeRegistry.java [132-136]
this.nodeHealthCheckService.scheduleAtFixedRate(
    GuardedRunnable.guard(this::runHealthChecks),
    healthcheckInterval.toMillis(),
    healthcheckInterval.toMillis(),
    TimeUnit.MILLISECONDS);
java/src/org/openqa/selenium/grid/distributor/local/LocalNodeRegistry.java [208-215]
ScheduledFuture<?> future =
    nodeHealthCheckService.scheduleAtFixedRate(
        GuardedRunnable.guard(healthCheck),
        healthcheckInterval.toMillis(),
        healthcheckInterval.toMillis(),
        TimeUnit.MILLISECONDS);
allChecks.put(node.getId(), healthCheck);
scheduledHealthChecks.put(node.getId(), future);

Solution Walkthrough:

Before:

class LocalNodeRegistry {
  LocalNodeRegistry(...) {
    // A global sweep of all health checks is scheduled
    this.nodeHealthCheckService.scheduleAtFixedRate(
        GuardedRunnable.guard(this::runHealthChecks), ...);
  }

  public void add(Node node) {
    // ...
    // An individual health check is also scheduled for each new node
    ScheduledFuture<?> future = nodeHealthCheckService.scheduleAtFixedRate(
        GuardedRunnable.guard(healthCheck), ...);
    scheduledHealthChecks.put(node.getId(), future);
  }
}

After:

class LocalNodeRegistry {
  LocalNodeRegistry(...) {
    // Keep the global sweep of all health checks
    this.nodeHealthCheckService.scheduleAtFixedRate(
        GuardedRunnable.guard(this::runHealthChecks), ...);
  }

  public void add(Node node) {
    // ...
    // Remove the duplicated, per-node scheduling
    allChecks.put(node.getId(), healthCheck);
    // The global sweep will pick up the new check from `allChecks`
  }
}
Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a significant design flaw in LocalNodeRegistry where health checks are scheduled twice, leading to redundant execution, increased load, and potential race conditions.

High
Possible issue
Use concurrent maps for health checks
Suggestion Impact:The commit uses ConcurrentHashMap for allChecks (line 15 shows it defined as ConcurrentHashMap). Additionally, it removes the scheduled health checks map entirely and related scheduling/cancellation logic, which eliminates the thread-safety concern for that map. Thus, the suggestion to use concurrent maps impacted the commit: one map was made concurrent and the other was removed.

code diff:

@@ -83,7 +83,6 @@
   private final GridModel model;
   private final Map<NodeId, Node> nodes;
   private final Map<NodeId, Runnable> allChecks = new ConcurrentHashMap<>();
-  private final Map<NodeId, ScheduledFuture<?>> scheduledHealthChecks = new ConcurrentHashMap<>();
   private final ReadWriteLock lock = new ReentrantReadWriteLock(/* fair */ true);
   private final ScheduledExecutorService nodeHealthCheckService;
   private final Duration purgeNodesInterval;
@@ -205,14 +204,7 @@
       try {
         nodes.put(node.getId(), node);
         model.add(initialNodeStatus);
-        ScheduledFuture<?> future =
-            nodeHealthCheckService.scheduleAtFixedRate(
-                GuardedRunnable.guard(healthCheck),
-                healthcheckInterval.toMillis(),
-                healthcheckInterval.toMillis(),
-                TimeUnit.MILLISECONDS);
         allChecks.put(node.getId(), healthCheck);
-        scheduledHealthChecks.put(node.getId(), future);

Access to allChecks and scheduledHealthChecks occurs under locks but the maps
themselves are non-concurrent. Replace them with concurrent maps to avoid
accidental unsynchronized access paths and reduce risk of race conditions during
listener callbacks. This is critical since listeners and scheduled tasks can run
concurrently.

java/src/org/openqa/selenium/grid/distributor/local/LocalNodeRegistry.java [85-86]

-private final Map<NodeId, Runnable> allChecks = new HashMap<>();
-private final Map<NodeId, ScheduledFuture<?>> scheduledHealthChecks = new HashMap<>();
+private final Map<NodeId, Runnable> allChecks = new ConcurrentHashMap<>();
+private final Map<NodeId, ScheduledFuture<?>> scheduledHealthChecks = new ConcurrentHashMap<>();
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that using ConcurrentHashMap for allChecks and scheduledHealthChecks improves robustness and maintainability in a highly concurrent class, reducing the risk of future errors.

Low
Fix integer division in timing
Suggestion Impact:The commit updated the lostTime calculation to multiply by PURGE_TIMEOUT_MULTIPLIER before dividing by 2, preventing integer truncation. The deadTime line remained unchanged in the patch hunk, but the key issue (lostTime integer division) was addressed.

code diff:

         Instant lostTime =
-            lastTouched.plus(node.getHeartbeatPeriod().multipliedBy(PURGE_TIMEOUT_MULTIPLIER / 2));
+            lastTouched.plus(
+                node.getHeartbeatPeriod().multipliedBy(PURGE_TIMEOUT_MULTIPLIER).dividedBy(2));

The division PURGE_TIMEOUT_MULTIPLIER / 2 performs integer division; for odd
values it truncates and may degrade timing logic. Compute using long arithmetic
to preserve intended scaling, e.g., multiply first or use exact constants.

java/src/org/openqa/selenium/grid/distributor/local/LocalGridModel.java [245-248]

 Instant lostTime =
-    lastTouched.plus(node.getHeartbeatPeriod().multipliedBy(PURGE_TIMEOUT_MULTIPLIER / 2));
+    lastTouched.plus(node.getHeartbeatPeriod().multipliedBy((long) PURGE_TIMEOUT_MULTIPLIER).dividedBy(2));
 Instant deadTime =
-    lastTouched.plus(node.getHeartbeatPeriod().multipliedBy(PURGE_TIMEOUT_MULTIPLIER));
+    lastTouched.plus(node.getHeartbeatPeriod().multipliedBy((long) PURGE_TIMEOUT_MULTIPLIER));

[Suggestion processed]

Suggestion importance[1-10]: 5

__

Why: The suggestion correctly points out that integer division on PURGE_TIMEOUT_MULTIPLIER could lead to precision loss, and using long arithmetic or floating-point division would make the timing calculation more robust.

Low
Learned
best practice
Cancel tasks and close nodes
Suggestion Impact:The close() method was updated to acquire the write lock, clear checks, and close RemoteNode instances. However, cancellation of ScheduledFutures was not implemented; in fact, the code removed tracking of ScheduledFutures entirely. So the commit partially implemented the shutdown/cleanup aspect (closing nodes and clearing checks) but did not cancel scheduled tasks as suggested.

code diff:

@@ -532,5 +571,25 @@
   @Override
   public void close() {
     LOG.info("Shutting down LocalNodeRegistry");
+    Lock writeLock = lock.writeLock();
+    writeLock.lock();
+    try {
+      allChecks.clear();
+      nodes
+          .values()
+          .forEach(
+              n -> {
+                if (n instanceof RemoteNode) {
+                  try {
+                    ((RemoteNode) n).close();
+                  } catch (Exception e) {
+                    LOG.log(Level.WARNING, "Unable to close node properly: " + e.getMessage());
+                  }
+                }
+              });
+      nodes.clear();
+    } finally {
+      writeLock.unlock();
+    }
   }

Ensure scheduled tasks are cancelled and nodes are closed on shutdown. Cancel
futures in scheduledHealthChecks, clear listeners if needed, and close any
RemoteNode instances to prevent resource leaks.

java/src/org/openqa/selenium/grid/distributor/local/LocalNodeRegistry.java [532-535]

 @Override
 public void close() {
   LOG.info("Shutting down LocalNodeRegistry");
+  Lock writeLock = lock.writeLock();
+  writeLock.lock();
+  try {
+    // Cancel scheduled health checks
+    for (ScheduledFuture<?> future : scheduledHealthChecks.values()) {
+      if (future != null) {
+        future.cancel(false);
+      }
+    }
+    scheduledHealthChecks.clear();
+    allChecks.clear();
+    // Close remote nodes
+    for (Node node : nodes.values()) {
+      if (node instanceof RemoteNode) {
+        try {
+          ((RemoteNode) node).close();
+        } catch (Exception e) {
+          LOG.log(Level.WARNING, "Unable to close node properly: " + e.getMessage());
+        }
+      }
+    }
+    nodes.clear();
+  } finally {
+    writeLock.unlock();
+  }
 }
Suggestion importance[1-10]: 6

__

Why:
Relevant best practice - Properly dispose and clean up scheduled tasks and resources to prevent leaks.

Low
Add parameter null validation
Suggestion Impact:The commit added Require.nonNull checks for Node URI, Node ID, and Availability at the start of updateNodeAvailability, matching the suggestion.

code diff:

+    Require.nonNull("Node URI", nodeUri);
+    Require.nonNull("Node ID", id);
+    Require.nonNull("Availability", availability);

Validate inputs before using them. Add null checks for nodeUri, id, and
availability to avoid potential NullPointerExceptions during logging and model
updates.

java/src/org/openqa/selenium/grid/distributor/local/LocalNodeRegistry.java [292-304]

 public void updateNodeAvailability(URI nodeUri, NodeId id, Availability availability) {
+  Require.nonNull("Node URI", nodeUri);
+  Require.nonNull("Node ID", id);
+  Require.nonNull("Availability", availability);
   Lock writeLock = lock.writeLock();
   writeLock.lock();
   try {
     LOG.log(
         getDebugLogLevel(),
         String.format("Health check result for %s was %s", nodeUri, availability));
     model.setAvailability(id, availability);
     model.updateHealthCheckCount(id, availability);
   } finally {
     writeLock.unlock();
   }
 }

[Suggestion processed]

Suggestion importance[1-10]: 5

__

Why:
Relevant best practice - Add null checks and validation for parameters before use to prevent NullPointerException.

Low

@VietND96 VietND96 force-pushed the local-distributor branch from c7aeee4 to c7e2984 Compare August 9, 2025 19:11
Copy link
Member

@diemol diemol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it makes sense to me. Thank you for this.

Please have a look at the AI reviewer suggestions and see if they make sense or not.

Signed-off-by: Viet Nguyen Duc <[email protected]>
@VietND96
Copy link
Member Author

@diemol on AI review points that I fixed

  • Use concurrent maps for health checks
  • Add parameter null validation
  • Fix integer division in timing

On point "Duplicated health-check scheduling" - I am thinking another improvement on other PR.
Assume the Distributor manages 1000 Nodes, health checks performed on those Nodes would consume high resources. I intend to improve health checks as below strategy

  • 1st sort out and priority list of Nodes which have at least 1 available slot (catch up latest health status of those would help process to distribute new request to available Nodes ASAP and reliably). The rest of the Nodes (most are full slots, or DOWN) will check the following.
  • Apply Atomic Batch Processing for efficient resource allocation and faster.

@VietND96 VietND96 changed the title [grid] Restructuring classes have stateful data in LocalDistributor [grid] Restructuring classes have stateful data and improve Node health checks in LocalDistributor Aug 11, 2025
@VietND96 VietND96 merged commit 2dad5fa into trunk Aug 11, 2025
31 checks passed
@VietND96 VietND96 deleted the local-distributor branch August 11, 2025 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-grid Everything grid and server related C-java Java Bindings Review effort 4/5
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants