Skip to content

Conversation

VietND96
Copy link
Member

@VietND96 VietND96 commented Mar 4, 2025

User description

Thanks for contributing to Selenium!
A PR well described will help maintainers to quickly review and merge it

Before submitting your PR, please check our contributing guidelines.
Avoid large PRs, help reviewers by making them as simple and short as possible.

Motivation and Context

Reproduce environment:

  • Grid runs in containers, where multiple Nodes continuously attach/detach from the Hub. With a combination config of SE_NODE_MAX_SESSIONS > 1 and SE_DRAIN_AFTER_SESSION_COUNT > 1
  • When SE_DRAIN_AFTER_SESSION_COUNT, the same container spawns up again with the same IP assigned, which results in a situation of NodeRestartedEvent
  • Exception trace as below

tests-1 | Build info: version: '4.35.0-SNAPSHOT', revision: 'Unknown'
tests-1 | System info: os.name: 'Linux', os.arch: 'aarch64', os.version: '6.10.14-linuxkit', java.version: '21.0.8'
tests-1 | Driver info: driver.version: unknown
tests-1 | at org.openqa.selenium.grid.sessionmap.local.LocalSessionMap.get(LocalSessionMap.java:122)
tests-1 | at org.openqa.selenium.grid.sessionmap.SessionMap.getUri(SessionMap.java:84)
tests-1 | at org.openqa.selenium.grid.router.HandleSession.lambda$loadSessionId$4(HandleSession.java:225)
tests-1 | at io.opentelemetry.context.Context.lambda$wrap$2(Context.java:253)

Fixed race conditions in LocalSessionMap during concurrent node events by implementing thread-safe session management with read-write locks and defensive event handling.

Race Condition Fix

  • Problem: Concurrent NodeRemovedEvent and NodeRestartedEvent during node shutdowns could cause thread safety issues when multiple events tried to remove the same sessions simultaneously.

  • Solution:

    • Added ReentrantReadWriteLock to coordinate concurrent access to session storage
    • Read locks for session lookups (allowing concurrent reads)
    • Write locks for session modifications (exclusive access)
    • Batch removal with defensive checks prevents errors when sessions are already removed by concurrent events

Improve logging of Local Session Map

  • Remove a single session message with the attached sessionId and nodeUri
  • Remove all sessions by NodeRemovedEvent or NodeRestartedEvent message with the attached event class name triggered and nodeUri

06:16:10.933 INFO [GridModel.release] - Releasing slot for session id 32bc8a19-be78-4f11-8d80-39022e723160
06:16:10.934 INFO [LocalSessionMap.remove] - Deleted session from local Session Map, Id: 32bc8a19-be78-4f11-8d80-39022e723160, Node: http://172.18.0.10:5555
06:16:11.943 INFO [LocalSessionMap.batchRemoveByUri] - Event org.openqa.selenium.grid.data.NodeRemovedEvent triggered batch remove from local Session Map for Node http://172.18.0.10:5555
06:16:11.944 INFO [LocalSessionMap.batchRemoveByUri] - No sessions found to remove from local Session Map for Node http://172.18.0.10:5555
06:16:21.169 INFO [LocalSessionMap.remove] - Deleted session from local Session Map, Id: 96619a55bef67d0089d3184187479071, Node: http://172.18.0.5:5555

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

PR Type

Bug fix


Description

  • Replaced NodeRemovedEvent with node.drain() in NodeServer shutdown hook.

  • Ensures proper session handling during node shutdown.

  • Prevents unexpected session removal during server glitches or restarts.


Changes walkthrough 📝

Relevant files
Bug fix
NodeServer.java
Updated NodeServer shutdown hook for graceful shutdown     

java/src/org/openqa/selenium/grid/node/httpd/NodeServer.java

  • Removed NodeRemovedEvent firing in the shutdown hook.
  • Added node.drain() call to handle node draining during shutdown.
  • Improved session management and shutdown behavior.
  • +1/-3     

    Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.
  • Copy link
    Contributor

    qodo-merge-pro bot commented Mar 4, 2025

    PR Reviewer Guide 🔍

    (Review updated until commit 6f98efc)

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Resource Leak

    The maintenance executor is created but may not be properly closed in all scenarios. While the close() method exists, there's no guarantee it will be called, potentially leading to thread leaks. Consider implementing proper lifecycle management or using try-with-resources pattern.

    this.maintenanceExecutor =
        Executors.newSingleThreadScheduledExecutor(
            r -> {
              Thread t = new Thread(r, "LocalSessionMap-Maintenance");
              t.setDaemon(true);
              return t;
            });
    
    // Schedule maintenance cleanup every 5 minutes to prevent memory leaks
    this.maintenanceExecutor.scheduleWithFixedDelay(
        this::performMaintenanceCleanup,
        5, // initial delay
        5, // period
        TimeUnit.MINUTES);
    Complex Synchronization

    The IndexedSessionMap uses a single coordination lock for all operations that need to maintain consistency between the two maps. This could become a bottleneck under high concurrency and the synchronization logic is complex with potential for deadlocks or performance issues.

      private static class IndexedSessionMap {
        private final ConcurrentMap<SessionId, Session> sessions = new ConcurrentHashMap<>();
        private final ConcurrentMap<URI, Set<SessionId>> sessionsByUri = new ConcurrentHashMap<>();
        // Simplified lock strategy: only for operations requiring coordination between both maps
        private final Object coordinationLock = new Object();
    
        public Session get(SessionId id) {
          // ConcurrentHashMap is already thread-safe for reads - no lock needed
          return sessions.get(id);
        }
    
        public Session put(SessionId id, Session session) {
          // Synchronize only the coordination between maps, not individual map operations
          synchronized (coordinationLock) {
            Session previous = sessions.put(id, session);
    
            // Clean up previous session's URI index if it exists
            if (previous != null && previous.getUri() != null) {
              cleanupUriIndex(previous.getUri(), id);
            }
    
            // Add new session to URI index
            URI sessionUri = session.getUri();
            if (sessionUri != null) {
              sessionsByUri.computeIfAbsent(sessionUri, k -> ConcurrentHashMap.newKeySet()).add(id);
            }
    
            return previous;
          }
        }
    
        public Session remove(SessionId id) {
          // Synchronize only the coordination between maps
          synchronized (coordinationLock) {
            Session removed = sessions.remove(id);
    
            // Clean up URI index if session existed
            if (removed != null && removed.getUri() != null) {
              cleanupUriIndex(removed.getUri(), id);
            }
    
            return removed;
          }
        }
    
        public void batchRemove(Set<SessionId> sessionIds) {
          // Synchronize the entire batch operation for consistency
          synchronized (coordinationLock) {
            Map<URI, Set<SessionId>> uriToSessionIds = new HashMap<>();
    
            // Collect URI mappings for sessions that exist
            for (SessionId id : sessionIds) {
              Session session = sessions.get(id);
              if (session != null && session.getUri() != null) {
                uriToSessionIds.computeIfAbsent(session.getUri(), k -> new HashSet<>()).add(id);
              }
            }
    
            // Remove all sessions from main map
            for (SessionId id : sessionIds) {
              sessions.remove(id);
            }
    
            // Clean up URI index for all affected URIs
            for (Map.Entry<URI, Set<SessionId>> entry : uriToSessionIds.entrySet()) {
              URI uri = entry.getKey();
              Set<SessionId> idsToRemove = entry.getValue();
              cleanupUriIndex(uri, idsToRemove);
            }
          }
        }
    
        /**
         * Robust cleanup of URI index to prevent memory leaks from empty sets. Handles single session
         * removal with explicit empty set cleanup. Uses atomic operations to prevent race conditions.
         * Note: This method should only be called from within coordinationLock synchronized blocks.
         */
        private void cleanupUriIndex(URI uri, SessionId sessionId) {
          // Use computeIfPresent for atomic operation to prevent race conditions
          sessionsByUri.computeIfPresent(
              uri,
              (key, sessionIds) -> {
                sessionIds.remove(sessionId);
                // Return null to remove the entry if empty, otherwise return the modified set
                return sessionIds.isEmpty() ? null : sessionIds;
              });
        }
    
        /**
         * Robust cleanup of URI index to prevent memory leaks from empty sets. Handles batch session
         * removal with explicit empty set cleanup. Uses atomic operations to prevent race conditions.
         * Note: This method should only be called from within coordinationLock synchronized blocks.
         */
        private void cleanupUriIndex(URI uri, Set<SessionId> sessionIdsToRemove) {
          // Use computeIfPresent for atomic operation to prevent race conditions
          sessionsByUri.computeIfPresent(
              uri,
              (key, sessionIds) -> {
                sessionIds.removeAll(sessionIdsToRemove);
                // Return null to remove the entry if empty, otherwise return the modified set
                return sessionIds.isEmpty() ? null : sessionIds;
              });
        }
    
        /**
         * Periodic cleanup to remove any empty sets that may have been missed. Should be called
         * periodically to prevent memory leaks. Enhanced with better error handling and logging.
         */
        public void performMaintenanceCleanup() {
          synchronized (coordinationLock) {
            try {
              int initialSize = sessionsByUri.size();
              int emptyUrisRemoved = 0;
    
              // Use iterator to safely remove empty entries during iteration
              var iterator = sessionsByUri.entrySet().iterator();
              while (iterator.hasNext()) {
                var entry = iterator.next();
                if (entry.getValue().isEmpty()) {
                  iterator.remove();
                  emptyUrisRemoved++;
                }
              }
    
              if (emptyUrisRemoved > 0) {
                LOG.info(
                    String.format(
                        "Maintenance cleanup removed %d empty URI entries from sessionsByUri map (size:"
                            + " %d -> %d)",
                        emptyUrisRemoved, initialSize, sessionsByUri.size()));
              }
            } catch (Exception e) {
              LOG.warning("Error during maintenance cleanup: " + e.getMessage());
            }
          }
        }
    
        public Set<SessionId> getSessionsByUri(URI uri) {
          // ConcurrentHashMap is already thread-safe for reads - no lock needed
          Set<SessionId> result = sessionsByUri.get(uri);
          // Return empty set instead of null, and ensure we don't return empty sets
          return (result != null && !result.isEmpty()) ? result : Set.of();
        }
    
        public Set<Map.Entry<SessionId, Session>> entrySet() {
          // ConcurrentHashMap provides thread-safe iteration - no lock needed
          // Create defensive copy to prevent external modification
          return new HashSet<>(sessions.entrySet());
        }
    
        public Collection<Session> values() {
          // ConcurrentHashMap provides thread-safe iteration - no lock needed
          // Create defensive copy to prevent external modification
          return new ArrayList<>(sessions.values());
        }
    
        public int size() {
          // ConcurrentHashMap size() is already thread-safe
          return sessions.size();
        }
    
        public boolean isEmpty() {
          // ConcurrentHashMap isEmpty() is already thread-safe
          return sessions.isEmpty();
        }
    
        public void clear() {
          // Synchronize clearing both maps for consistency
          synchronized (coordinationLock) {
            sessions.clear();
            sessionsByUri.clear();
          }
        }
      }
    }
    Defensive Copy Overhead

    Methods like entrySet() and values() create defensive copies using HashSet and ArrayList constructors. This could be expensive for large session maps and may impact performance during frequent access operations.

    public Set<Map.Entry<SessionId, Session>> entrySet() {
      // ConcurrentHashMap provides thread-safe iteration - no lock needed
      // Create defensive copy to prevent external modification
      return new HashSet<>(sessions.entrySet());
    }
    
    public Collection<Session> values() {
      // ConcurrentHashMap provides thread-safe iteration - no lock needed
      // Create defensive copy to prevent external modification
      return new ArrayList<>(sessions.values());
    }

    @VietND96 VietND96 requested review from diemol and joerg1985 March 4, 2025 07:52
    Copy link
    Contributor

    qodo-merge-pro bot commented Mar 4, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix potential null pointer access
    Suggestion Impact:The commit addressed the null pointer issue by modifying the shutdown hook implementation. Instead of directly calling node.drain() which could cause a NullPointerException, it changed the implementation to fire a NodeRemovedEvent through the bus. This is a different approach than suggested but solves the same underlying issue.

    code diff:

    -  private final Thread shutdownHook = new Thread(() -> node.drain());
    +  private final Thread shutdownHook =
    +      new Thread(() -> bus.fire(new NodeRemovedEvent(node.getStatus())));

    The shutdown hook initializes with a lambda that uses the node field before it's
    set, which could cause a NullPointerException. Move the shutdown hook
    initialization after the node field is populated.

    java/src/org/openqa/selenium/grid/node/httpd/NodeServer.java [77-79]

     private Node node;
     private EventBus bus;
    -private final Thread shutdownHook = new Thread(() -> node.drain());
    +private Thread shutdownHook;

    [Suggestion has been applied]

    Suggestion importance[1-10]: 9

    __

    Why: The suggestion correctly identifies a critical issue where the shutdown hook lambda accesses the 'node' field before it's initialized, which could cause a NullPointerException. Moving the shutdown hook initialization after node assignment would prevent this potential crash.

    High

    [Generating...]

    @joerg1985
    Copy link
    Member

    IMHO, this will not fix #15347, will report details later in the original issue.

    @VietND96
    Copy link
    Member Author

    VietND96 commented Mar 4, 2025

    It might not fix 15347. However, sometimes I can reproduce with a strict condition, something like
    There are multiple concurrent sessions, and new Nodes are continuously registered to Hub, each Node set drain-after-session-count=3 (this might reproduce RestartedNode since container get restarted and back to Hub). The script steps in sequentially from create to quit().
    Do you think something proceeded asynchronous?

    selenium.common.exceptions.InvalidSessionIdException: Message: Unable to find session with ID: 9d7b409b-b311-470a-b227-1944cd05ac73
    Build info: version: '4.30.0-SNAPSHOT', revision: 'Unknown'
    System info: os.name: 'Linux', os.arch: 'aarch64', os.version: '6.12.5-linuxkit', java.version: '21.0.6'
    Driver info: driver.version: unknown
    Stacktrace:
    org.openqa.selenium.NoSuchSessionException: Unable to find session with ID: 9d7b409b-b311-470a-b227-1944cd05ac73
    Build info: version: '4.30.0-SNAPSHOT', revision: 'Unknown'
    System info: os.name: 'Linux', os.arch: 'aarch64', os.version: '6.12.5-linuxkit', java.version: '21.0.6'
    Driver info: driver.version: unknown
            at org.openqa.selenium.grid.sessionmap.local.LocalSessionMap.get(LocalSessionMap.java:117)
            at org.openqa.selenium.grid.sessionmap.SessionMap.getUri(SessionMap.java:84)
            at org.openqa.selenium.grid.router.HandleSession.lambda$loadSessionId$4(HandleSession.java:225)
            at io.opentelemetry.context.Context.lambda$wrap$2(Context.java:253)
            at org.openqa.selenium.grid.router.HandleSession.execute(HandleSession.java:182)
            at org.openqa.selenium.remote.http.Route$PredicatedRoute.handle(Route.java:397)
            at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
            at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
            at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
            at org.openqa.selenium.grid.router.Router.execute(Router.java:89)
            at org.openqa.selenium.grid.web.EnsureSpecCompliantResponseHeaders.lambda$apply$0(EnsureSpecCompliantResponseHeaders.java:34)
            at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
            at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
            at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
            at org.openqa.selenium.remote.AddWebDriverSpecHeaders.lambda$apply$0(AddWebDriverSpecHeaders.java:35)
            at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
            at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
            at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
            at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
            at org.openqa.selenium.netty.server.SeleniumHandler.lambda$channelRead0$0(SeleniumHandler.java:49)
            at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
            at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
            at java.base/java.lang.Thread.run(Thread.java:1583)
    
    
    

    @joerg1985
    Copy link
    Member

    Without looking deeper into this i would assume the following might happen:

    1. NodeRestartedEvent is raised
    2. A new session is started
    3. The NodeRestartedEvent.listener inside the LocalSessionMap i deleting the session, because it only checks the URL not the node id
    4. The new session is accessed -> The LocalSessionMap has killed it just before.

    Are you able to run a patched version of the Hub? If so, just remove the NodeRestartedEvent.listener inside the LocalSessionMap and see if you can still repoduce. If this is the case, the NodeRestartedEvent.listener inside all of the SessionMap's have to be reworked. But this is hard, as there are datebase changes involved ...

    @joerg1985
    Copy link
    Member

    @VietND96 could you share the hub logs when this happens?
    A NodeRestartedEvent is not rised at all in my local testcase with --drain-after-session-count=1

    @VietND96

    This comment was marked as outdated.

    @joerg1985
    Copy link
    Member

    Could you share more of the logs, it is allways good to see what happened some time before and concurrently by other sessions.

    @VietND96
    Copy link
    Member Author

    VietND96 commented Mar 4, 2025

    I attched full logs for id e33534f26859916b894c37fcb5da30df
    e33534f26859916b894c37fcb5da30df.txt
    I also shared my scripts to reproduce the issue https://github.com/NDViet/selenium-grid-stress-test (it is nice if you are using macOS or Ubuntu to run it).

    @VietND96 VietND96 changed the title [grid] Node server graceful shutdown [grid] Unable to find session with ID when Node drain-after-session-count Mar 4, 2025
    @VietND96 VietND96 added the B-grid Everything grid and server related label Mar 5, 2025
    @VietND96

    This comment was marked as outdated.

    @joerg1985
    Copy link
    Member

    joerg1985 commented Mar 6, 2025

    @joerg1985, do you think currentSessions.cleanUp() in drain() is correct?

    Yes, It should only perform allready outstanding maintenance operations.
    But i think there is no guarantee all outstanding calls to the removal listener are performed.

    When reading the caffeine issues, there are alot of 'there is no guarantee for this and that' statements.

    @VietND96 VietND96 changed the title [grid] Unable to find session with ID when Node drain-after-session-count [grid] Fix race condition in LocalSessionMap and improve logging Jul 28, 2025
    @VietND96 VietND96 requested a review from Copilot July 28, 2025 07:35
    Copilot

    This comment was marked as outdated.

    Copy link

    @Copilot Copilot AI left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Pull Request Overview

    This PR fixes a race condition in LocalSessionMap that occurred when multiple Node events (NodeRemovedEvent and NodeRestartedEvent) tried to remove sessions simultaneously, and improves logging for better visibility into session management operations.

    Key changes:

    • Introduced thread-safe session management using ReentrantReadWriteLock to prevent concurrent modification issues
    • Added IndexedSessionMap class with URI-to-SessionId indexing for efficient batch operations
    • Enhanced logging to include session IDs, node URIs, and event details for better debugging

    Signed-off-by: Viet Nguyen Duc <[email protected]>
    @VietND96 VietND96 changed the title [grid] Fix race condition in LocalSessionMap and improve logging [grid] Fix race condition and improve logging in LocalSessionMap Jul 28, 2025
    @VietND96
    Copy link
    Member Author

    /review

    Copy link
    Contributor

    Persistent review updated to latest commit db33027

    Signed-off-by: Viet Nguyen Duc <[email protected]>
    @VietND96
    Copy link
    Member Author

    /review

    Copy link
    Contributor

    Persistent review updated to latest commit 0a2f27c

    @VietND96
    Copy link
    Member Author

    /review

    Copy link
    Contributor

    Persistent review updated to latest commit 6f98efc

    Signed-off-by: Viet Nguyen Duc <[email protected]>
    @VietND96
    Copy link
    Member Author

    Ignore qodo review, since IndexedSessionMap is a custom ConcurrentMap implementation that maintains dual indexing for efficient session management in Grid:
    Primary Index: SessionId → Session mapping for direct session lookup
    Secondary Index: URI → Set mapping for node-based operations

    Key Benefits & Use Cases

    • Performance Optimization:

      • Fast session lookup: O(1) access by SessionId via primary map
      • Efficient node operations: O(1) lookup of all sessions on a specific node URI
      • Batch operations: Optimized removal of multiple sessions from the same node
    • Grid Event Handling:

      • NodeRemovedEvent: Quickly find and remove all sessions from a removed node
      • NodeRestartedEvent: Efficiently clean up sessions from a restarted node
      • Session routing: Fast determination of which node hosts a specific session

    @VietND96 VietND96 merged commit 0f831ba into trunk Jul 28, 2025
    32 of 33 checks passed
    @VietND96 VietND96 deleted the node-server branch July 28, 2025 18:08
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    B-grid Everything grid and server related Review effort 4/5
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    3 participants