@frost-intel
Contributor

Fixes #2135

The failing ProcessGroupXCCL tests were caused by the group ID not being set properly, because some members already defined in the parent Backend::Options class were redefined.
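
To illustrate how a redefinition like this can leave a value unset, here is a minimal C++ sketch. It assumes the problem was a derived options struct redeclaring a member that the parent already declares. All names (`BaseOptions`, `ShadowingOptions`, `FixedOptions`, `group_id`, `read_group_id`) are hypothetical stand-ins and are not taken from the PyTorch or torch-xpu-ops sources.

```cpp
#include <string>

// Hypothetical stand-in for a parent options class (not the real Backend::Options).
struct BaseOptions {
  std::string group_id;  // generic code reads this through a base reference
};

// Buggy pattern: the derived struct redeclares a member the parent already has.
// The new member hides BaseOptions::group_id for code that uses the derived type.
struct ShadowingOptions : BaseOptions {
  std::string group_id;
};

// Fixed pattern: the derived struct reuses the inherited member.
struct FixedOptions : BaseOptions {};

// Generic code that only knows about the base type.
std::string read_group_id(const BaseOptions& opts) {
  return opts.group_id;
}

// Writes the derived member only, so the base member stays empty.
std::string demo_shadowing() {
  ShadowingOptions opts;
  opts.group_id = "pg0";
  return read_group_id(opts);
}

// Writes the single inherited member, which the base reference then sees.
std::string demo_fixed() {
  FixedOptions opts;
  opts.group_id = "pg0";
  return read_group_id(opts);
}
```

In this sketch, `demo_shadowing()` returns an empty string while `demo_fixed()` returns `"pg0"`. Removing the duplicate declaration from the derived struct is enough to make both views of the object agree.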

In addition, I removed the part of the test that checks completed work, since completed work will be recorded correctly once Watchdog functionality is added to ProcessGroupXCCL in a separate update.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes failing ProcessGroupXCCL tests by addressing two issues: correcting the group ID initialization in the Backend::Options class and temporarily disabling assertions related to completed work tracking. The completed work tracking will be properly implemented in a future update that adds Watchdog functionality.

Key changes:

  • Modernized string formatting from % operator to f-strings for device creation
  • Commented out assertions checking last_completed_collective status until Watchdog implementation is complete

Contributor

@guangyey guangyey left a comment

Thanks~

@guangyey guangyey added this pull request to the merge queue Nov 4, 2025
Merged via the queue into main with commit 845bdfd Nov 4, 2025
24 of 25 checks passed
@guangyey guangyey deleted the frost/pgxccl_trace_test_fix branch November 4, 2025 04:09

Development

Successfully merging this pull request may close these issues.

[distributed] test_short_pickle_include_collectives tests fail

3 participants