Skip to content

Conversation

@gulsumgudukbay
Copy link

Some tests that require multiple GPUs were executed in the run_single_gpu.py script, which was an issue because this script only dedicates a single GPU per each test.

Therefore, this PR identified the tests that require multiple GPUs and migrated them to run_multi_gpu.sh script and excluded those tests from the run_single_gpu.py script.

After this is merged, the abort support that lives in run_single_gpu.py will be also added to run_multi_gpu.sh script.

@gulsumgudukbay gulsumgudukbay requested a review from a team as a code owner August 12, 2025 05:10
--reruns 3 \
tests/pmap_test.py
# Multi-GPU test files
MULTI_GPU_TESTS=(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume, this variable must have the same definition in both runners, right?
Then perhaps you could just refactor it to a separate script that both runners will source to get the same variable value?
This would simplify and robustify maintenance of the tests..

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, the second runner is a python script actually... That makes things more annoying, but still doable. I fear, definitions could easily go out of sync if left as is. Even now it's hard to validate they are in sync..

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on this, @gulsumgudukbay ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

working on it

@gulsumgudukbay gulsumgudukbay requested a review from Arech8 August 12, 2025 19:24
Copy link
Collaborator

@charleshofer charleshofer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, other than the problem of duplicating the list multi-GPU tests like Aleksei mentioned. I'm okay to merge after that's fixed.

Copy link

@Arech8 Arech8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job, Gulsum, thanks! I learned about mapfile :)

@gulsumgudukbay
Copy link
Author

Great job, Gulsum, thanks! I learned about mapfile :)

thanks! yep, I learned it new as well, such a nice tool.

@gulsumgudukbay gulsumgudukbay merged commit f697dde into rocm-jaxlib-v0.6.0 Aug 15, 2025
@gulsumgudukbay gulsumgudukbay deleted the migrate_mgpu_tests branch August 15, 2025 06:45
zahiqbal pushed a commit that referenced this pull request Oct 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants