Potential hang on AMD GPUs with SDMA #2070
                  
                    
MartinGirard started this conversation in General
            Replies: 1 comment
-
We have not seen this problem on Frontier.
-
I've been running simulations on MI300A GPUs. Occasionally, a simulation simply hangs with no error, sometimes only after tens of millions of MD steps. Attaching a debugger shows that it is stuck in a synchronization call, i.e. the host is waiting for the device to finish.
My compute center has advised setting HSA_ENABLE_SDMA=0. The small tests I have run so far suggest that this solves the problem, but because the hang is stochastic, it is very hard to confirm the diagnosis.
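For anyone who wants to try the same workaround, here is a minimal sketch of applying the variable per run from a Python launcher rather than globally. The helper names (`sdma_disabled_env`, `run_without_sdma`) and the placeholder child command are hypothetical and only for illustration. As far as I understand, the variable has to be in the environment before the ROCm runtime initializes, so it is set on the child process at launch.

```python
import os
import subprocess
import sys


def sdma_disabled_env(base_env=None):
    """Return a copy of the environment with HSA_ENABLE_SDMA set to "0".

    Hypothetical helper: copies os.environ (or base_env, if given) so the
    parent process environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_ENABLE_SDMA"] = "0"
    return env


def run_without_sdma(command):
    """Run `command` (a list of arguments) with SDMA disabled for that process only."""
    return subprocess.run(command, env=sdma_disabled_env(), check=False)


if __name__ == "__main__":
    # Placeholder child process: it only echoes the variable it inherited.
    # Replace it with the actual simulation command.
    child = [sys.executable, "-c", "import os; print(os.environ['HSA_ENABLE_SDMA'])"]
    result = run_without_sdma(child)
    print("exit code:", result.returncode)
```

Setting the variable per run makes it easier to compare otherwise identical jobs with and without SDMA, which seems to be the only practical way to gather statistics on a hang this rare.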
Has this issue been seen elsewhere?