Hi @simey1128,

Delayed parameter update (DPU) has not been merged into the main DeepSpeed repo because it introduces one step of staleness in the parameters, which changes the loss slightly at each training step. As a result, it cannot pass the DeepSpeed unit tests that we added to check the correctness of system optimizations.
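To make the staleness concrete, here is a toy sketch in plain Python. It is not the DeepSpeed or `delay_param_update` code: the quadratic loss, the learning rate, and the helper names (`loss`, `grad`, `train`) are all assumptions chosen for illustration. It compares ordinary gradient descent with a variant that applies each gradient one step late, which is roughly the effect described above.

```python
# Toy illustration only: a 1-D quadratic, not a real model or DeepSpeed code.

def loss(w):
    # Hypothetical loss with its minimum at w = 3.0
    return 0.5 * (w - 3.0) ** 2

def grad(w):
    # Gradient of the loss above
    return w - 3.0

def train(steps, lr, delayed):
    """Run gradient descent and return the loss seen at each step.

    With delayed=True, the gradient computed in step i is only applied
    at the end of step i+1, so every forward pass sees parameters that
    are one update behind.
    """
    w = 0.0
    pending = None  # gradient from the previous step, not yet applied
    losses = []
    for _ in range(steps):
        losses.append(loss(w))
        g = grad(w)
        if delayed:
            if pending is not None:
                w -= lr * pending  # apply the previous step's gradient
            pending = g            # this step's gradient waits one step
        else:
            w -= lr * g            # apply immediately
    return losses

sync_losses = train(steps=20, lr=0.1, delayed=False)
dpu_losses = train(steps=20, lr=0.1, delayed=True)

for step, (a, b) in enumerate(zip(sync_losses, dpu_losses)):
    print(f"step {step:2d}  sync={a:.4f}  delayed={b:.4f}")
```

In this toy setting both runs converge, but the per-step losses are not identical, which is why a test that compares losses step by step against a baseline would fail.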

To enable delayed parameter update, see the files that need to be changed in the repo https://github.com/jren73/delay_param_update. Note that this implementation is based on DeepSpeed v0.3.0.

Replies: 1 comment 4 replies

Answer selected by simey1128
Category
Q&A
3 participants