@a-r-r-o-w
Contributor

Not quite sure what causes the following test failures: https://github.com/huggingface/diffusers/actions/runs/13384935248/job/37379809841#step:6:10497

From some debugging, the following seems to be happening:

  • test_disk_offload_without_safetensors and test_disk_offload_with_safetensors run first. They add Accelerate hooks to the model so that the device map is handled correctly.
  • When the group offloading tests run afterwards, every newly created model instance already carries Accelerate hooks as well, for some reason.

This makes me believe Accelerate is applying hooks at the class-level instead of the instance-level (I'm not quite sure yet & will look into accelerate code as soon as I can).
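
To make the class-level vs instance-level distinction concrete, here is a toy sketch in plain Python. It is not Accelerate's code and does not claim to show what Accelerate actually does; `make_block_class`, `attach_to_instance`, `attach_to_class` and the `_hook` attribute are hypothetical names used only for illustration.

```python
def make_block_class():
    """Return a fresh toy class so the two experiments do not affect each other."""
    class Block:
        def forward(self, x):
            return x + 1
    return Block


def attach_to_instance(block):
    # Stored in the instance __dict__: only this object sees it.
    block._hook = "offload"


def attach_to_class(block):
    # Stored on the class: every instance, including future ones, sees it.
    type(block)._hook = "offload"


# Instance-level attachment: a new instance stays clean.
BlockA = make_block_class()
attach_to_instance(BlockA())
instance_level_leaks = hasattr(BlockA(), "_hook")

# Class-level attachment: a new instance already carries the hook.
BlockB = make_block_class()
attach_to_class(BlockB())
class_level_leaks = hasattr(BlockB(), "_hook")

print(instance_level_leaks, class_level_leaks)
```

If the hypothesis holds, the failing tests would correspond to the second case: state attached during the disk offload tests would still be visible on models constructed later.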

I've added a new test (just for repro purposes) that shows the above behaviour consistently happens on some models. But for some models, it works without problems 🤷‍♂️

pytest -s tests/models/transformers/test_models_transformer_cogvideox.py -k test_error_when_disk_offload_run_together_with_group_offloading
FAILED tests/models/transformers/test_models_transformer_cogvideox.py::CogVideoX1_5TransformerTests::test_error_when_disk_offload_run_together_with_group_offloading - ValueError: Cannot apply group offloading to a module that is already applying an alternative offloading strategy from Accelerate. If you want to apply group offloading, please disable the existing offloading strategy first. Offending module: time_embedding.act (<class 'torch.nn.modules.activation.SiLU'>)
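
A small diagnostic along these lines could help narrow down where the hook comes from. It is only a sketch: it assumes Accelerate marks hooked modules with a `_hf_hook` attribute (worth double-checking against the installed version), and `find_hooked_modules`, `ToyModule` and `ToyAct` are hypothetical names. The toy classes only mimic `named_modules()` so the snippet runs without torch; on a real model, the helper would be called on the freshly created model before group offloading is applied.

```python
def find_hooked_modules(model, attr="_hf_hook"):
    """Return (name, where) pairs for submodules that expose `attr`.

    `where` is "instance" if the attribute lives in the module's own
    __dict__, and "class" if it is only reachable through the class.
    """
    found = []
    for name, module in model.named_modules():
        if attr in vars(module):
            found.append((name, "instance"))
        elif hasattr(type(module), attr):
            found.append((name, "class"))
    return found


class ToyModule:
    """Stand-in that exposes named_modules() like torch.nn.Module does."""

    def __init__(self, **children):
        self.children = children

    def named_modules(self, prefix=""):
        yield prefix, self
        for child_name, child in self.children.items():
            child_prefix = f"{prefix}.{child_name}" if prefix else child_name
            yield from child.named_modules(child_prefix)


class ToyAct(ToyModule):
    pass


model = ToyModule(time_embedding=ToyModule(act=ToyAct()))
model.children["time_embedding"]._hf_hook = object()  # instance-level marker
ToyAct._hf_hook = object()                            # class-level marker

report = find_hooked_modules(model)
print(report)
```

An entry reported as "class" on a model that was never dispatched would support the class-level explanation; an "instance" entry would instead suggest that the same module object is being reused between tests.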

cc @SunMarc @DN6

@SunMarc
Member

SunMarc commented Feb 19, 2025

The above PR should fix your issue! Feel free to merge this PR so that we don't run into the same issue again.
