docs/source/en/training/distributed_inference.md (123 additions, 1 deletion)

# Distributed inference

On distributed setups, you can run inference across multiple GPUs with 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) or [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html), which is useful for generating with multiple prompts in parallel.

> [!TIP]
> You can use `device_map` within a [`DiffusionPipeline`] to distribute its model-level components on multiple devices. Refer to the [Device placement](../tutorials/inference_with_big_models#device-placement) guide to learn more.
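
For example, a minimal sketch of pipeline-level placement (the checkpoint id here is only illustrative) looks like this:

```py
import torch
from diffusers import DiffusionPipeline

# Sketch: let Accelerate spread the pipeline's model-level components
# (transformer/UNet, text encoders, VAE) across the visible GPUs.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # illustrative checkpoint
    torch_dtype=torch.float16,
    device_map="balanced",
)
print(pipeline.hf_device_map)  # shows which device each component was placed on
```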
## Model sharding

Modern diffusion systems such as [Flux](../api/pipelines/flux) are very large and have multiple models. For example, [Flux.1-Dev](https://hf.co/black-forest-labs/FLUX.1-dev) is made up of two text encoders - [T5-XXL](https://hf.co/google/t5-v1_1-xxl) and [CLIP-L](https://hf.co/openai/clip-vit-large-patch14) - a [diffusion transformer](../api/models/flux_transformer), and a [VAE](../api/models/autoencoderkl). With a model this size, it can be challenging to run inference on consumer GPUs.

Model sharding is a technique that distributes models across GPUs, making inference with large models possible. The example below assumes two 16GB GPUs are available for inference.

Start by computing the text embeddings with the text encoders. Keep the text encoders on two GPUs by setting `device_map="balanced"`. The `balanced` strategy evenly distributes the model on all available GPUs. Assign each text encoder the maximum amount of memory available on each GPU with the `max_memory` parameter.

> [!TIP]
> **Only** load the text encoders for this step! The diffusion transformer and VAE are loaded in a later step to preserve memory.
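
As a sketch of this step (the prompt and the 16GB-per-GPU `max_memory` values are illustrative), load only the text encoders and tokenizers by passing `transformer=None` and `vae=None`, then call `encode_prompt` to compute the embeddings reused later for denoising.

```py
from diffusers import FluxPipeline
import torch

prompt = "a photo of a dog with cat-like look"  # illustrative prompt

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=None,                   # skip the diffusion transformer for now
    vae=None,                           # skip the VAE for now
    device_map="balanced",              # spread the text encoders over the GPUs
    max_memory={0: "16GB", 1: "16GB"},  # cap how much memory each GPU may use
    torch_dtype=torch.bfloat16,
)

with torch.no_grad():
    print("Encoding prompts.")
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=None, max_sequence_length=512
    )
```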
Once the text embeddings are computed, remove them from the GPU to make space for the diffusion transformer.

```py
import gc
import torch

def flush():
    # Release cached GPU memory and reset the peak-memory statistics.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()

del pipeline.text_encoder
del pipeline.text_encoder_2
del pipeline.tokenizer
del pipeline.tokenizer_2
del pipeline

flush()
```

Load the diffusion transformer next, which has 12.5B parameters. This time, set `device_map="auto"` to automatically distribute the model across two 16GB GPUs. The `auto` strategy is backed by [Accelerate](https://hf.co/docs/accelerate/index) and available as a part of the [Big Model Inference](https://hf.co/docs/accelerate/concept_guides/big_model_inference) feature. It starts by distributing a model across the fastest device first (GPU) before moving to slower devices like the CPU and hard drive if needed.

Assign the maximum amount of memory for each GPU to allocate with the `max_memory` parameter.
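
As a sketch (again with illustrative 16GB-per-GPU `max_memory` values), load the transformer on its own:

```py
from diffusers import FluxTransformer2DModel
import torch

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    device_map="auto",                  # let Accelerate decide the placement
    max_memory={0: "16GB", 1: "16GB"},  # cap how much memory each GPU may use
    torch_dtype=torch.bfloat16,
)
```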
> [!TIP]
> Try `print(transformer.hf_device_map)` to see how the model is distributed across devices.

Add the transformer model to the pipeline for denoising, but set the other model-level components like the text encoders and VAE to `None` because you don't need them yet.

```py
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    vae=None,
    transformer=transformer,
    torch_dtype=torch.bfloat16
)

print("Running denoising.")
height, width = 768, 1360
latents = pipeline(
    prompt_embeds=prompt_embeds,  # reuse the embeddings computed earlier
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=50,
    guidance_scale=3.5,
    height=height,
    width=width,
    output_type="latent",  # keep the output as latents; the VAE decodes them next
).images
```
Finally, decode the latents with the VAE into an image. The VAE is typically small enough to be loaded on a single GPU.

```py
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
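
# The lines below are a sketch of the decoding step rather than a verbatim recipe.
# They reuse `latents`, `height`, and `width` from the denoising step, plus the
# torch/FluxPipeline imports from the earlier snippets, and mirror FluxPipeline's
# own decode logic (the scale-factor convention can differ between diffusers versions).
vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="vae", torch_dtype=torch.bfloat16
).to("cuda")
vae_scale_factor = 2 ** (len(vae.config.block_out_channels))
image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)

with torch.no_grad():
    print("Running decoding.")
    # Flux latents are packed into a sequence; unpack them into an image grid first.
    latents = FluxPipeline._unpack_latents(latents, height, width, vae_scale_factor)
    latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor

    image = vae.decode(latents, return_dict=False)[0]
    image = image_processor.postprocess(image, output_type="pil")
    image[0].save("split_transformer.png")
```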
By selectively loading and unloading the models you need at a given stage and sharding the largest models across multiple GPUs, it is possible to run inference with large models on consumer GPUs.