@linziyi96
Contributor
No description provided.

This commit makes two changes during model creation:
1. Decouples promote_trainable_params_to_fp32 from the model __init__. This
   avoids the fp32 cast in inference-only mode, saving memory (#4).
2. Uses a context manager to handle the default tensor type change. In the
   previous version, the default tensor type was reset to torch.FloatTensor
   after creating the vision model, which is technically incorrect: it should
   be restored to whatever the previous default tensor type was. We implement
   our own context manager because the official context managers appear
   incomplete at this time (PyTorch 2.0.1): no dtype manager is provided, and
   set_default_device has no effect on the torch.Tensor calls used in
   fairscale. A minimal sketch of such a context manager is given below.
It is probably safer to keep CLIP at its original precision (e.g., fp16)
regardless of the autocast setting: some casts (e.g., from fp16 to bf16) may
be lossy and can potentially harm the pre-trained model.
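
For illustration, here is a minimal sketch of such a context manager: it sets the default tensor type from a dtype/device pair and restores the previous default on exit instead of hard-coding torch.FloatTensor. The function name, the dtype-to-legacy-type mapping, and the restriction to a few dtypes are assumptions for this sketch, not necessarily the exact implementation in this PR.

```python
import contextlib
import torch

@contextlib.contextmanager
def default_tensor_type(dtype=torch.float32, device="cpu"):
    # Sketch only: torch.set_default_tensor_type also affects legacy
    # torch.Tensor(...) constructor calls (as used in fairscale), unlike
    # torch.set_default_device / torch.set_default_dtype alone.
    legacy_names = {
        torch.float32: "FloatTensor",
        torch.float16: "HalfTensor",
        torch.bfloat16: "BFloat16Tensor",
    }
    prev_type = torch.tensor([]).type()  # e.g. 'torch.FloatTensor'
    new_type = ("torch.cuda." if device == "cuda" else "torch.") + legacy_names[dtype]
    torch.set_default_tensor_type(new_type)
    try:
        yield
    finally:
        # Restore the previous default rather than forcing torch.FloatTensor.
        torch.set_default_tensor_type(prev_type)
```

Model construction (e.g., the vision tower) can then be wrapped in `with default_tensor_type(dtype=torch.half, device="cuda"): ...`, and promote_trainable_params_to_fp32 can be called afterwards only when the model is actually being trained.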

The changes are limited to llama.py for now, since a lot of copy-pasted code
may be refactored in the future (#3).
Checkpoint merging is supported in misc/tensor_parallel.py. Merging requires
that checkpoint_mp_world_size % mp_world_size == 0 (see the sketch below).
Support for splitting (i.e., when mp_world_size % checkpoint_mp_world_size
== 0) and redistribution (for arbitrary mp_world_size and
checkpoint_mp_world_size values) will be added in the future.
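
For illustration, a minimal sketch of the merge logic under that divisibility constraint: each target rank concatenates a contiguous group of saved shards along each parameter's partition dimension. The function name merge_checkpoints and the get_shard_dim callback are assumptions for this sketch, not the actual API in misc/tensor_parallel.py.

```python
import torch

def merge_checkpoints(shard_paths, mp_world_size, get_shard_dim):
    """Merge model-parallel checkpoint shards down to mp_world_size ranks.

    Hypothetical sketch. get_shard_dim(key) is an assumed callback returning
    the dimension along which a parameter is partitioned (e.g. 0 for
    column-parallel, 1 for row-parallel) or None for replicated parameters.
    """
    ckpt_mp_world_size = len(shard_paths)
    assert ckpt_mp_world_size % mp_world_size == 0, \
        "merge requires checkpoint_mp_world_size % mp_world_size == 0"
    group = ckpt_mp_world_size // mp_world_size

    merged_shards = []
    for rank in range(mp_world_size):
        # Each target rank absorbs a contiguous group of saved shards.
        parts = [torch.load(p, map_location="cpu")
                 for p in shard_paths[rank * group:(rank + 1) * group]]
        out = {}
        for key, value in parts[0].items():
            dim = get_shard_dim(key)
            if dim is None:
                out[key] = value  # replicated parameter: take one copy
            else:
                out[key] = torch.cat([p[key] for p in parts], dim=dim)
        merged_shards.append(out)
    return merged_shards
```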

The multi_turn demo is also changed to use the new loading function with
merge support.
linziyi96 merged commit ab84c96 into main on Aug 4, 2023
linziyi96 deleted the multi_turn_demo_dev branch on August 20, 2023