System Info
GPU Type: A6000
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
git clone https://huggingface.co/Qwen/Qwen2-0.5B-Instruct
python3 ./convert_checkpoint.py --model_dir ./Qwen2-0.5B-Instruct --output_dir ./tllm_checkpoint_1gpu_sq --dtype float16 --smoothquant 0.5
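
For context on the failure below: Qwen2-0.5B-Instruct ties lm_head.weight to the input embedding, so both names point at a single storage. A minimal check of my own (not part of the repro; it assumes the checkpoint was cloned as above and that the 0.5B config sets tie_word_embeddings=True) confirms the aliasing:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the freshly cloned checkpoint in the same dtype the conversion uses.
model = AutoModelForCausalLM.from_pretrained(
    "./Qwen2-0.5B-Instruct", torch_dtype=torch.float16
)

# With tied word embeddings, the output head reuses the embedding matrix
# instead of owning its own copy.
emb = model.get_input_embeddings().weight
head = model.get_output_embeddings().weight
print(model.config.tie_word_embeddings)   # expected True for this checkpoint
print(emb.data_ptr() == head.data_ptr())  # True -> one shared storage
```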
Expected behavior
The model converts successfully and the quantized checkpoint is saved to ./tllm_checkpoint_1gpu_sq.
Actual behavior
Cloning into 'Qwen2-0.5B-Instruct'...
remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 33 (delta 12), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (33/33), 3.60 MiB | 6.54 MiB/s, done.
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024073000
0.12.0.dev2024073000
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1429: FutureWarning: The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
calibrating model: 0%| | 0/512 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
calibrating model: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [00:20<00:00, 24.44it/s]
Weights loaded. Total time: 00:00:18
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/qwen/./convert_checkpoint.py", line 309, in <module>
main()
File "/app/tensorrt_llm/examples/qwen/./convert_checkpoint.py", line 301, in main
convert_and_save_hf(args)
File "/app/tensorrt_llm/examples/qwen/./convert_checkpoint.py", line 228, in convert_and_save_hf
QWenForCausalLM.quantize(args.model_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 380, in quantize
convert.quantize(hf_model_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1207, in quantize
safetensors.torch.save_file(
File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 284, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 480, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'transformer.vocab_embedding.weight', 'lm_head.weight'}].
A potential way to correctly save your model is to use `save_model`.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
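
The RuntimeError reproduces outside TensorRT-LLM whenever two entries of a tensor dict alias the same storage; safetensors rejects aliased tensors by design. A standalone sketch of the failure mode and the obvious workaround, cloning one of the tied weights before saving (the file name is hypothetical; this is not the converter's code):

```python
import torch
import safetensors.torch

weight = torch.zeros(8, 8, dtype=torch.float16)
# Two dict entries backed by the same storage, like the tied
# vocab_embedding / lm_head pair in the Qwen checkpoint.
tensors = {
    "transformer.vocab_embedding.weight": weight,
    "lm_head.weight": weight,
}

try:
    safetensors.torch.save_file(tensors, "rank0.safetensors")
except RuntimeError as e:
    print(e)  # "Some tensors share memory, this will lead to duplicate memory on disk ..."

# Breaking the aliasing with a clone lets save_file succeed.
tensors["lm_head.weight"] = tensors["lm_head.weight"].clone()
safetensors.torch.save_file(tensors, "rank0.safetensors")
```

The `save_model` hint in the error message refers to `safetensors.torch.save_model(model, path)`, which takes an nn.Module and keeps only one copy of each group of shared tensors; since the converter works with a plain tensor dict, cloning the tied weight before the `safetensors.torch.save_file` call in tensorrt_llm/models/qwen/convert.py looks like the simpler fix.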
Additional notes
transformers version: 4.42.4
TensorRT-LLM version: 0.12.0.dev2024073000