System Info
GPU: NVIDIA L20
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am trying to quantize CodeQwen1.5 7B Chat to FP8 using a modified version of the example quantization script:
python quantization/quantize.py --model_dir /mnt/models/CodeQwen1.5-7B-Chat \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir /mnt/trt_models/codeqwen1.5_7b_checkpoint_1gpu_fp8_fp8kv \
--calib_size 512 \
--calib_dataset /mnt/dataset/cnn_dailymail
Expected behavior
The example quantization/quantize.py script calls quantize_and_export() to run the quantization, which is defined in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/quantization/quantize_by_modelopt.py. There, get_tokenizer should automatically load the tokenizer from my model_dir and set the pad_token as well as the eos_token.
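In other words, calling get_tokenizer on the model directory by itself should already return a tokenizer with a usable pad_token. A minimal sketch of that expectation (importing straight from the module named in the traceback; model_type="qwen" matches the value shown in the assertion message):

from tensorrt_llm.quantization.quantize_by_modelopt import get_tokenizer

# Expected: a tokenizer whose pad_token and eos_token are both set.
tokenizer = get_tokenizer("/mnt/models/CodeQwen1.5-7B-Chat",
                          max_seq_length=2048,
                          model_type="qwen")
print(tokenizer.pad_token, tokenizer.eos_token)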
actual behavior
But it failed to set the pad_token:
[07/16/2024-13:46:30] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[07/16/2024-13:46:30] [TRT-LLM] [I] Starting TensorRT-LLM init.
[TensorRT-LLM][INFO] Set logger level by INFO
[07/16/2024-13:46:30] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100
Initializing model from /mnt/models/CodeQwen1.5-7B-Chat
[07/16/2024-13:47:14] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:28<00:00, 7.20s/it]
[TensorRT-LLM][WARNING] The manually set model data type is torch.float16, but the data type of the HuggingFace model is torch.bfloat16.
Initializing tokenizer from /mnt/models/CodeQwen1.5-7B-Chat
Traceback (most recent call last):
File "quantization/quantize.py", line 90, in <module>
quantize_and_export(
File "/opt/conda/lib/python3.8/site-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 289, in quantize_and_export
tokenizer = get_tokenizer(model_dir,
File "/opt/conda/lib/python3.8/site-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 147, in get_tokenizer
assert tokenizer.pad_token is not None, f"Pad token for {model_type} cannot be set!"
AssertionError: Pad token for qwen cannot be set!
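The failure seems to come from the qwen-specific branch in get_tokenizer. A minimal probe of the tokenizer alone (a sketch; my guess is that the hard-coded id 151643, which Qwen1.5 uses as its pad/eos token, does not map to a token in CodeQwen1.5's tokenizer, so the subsequent pad_token assignment ends up with nothing usable):

from transformers import AutoTokenizer

# Load the tokenizer exactly the way get_tokenizer() does.
tokenizer = AutoTokenizer.from_pretrained(
    "/mnt/models/CodeQwen1.5-7B-Chat",
    model_max_length=2048,
    padding_side="left",
    trust_remote_code=True,
)

# Probe the hard-coded id that get_tokenizer assigns as pad/eos for "qwen".
print("vocab size:", tokenizer.vocab_size)
try:
    print("token for id 151643:", tokenizer.convert_ids_to_tokens(151643))
except Exception as exc:  # out-of-range ids may raise, depending on the tokenizer class
    print("convert_ids_to_tokens(151643) raised:", exc)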
additional notes
To get this case to work, I commented out every line except the AutoTokenizer.from_pretrained() call:
def get_tokenizer(ckpt_path, max_seq_length=2048, model_type=None):
    print(f"Initializing tokenizer from {ckpt_path}")
    tokenizer = AutoTokenizer.from_pretrained(
        ckpt_path,
        model_max_length=max_seq_length,
        padding_side="left",
        trust_remote_code=True,
    )
    # if model_type and model_type == "qwen":
    #     # qwen use token id 151643 as pad and eos tokens
    #     tokenizer.pad_token = tokenizer.convert_ids_to_tokens(151643)
    #     tokenizer.eos_token = tokenizer.convert_ids_to_tokens(151643)

    # # can't set attribute 'pad_token' for "<unk>"
    # if tokenizer.pad_token != "<unk>":  # nosec B105
    #     tokenizer.pad_token = tokenizer.eos_token
    # if tokenizer.pad_token is None:
    #     tokenizer.pad_token = tokenizer.eos_token
    # assert tokenizer.pad_token is not None, f"Pad token for {model_type} cannot be set!"

    return tokenizer
I know that commenting out these lines will certainly affect other models' conversion, so it seems this function needs a proper fix to support CodeQwen1.5.
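One possible direction for such a fix (a sketch only, not a tested patch; it drops the unconditional pad_token = eos_token overwrite, which would need to be checked against the other model types this function supports): only apply the qwen-specific override when the hard-coded id actually resolves to a token, and otherwise fall back to eos_token/unk_token.

from transformers import AutoTokenizer

def get_tokenizer(ckpt_path, max_seq_length=2048, model_type=None):
    print(f"Initializing tokenizer from {ckpt_path}")
    tokenizer = AutoTokenizer.from_pretrained(
        ckpt_path,
        model_max_length=max_seq_length,
        padding_side="left",
        trust_remote_code=True,
    )
    if model_type and model_type == "qwen":
        # Qwen(1.5) uses token id 151643 as pad and eos tokens, but CodeQwen1.5
        # does not appear to carry this id, so only override when it resolves.
        qwen_pad = tokenizer.convert_ids_to_tokens(151643)
        if qwen_pad is not None:
            tokenizer.pad_token = qwen_pad
            tokenizer.eos_token = qwen_pad
    # Fall back to eos_token (or unk_token) when no pad token is available.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token or tokenizer.unk_token
    assert tokenizer.pad_token is not None, f"Pad token for {model_type} cannot be set!"
    return tokenizer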