
Description
System Info
GPU: NVIDIA A100 80GB
Package version:
tensorrt-9.2.0.post12.dev5-cp310-none-linux_x86_64.whl
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Installation:
  python -m pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com
- Create a SmoothQuant checkpoint for LLaMA:
  python ./examples/llama/convert_checkpoint.py --model_dir ~/Llama-2-13b-chat-hf --output_dir ~/fp16-tp4-sq5 --dtype float16 --tp_size 4 --smoothquant 0.5 --per_token --per_channel --workers 4
Expected behavior
The converted checkpoint should be created in the output directory.
actual behavior
Error raised at https://github.com/NVIDIA/TensorRT-LLM/blob/v0.8.0/examples/llama/convert_checkpoint.py#L1502:
ValueError: You are trying to save a non contiguous tensor: transformer.layers.0.attention.qkv.weight which is not allowed. It either means you are trying to save tensors which are reference of each other in which case it's recommended to save only the full tensors, and reslice at load time, or simply call .contiguous() on your tensor to pack it before saving.
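For reference, this error message comes from safetensors, which refuses to serialize tensor views that do not own contiguous storage. A minimal standalone sketch (assuming torch and safetensors are installed; the tensor name and the transposed view are only illustrative) that reproduces the same ValueError:

```python
import torch
from safetensors.torch import save_file

weight = torch.randn(8, 8)
# A transposed view shares storage with `weight` and is therefore
# non-contiguous, similar in spirit to sliced/split QKV weights.
qkv_view = weight.t()
assert not qkv_view.is_contiguous()

try:
    save_file({"transformer.layers.0.attention.qkv.weight": qkv_view},
              "repro.safetensors")
except ValueError as err:
    print(err)  # "You are trying to save a non contiguous tensor ..."

# Packing the view with .contiguous() makes the save succeed.
save_file({"transformer.layers.0.attention.qkv.weight": qkv_view.contiguous()},
          "repro.safetensors")
```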
additional notes
No such error is seen with release 0.7.1. My guess is that the function get_tllm_linear_sq_weight returns some non-contiguous tensors.
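If that guess is right, a possible diagnostic and workaround is sketched below. This is an assumption, not a verified fix: `weights` stands in for the checkpoint dict assembled in examples/llama/convert_checkpoint.py before the safetensors save, and the helper names are hypothetical.

```python
import torch

def find_non_contiguous(weights: dict) -> list:
    """Names of tensors in the converted checkpoint that are not contiguous."""
    return [name for name, t in weights.items()
            if isinstance(t, torch.Tensor) and not t.is_contiguous()]

def pack_contiguous(weights: dict) -> dict:
    """Copy of the checkpoint dict with every tensor packed into its own storage."""
    return {name: t.contiguous() if isinstance(t, torch.Tensor) else t
            for name, t in weights.items()}

# Hypothetical usage right before the save in convert_checkpoint.py:
#   print(find_non_contiguous(weights))
#   safetensors.torch.save_file(pack_contiguous(weights), output_path)
```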