
chatglm2-6b int8+kv8 build failed on 0.8.0 branch #1239

@NaNAGISaSA

Description


System Info

    - CPU architecture: x86_64
    - GPU properties
      - GPU name: NVIDIA A100
      - GPU memory size: 40G
    - Libraries
      - TensorRT-LLM branch or tag: v0.8.0
      - TensorRT-LLM commit: 5955b8afbad
      - Container used: yes, `make -C docker release_build` on v0.8.0 branch
    - NVIDIA driver version: 525.89.02
    - OS: Ubuntu 22.04

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

pip install transformers==4.33.0 # fix: https://huggingface.co/THUDM/chatglm2-6b/discussions/87

tp_size=1

python examples/chatglm/convert_checkpoint.py --model_dir ${hf_model_dir} \
    --tp_size ${tp_size} \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --int8_kv_cache \
    --workers ${tp_size} \
    --output_dir ${quant_out_dir}/int8-kv8/${tp_size}-gpu/

trtllm-build --checkpoint_dir ${quant_out_dir}/int8-kv8/${tp_size}-gpu/ \
    --output_dir ${trt_out_dir}/int8-kv8/${tp_size}-gpu/ \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --context_fmha_fp32_acc enable \
    --remove_input_padding enable \
    --max_batch_size 128 \
    --max_input_len 2048 \
    --max_output_len 2048
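
To help triage, the tensor names actually stored in the converted checkpoint can be diffed against the names the engine expects. Below is a minimal sketch, assuming the v0.8.0 checkpoint layout (config.json plus rankN.safetensors in the output directory); the path is illustrative, not from the original repro.

    # Sketch: list tensor names in the converted checkpoint so they can be
    # compared against the "expected by the engine" set in the build error.
    # Assumes a v0.8.0-style checkpoint directory; adjust the path to
    # ${quant_out_dir}/int8-kv8/1-gpu/ as appropriate.
    from safetensors import safe_open

    ckpt_path = "int8-kv8/1-gpu/rank0.safetensors"  # illustrative path
    with safe_open(ckpt_path, framework="pt") as f:
        for name in sorted(f.keys()):
            print(name)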

Expected behavior

Build succeeds.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Inferring chatglm version from path...
Chatglm version: chatglm2
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00, 1.36s/it]
Calibration: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:11<00:00, 5.42it/s]
Weights loaded. Total time: 00:06:07
Total time of converting checkpoints: 00:07:18
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
[03/06/2024-03:40:23] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set gemm_plugin to float16.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set lookup_plugin to None.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set lora_plugin to None.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set context_fmha to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set context_fmha_fp32_acc to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set paged_kv_cache to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set remove_input_padding to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set multi_block_mode to False.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set enable_xqa to True.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set tokens_per_block to 128.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[03/06/2024-03:40:23] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[03/06/2024-03:40:23] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 497, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 420, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 392, in build_and_save
    engine = build(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 272, in build
    model.load(weights)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 338, in load
    raise RuntimeError(err_msg)
RuntimeError: Provided tensor names are different from those expected by the engine.
Provided but not expected tensors: {'transformer.layers.2.attention.dense.act_scale', 'transformer.layers.25.attention.quantization_scaling_factor', 'transformer.layers.22.mlp.quantization_scaling_factor', 'transformer.layers.26.mlp.fc.act_scale', 'transformer.layers.24.attention.dense.act_scale', 'transformer.layers.18.mlp.fc.act_scale', 'transformer.layers.6.mlp.fc.act_scale', 'transformer.layers.19.mlp.fc.act_scale', 'transformer.layers.0.input_layernorm.scale_to_int', 'transformer.layers.4.attention.dense.act_scale', 'transformer.layers.21.mlp.fc.act_scale', 'transformer.layers.0.mlp.proj.act_scale', 'transformer.layers.16.post_layernorm.scale_to_int', 'transformer.layers.24.attention.quantization_scaling_factor', 'transformer.layers.17.attention.quantization_scaling_factor', 'transformer.layers.10.input_layernorm.scale_to_int', 'transformer.layers.0.mlp.fc.act_scale', 'transformer.layers.19.attention.quantization_scaling_factor', 'transformer.layers.15.mlp.fc.act_scale', 'transformer.layers.6.mlp.proj.act_scale', 'transformer.layers.9.attention.qkv.act_scale', 'transformer.layers.10.attention.dense.act_scale', 'transformer.layers.23.mlp.quantization_scaling_factor', 'transformer.layers.4.mlp.quantization_scaling_factor', 'transformer.layers.17.mlp.fc.act_scale', 'transformer.layers.21.input_layernorm.scale_to_int', 'transformer.layers.21.attention.dense.act_scale', 'transformer.layers.9.mlp.proj.act_scale', 'transformer.layers.1.mlp.proj.act_scale', 'transformer.layers.13.mlp.quantization_scaling_factor', 'transformer.layers.9.attention.dense.act_scale', 'transformer.layers.12.input_layernorm.scale_to_int', 'transformer.layers.21.attention.quantization_scaling_factor', 'transformer.layers.23.attention.quantization_scaling_factor', 'transformer.layers.14.mlp.quantization_scaling_factor', 'transformer.layers.16.input_layernorm.scale_to_int', 'transformer.layers.12.attention.quantization_scaling_factor', 'transformer.layers.11.attention.qkv.act_scale', 'transformer.layers.11.input_layernorm.scale_to_int', 'transformer.layers.26.post_layernorm.scale_to_int', 'transformer.layers.4.mlp.proj.act_scale', 'transformer.layers.5.mlp.fc.act_scale', 'transformer.layers.23.mlp.fc.act_scale', 'transformer.layers.26.attention.qkv.act_scale', 'transformer.layers.0.attention.quantization_scaling_factor', 'transformer.layers.2.attention.quantization_scaling_factor', 'transformer.layers.25.input_layernorm.scale_to_int', 'transformer.layers.19.input_layernorm.scale_to_int', 'transformer.layers.26.attention.quantization_scaling_factor', 'transformer.layers.21.mlp.proj.act_scale', 'transformer.layers.2.input_layernorm.scale_to_int', 'transformer.layers.25.mlp.proj.act_scale', 'transformer.layers.23.mlp.proj.act_scale', 'transformer.layers.15.attention.qkv.act_scale', 'transformer.layers.16.mlp.proj.act_scale', 'transformer.layers.8.mlp.proj.act_scale', 'transformer.layers.17.input_layernorm.scale_to_int', 'transformer.layers.1.attention.quantization_scaling_factor', 'transformer.layers.16.mlp.fc.act_scale', 'transformer.layers.1.attention.qkv.act_scale', 'transformer.layers.5.input_layernorm.scale_to_int', 'transformer.layers.4.mlp.fc.act_scale', 'transformer.layers.10.attention.quantization_scaling_factor', 'transformer.layers.9.mlp.quantization_scaling_factor', 'transformer.layers.22.mlp.proj.act_scale', 'transformer.layers.8.attention.dense.act_scale', 'transformer.layers.22.input_layernorm.scale_to_int', 'transformer.layers.27.attention.dense.act_scale', 'transformer.layers.27.attention.qkv.act_scale', 
'transformer.layers.3.input_layernorm.scale_to_int', 'transformer.layers.13.mlp.proj.act_scale', 'transformer.layers.24.mlp.proj.act_scale', 'transformer.layers.15.mlp.proj.act_scale', 'transformer.layers.22.post_layernorm.scale_to_int', 'transformer.layers.6.input_layernorm.scale_to_int', 'transformer.layers.19.mlp.quantization_scaling_factor', 'transformer.layers.8.mlp.quantization_scaling_factor', 'transformer.layers.13.post_layernorm.scale_to_int', 'transformer.layers.20.post_layernorm.scale_to_int', 'transformer.layers.11.attention.dense.act_scale', 'transformer.layers.1.mlp.quantization_scaling_factor', 'transformer.layers.20.attention.qkv.act_scale', 'transformer.layers.23.attention.dense.act_scale', 'transformer.layers.18.attention.dense.act_scale', 'transformer.layers.7.attention.quantization_scaling_factor', 'transformer.layers.22.attention.qkv.act_scale', 'transformer.layers.7.attention.qkv.act_scale', 'transformer.layers.26.mlp.quantization_scaling_factor', 'transformer.layers.22.mlp.fc.act_scale', 'transformer.layers.11.post_layernorm.scale_to_int', 'transformer.layers.2.post_layernorm.scale_to_int', 'transformer.layers.3.attention.qkv.act_scale', 'transformer.layers.17.post_layernorm.scale_to_int', 'transformer.layers.24.input_layernorm.scale_to_int', 'transformer.layers.10.mlp.quantization_scaling_factor', 'transformer.layers.3.post_layernorm.scale_to_int', 'transformer.layers.3.mlp.fc.act_scale', 'transformer.layers.12.mlp.proj.act_scale', 'transformer.layers.8.mlp.fc.act_scale', 'transformer.layers.4.attention.quantization_scaling_factor', 'transformer.layers.6.mlp.quantization_scaling_factor', 'transformer.layers.6.attention.quantization_scaling_factor', 'transformer.layers.27.mlp.proj.act_scale', 'transformer.layers.5.mlp.proj.act_scale', 'transformer.layers.12.mlp.fc.act_scale', 'transformer.layers.15.input_layernorm.scale_to_int', 'transformer.layers.24.post_layernorm.scale_to_int', 'transformer.layers.5.post_layernorm.scale_to_int', 'transformer.layers.23.post_layernorm.scale_to_int', 'transformer.layers.3.attention.dense.act_scale', 'transformer.layers.20.input_layernorm.scale_to_int', 'transformer.layers.7.mlp.fc.act_scale', 'transformer.layers.17.mlp.proj.act_scale', 'transformer.layers.20.attention.quantization_scaling_factor', 'transformer.layers.27.mlp.quantization_scaling_factor', 'transformer.layers.14.attention.quantization_scaling_factor', 'transformer.layers.11.attention.quantization_scaling_factor', 'transformer.layers.23.attention.qkv.act_scale', 'transformer.layers.17.attention.qkv.act_scale', 'transformer.layers.7.post_layernorm.scale_to_int', 'transformer.layers.9.post_layernorm.scale_to_int', 'transformer.layers.9.input_layernorm.scale_to_int', 'transformer.layers.14.mlp.fc.act_scale', 'transformer.layers.14.attention.qkv.act_scale', 'transformer.layers.3.mlp.quantization_scaling_factor', 'transformer.layers.0.mlp.quantization_scaling_factor', 'transformer.layers.18.post_layernorm.scale_to_int', 'transformer.layers.10.mlp.proj.act_scale', 'transformer.layers.7.mlp.quantization_scaling_factor', 'transformer.layers.13.attention.dense.act_scale', 'transformer.layers.17.mlp.quantization_scaling_factor', 'transformer.layers.27.attention.quantization_scaling_factor', 'transformer.layers.17.attention.dense.act_scale', 'transformer.layers.15.post_layernorm.scale_to_int', 'transformer.layers.18.attention.quantization_scaling_factor', 'transformer.layers.14.attention.dense.act_scale', 'transformer.layers.19.attention.qkv.act_scale', 
'transformer.layers.8.input_layernorm.scale_to_int', 'transformer.layers.24.attention.qkv.act_scale', 'transformer.layers.19.attention.dense.act_scale', 'transformer.layers.2.mlp.quantization_scaling_factor', 'transformer.layers.22.attention.dense.act_scale', 'transformer.layers.15.attention.dense.act_scale', 'transformer.layers.12.attention.qkv.act_scale', 'transformer.layers.25.mlp.fc.act_scale', 'transformer.layers.12.post_layernorm.scale_to_int', 'transformer.layers.26.attention.dense.act_scale', 'transformer.layers.13.input_layernorm.scale_to_int', 'transformer.layers.1.input_layernorm.scale_to_int', 'transformer.layers.10.mlp.fc.act_scale', 'transformer.layers.3.mlp.proj.act_scale', 'transformer.layers.11.mlp.proj.act_scale', 'transformer.layers.24.mlp.fc.act_scale', 'transformer.layers.23.input_layernorm.scale_to_int', 'transformer.layers.12.mlp.quantization_scaling_factor', 'transformer.layers.2.mlp.fc.act_scale', 'transformer.layers.4.attention.qkv.act_scale', 'transformer.layers.6.attention.qkv.act_scale', 'transformer.layers.9.mlp.fc.act_scale', 'transformer.layers.26.input_layernorm.scale_to_int', 'transformer.layers.19.mlp.proj.act_scale', 'transformer.layers.18.mlp.quantization_scaling_factor', 'transformer.layers.25.attention.qkv.act_scale', 'transformer.layers.21.post_layernorm.scale_to_int', 'transformer.layers.2.attention.qkv.act_scale', 'transformer.layers.15.mlp.quantization_scaling_factor', 'transformer.layers.7.input_layernorm.scale_to_int', 'transformer.layers.6.post_layernorm.scale_to_int', 'transformer.layers.18.input_layernorm.scale_to_int', 'transformer.layers.13.mlp.fc.act_scale', 'transformer.layers.14.mlp.proj.act_scale', 'transformer.layers.1.attention.dense.act_scale', 'transformer.layers.13.attention.quantization_scaling_factor', 'transformer.layers.10.attention.qkv.act_scale', 'transformer.layers.1.mlp.fc.act_scale', 'transformer.layers.7.attention.dense.act_scale', 'transformer.layers.22.attention.quantization_scaling_factor', 'transformer.layers.14.post_layernorm.scale_to_int', 'transformer.layers.6.attention.dense.act_scale', 'transformer.layers.24.mlp.quantization_scaling_factor', 'transformer.layers.9.attention.quantization_scaling_factor', 'transformer.layers.2.mlp.proj.act_scale', 'transformer.layers.13.attention.qkv.act_scale', 'transformer.layers.16.attention.dense.act_scale', 'transformer.layers.5.attention.qkv.act_scale', 'transformer.layers.5.attention.quantization_scaling_factor', 'transformer.layers.11.mlp.fc.act_scale', 'transformer.layers.3.attention.quantization_scaling_factor', 'transformer.layers.27.mlp.fc.act_scale', 'transformer.layers.20.attention.dense.act_scale', 'transformer.layers.21.mlp.quantization_scaling_factor', 'transformer.layers.25.attention.dense.act_scale', 'transformer.layers.8.post_layernorm.scale_to_int', 'transformer.layers.8.attention.qkv.act_scale', 'transformer.layers.15.attention.quantization_scaling_factor', 'transformer.layers.27.post_layernorm.scale_to_int', 'transformer.layers.7.mlp.proj.act_scale', 'transformer.layers.4.input_layernorm.scale_to_int', 'transformer.layers.0.post_layernorm.scale_to_int', 'transformer.layers.16.mlp.quantization_scaling_factor', 'transformer.layers.1.post_layernorm.scale_to_int', 'transformer.layers.20.mlp.quantization_scaling_factor', 'transformer.layers.16.attention.qkv.act_scale', 'transformer.layers.5.attention.dense.act_scale', 'transformer.layers.20.mlp.proj.act_scale', 'transformer.layers.21.attention.qkv.act_scale', 'transformer.layers.11.mlp.quantization_scaling_factor', 
'transformer.layers.0.attention.dense.act_scale', 'transformer.layers.25.mlp.quantization_scaling_factor', 'transformer.layers.18.mlp.proj.act_scale', 'transformer.layers.26.mlp.proj.act_scale', 'transformer.layers.5.mlp.quantization_scaling_factor', 'transformer.layers.20.mlp.fc.act_scale', 'transformer.layers.18.attention.qkv.act_scale', 'transformer.layers.16.attention.quantization_scaling_factor', 'transformer.layers.12.attention.dense.act_scale', 'transformer.layers.25.post_layernorm.scale_to_int', 'transformer.layers.8.attention.quantization_scaling_factor', 'transformer.layers.0.attention.qkv.act_scale', 'transformer.layers.10.post_layernorm.scale_to_int', 'transformer.layers.14.input_layernorm.scale_to_int', 'transformer.layers.19.post_layernorm.scale_to_int', 'transformer.layers.4.post_layernorm.scale_to_int', 'transformer.layers.27.input_layernorm.scale_to_int'}
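
The unexpected names above (act_scale, scale_to_int, quantization_scaling_factor) appear to be activation-quantization scales rather than plain weight-only scales, which may help whoever triages this. As a hedged check, the quantization section that convert_checkpoint.py writes into the checkpoint's config.json can be printed to confirm which mode the checkpoint actually encodes; the "quantization" key and the path below are assumptions based on the v0.8.0 checkpoint format.

    # Sketch: print the quantization block recorded in the converted
    # checkpoint's config.json (key name assumed from the v0.8.0 format).
    import json

    with open("int8-kv8/1-gpu/config.json") as f:  # illustrative path
        cfg = json.load(f)
    print(json.dumps(cfg.get("quantization", {}), indent=2))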

Additional notes

None
