
Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:182) #1552

@sleepwalker2017

Description


System Info

GPU: 2× A30; TRT-LLM branch: main; commit id: 66ef1df

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

MODEL_CHECKPOINT=/data/models/vicuna-7b-v1.5/
CONVERTED_CHECKPOINT=Llama-7b-hf-ckpt

DTYPE=float16
TP=2

echo "step 1: convert checkpoint"
# Build lora enabled engine
python convert_checkpoint.py --model_dir ${MODEL_CHECKPOINT} \
    --output_dir ${CONVERTED_CHECKPOINT} \
    --dtype ${DTYPE} \
    --tp_size ${TP} \
    --pp_size 1

SOURCE_LORA=/data/Llama2-Chinese-7b-Chat-LoRA/
#SOURCE_LORA=/data/llama2-7b-lora.tar.gz
CPP_LORA=chinese-llama-2-lora-7b-cpp

EG_DIR=/tmp/lora-eg

PP=1
MAX_LEN=1024
MAX_BATCH=16
TOKENIZER=/data/models/vicuna-7b-v1.5/
LORA_ENGINE=Llama-2-7b-hf-engine
NUM_LORAS=(8)
NUM_REQUESTS=200

echo "step 2: trtllm-build"
trtllm-build \
    --checkpoint_dir ${CONVERTED_CHECKPOINT} \
    --output_dir ${LORA_ENGINE} \
    --max_batch_size ${MAX_BATCH} \
    --max_input_len $MAX_LEN \
    --max_output_len $MAX_LEN \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --gemm_plugin float16 \
    --lora_plugin float16 \
    --use_paged_context_fmha enable \
    --use_custom_all_reduce disable \
    --lora_target_modules attn_qkv attn_dense mlp_h_to_4h mlp_gate mlp_4h_to_h
echo "step 3: Convert LoRA to cpp format"
# Convert LoRA to cpp format
python ../hf_lora_convert.py \
    -i $SOURCE_LORA \
    --storage-type $DTYPE \
    -o $CPP_LORA

echo "step 4: prepare dataset for non-lora requests"
mkdir -p $EG_DIR/data
python ../../benchmarks/cpp/prepare_dataset.py \
    --output ${EG_DIR}/data/token-norm-dist.json \
    --request-rate -1 \
    --time-delay-dist constant \
    --tokenizer $TOKENIZER \
    token-norm-dist \
    --num-requests $NUM_REQUESTS \
    --input-mean 256 --input-stdev 16 --output-mean 128 --output-stdev 24

echo "step 5: prepare dataset for lora requests"
for nloras in ${NUM_LORAS[@]}; do
    python ../../benchmarks/cpp/prepare_dataset.py \
        --output "${EG_DIR}/data/token-norm-dist-lora-${nloras}.json" \
        --request-rate -1 \
        --time-delay-dist constant \
        --rand-task-id 0 $(( $nloras - 1 )) \
        --tokenizer $TOKENIZER \
        token-norm-dist \
        --num-requests $NUM_REQUESTS \
        --input-mean 256 --input-stdev 16 --output-mean 128 --output-stdev 24
done
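The LoRA dataset above tags each request with a random task id in [0, 7], and every distinct task id must have adapter weights available to the server, otherwise the PEFT cache lookup fails with exactly the assertion shown below. A quick way to list which ids a dataset references, under an assumed prepare_dataset.py output schema (a top-level "samples" list whose entries carry a "task_id"; verify against your version), demonstrated here on a synthetic two-sample file so the sketch runs standalone:

```shell
# Hypothetical sanity check: extract the distinct LoRA task ids a dataset
# will request. The schema (a "samples" list with per-entry "task_id") is an
# assumption about prepare_dataset.py output at this commit; a synthetic
# two-sample file stands in for the real token-norm-dist-lora-8.json here.
DEMO_DIR=/tmp/lora-eg-demo
mkdir -p "${DEMO_DIR}"
cat > "${DEMO_DIR}/dataset.json" <<'EOF'
{"samples": [{"input_ids": [1, 2, 3], "output_len": 8, "task_id": 0},
             {"input_ids": [4, 5], "output_len": 8, "task_id": 7}]}
EOF
python3 - "${DEMO_DIR}/dataset.json" <<'PY'
import json, sys

with open(sys.argv[1]) as f:
    samples = json.load(f)["samples"]
# Entries without a task_id default to -1 (no LoRA) in this sketch.
ids = sorted({s.get("task_id", -1) for s in samples})
print("task ids referenced:", ids)
PY
```

Running the same check against the real dataset files shows which adapters the benchmark expects to find.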

mkdir -p ${EG_DIR}/log-base-lora

NUM_LAYERS=32
NUM_LORA_MODS=8
MAX_LORA_RANK=8
EOS_ID=-1
mpirun -n ${TP} --allow-run-as-root --output-filename ${EG_DIR}/log-base-lora \
    ../../cpp/build/benchmarks/gptManagerBenchmark \
    --engine_dir $LORA_ENGINE \
    --type IFB \
    --dataset "${EG_DIR}/data/token-norm-dist-lora-8.json" \
    --lora_host_cache_bytes 8589934592 \
    --lora_num_device_mod_layers $(( 8 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
    --kv_cache_free_gpu_mem_fraction 0.80 \
    --log_level info \
    --eos_id ${EOS_ID}
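Note that the invocation above sizes the LoRA caches but never hands any adapter weights to gptManagerBenchmark, while the dataset requests task ids 0 through 7; that mismatch is exactly what the assertion below complains about. The repo's LoRA benchmark example lays the converted adapter out once per task id and passes the directory to the benchmark. A hedged sketch of that step (the `--lora_dir` flag name and per-task-id layout are assumed from the example docs; confirm with `gptManagerBenchmark --help`, and the /tmp fallback paths exist only so the sketch runs standalone):

```shell
# Hypothetical missing step, adapted from the repo's LoRA benchmark example:
# lay the converted adapter out once per task id referenced by the dataset,
# then point gptManagerBenchmark at that directory.
EG_DIR="${EG_DIR:-/tmp/lora-eg}"
CPP_LORA="${CPP_LORA:-/tmp/lora-eg-demo-cpp-lora}"  # hf_lora_convert.py output
NUM_LORAS=8   # must cover every id produced by --rand-task-id 0 7

mkdir -p "${CPP_LORA}"        # standalone fallback; the real dir holds the weights
mkdir -p "${EG_DIR}/loras"
for (( i = 0; i < NUM_LORAS; i++ )); do
    rm -rf "${EG_DIR}/loras/${i}"
    cp -r "${CPP_LORA}" "${EG_DIR}/loras/${i}"
done
# Then append to the mpirun command above (flag name assumed, verify locally):
#   --lora_dir "${EG_DIR}/loras"
```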

Expected behavior

The gptManagerBenchmark run completes successfully against the LoRA dataset.

Actual behavior

[TensorRT-LLM][ERROR] Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:182)
1       0x5572c6dedde9 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f56c6cd5378 /data/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x69c378) [0x7f56c6cd5378]
3       0x7f56c8c3f03f tensorrt_llm::batch_manager::TrtGptModelInflightBatching::updatePeftCache(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 127
4       0x7f56c8c03078 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 1464
5       0x7f56c8c0342a tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 170
6       0x7f56c64dd253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f56c64dd253]
7       0x7f56c624cac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f56c624cac3]
8       0x7f56c62de850 /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f56c62de850]

Additional notes

None

Metadata

Labels: bug (Something isn't working), stale, triaged (Issue has been triaged by maintainers)
