Description
System Info
GPU: A30
GPU memory: 24G
TensorRT-LLM: 0.9.0.dev2024040900
CUDA: 12.3
OS: Ubuntu 20.04
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I use a Medusa model whose base checkpoint has attn_bias=true, and I modified examples/medusa/convert_checkpoint.py so that the attention bias is handled during conversion.
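For reference, the attention bias shows up in the base checkpoint's Hugging Face config; a quick way to confirm it (assuming the standard Llama-style config.json layout, with the model path taken from the commands below) is:

import json
from pathlib import Path

# Path taken from the conversion command below; adjust to your own checkpoint.
model_dir = Path("/models/production/chat_legal_llama")

# Llama-style HF checkpoints expose the QKV bias switch as "attention_bias"
# in config.json (assumed layout).
config = json.loads((model_dir / "config.json").read_text())
print("attention_bias =", config.get("attention_bias", False))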
1. Convert model
python3 examples/medusa/convert_checkpoint.py --workers 16 --model_dir /models/production/chat_legal_llama --output_dir /models/medusa/chat_legal_medusa/tensorrt_llm/c-model --dtype float16 --tp_size 2 --pp_size 1 --medusa_model_dir /models/medusa/chat_legal_medusa --fixed_num_medusa_heads 5 --max_medusa_token_len 63
2. Build engine
trtllm-build --workers 16 --tp_size 2 --pp_size 1 --checkpoint_dir=/models/medusa/chat_legal_medusa/tensorrt_llm/c-model --output_dir=/models/medusa/chat_legal_medusa/tensorrt_llm/engine --use_custom_all_reduce disable --gemm_plugin float16 --gpt_attention_plugin float16 --use_paged_context_fmha enable --paged_kv_cache enable --remove_input_padding enable --context_fmha enable --multi_block_mode enable --max_batch_size 2 --max_beam_width 1 --max_input_len 4096 --max_output_len 1024
3. Deploy the model with Triton Inference Server
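I query the server through the ensemble model over the decoupled gRPC stream. Below is a minimal client sketch; the tensor names (text_input, max_tokens, stream, end_id, top_k, top_p, text_output) are assumptions based on the default tensorrtllm_backend ensemble config, and the prompt and values are illustrative, so adjust them if your config.pbtxt differs. The plain greedy, non-streaming path works; the failing variants under "actual behavior" reuse this helper.

import queue

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

# Assumes the Triton gRPC endpoint on the default port.
client = grpcclient.InferenceServerClient(url="localhost:8001")

def build_input(name, value, dtype):
    """Wrap a scalar into the [1, 1]-shaped tensor the ensemble expects."""
    data = np.array([[value]], dtype=dtype)
    tensor = grpcclient.InferInput(name, list(data.shape), np_to_triton_dtype(data.dtype))
    tensor.set_data_from_numpy(data)
    return tensor

def send_request(extra_inputs=(), stream=False):
    """Send one request over the decoupled gRPC stream and return every response."""
    inputs = [
        build_input("text_input", "What is force majeure?", np.object_),  # illustrative prompt
        build_input("max_tokens", 128, np.int32),
        build_input("stream", stream, bool),
    ] + list(extra_inputs)

    responses = queue.Queue()
    client.start_stream(callback=lambda result, error: responses.put(error if error else result))
    client.async_stream_infer("ensemble", inputs)
    client.stop_stream()  # returns after the responses for the enqueued request arrive

    outputs = []
    while not responses.empty():
        item = responses.get()
        if isinstance(item, Exception):
            raise item
        outputs.append(item.as_numpy("text_output"))
    return outputs

# Plain greedy, non-streaming request: this path returns correct results.
print(send_request())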
Expected behavior
The model returns correct results.
actual behavior
The server crashes when streaming is enabled, when generation stops early on end_id, or when a decoding mode with top_k or top_p is used.
1. Stopping early with end id
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: 0 <= acceptedTokensLen && acceptedTokensLen <= nextDraftTokensLen (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:1424)
1 0x7f40905b2a60 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 121
2 0x7f4034912362 /data01/kilm/users/chiendb/projects/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0xd9362) [0x7f4034912362]
3 0x7f4034abcdb4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less, std::allocator >&) + 36
4 0x7f4034ac4ee4 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 404
5 0x7f40ab1f1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f40ab1f1253]
6 0x7f40aaf80ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f40aaf80ac3]
7 0x7f40ab011a04 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 63083907: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: 0 <= acceptedTokensLen && acceptedTokensLen <= nextDraftTokensLen (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:1424)
1 0x7f40905b2a60 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 121
2 0x7f4034912362 /data01/kilm/users/chiendb/projects/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0xd9362) [0x7f4034912362]
3 0x7f4034abcdb4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less, std::allocator >&) + 36
4 0x7f4034ac4ee4 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 404
5 0x7f40ab1f1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f40ab1f1253]
6 0x7f40aaf80ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f40aaf80ac3]
7 0x7f40ab011a04 clone + 68
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: 0 <= acceptedTokensLen && acceptedTokensLen <= nextDraftTokensLen (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:1424)
1 0x7f8bfc4e3a60 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 121
2 0x7f8ba0912362 /projects/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0xd9362) [0x7f8ba0912362]
3 0x7f8ba0abcdb4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less, std::allocator >&) + 36
4 0x7f8ba0ac4ee4 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 404
5 0x7f8c0fff1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f8c0fff1253]
6 0x7f8c0fd80ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8c0fd80ac3]
7 0x7f8c0fe11a04 clone + 68
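This crash is reproduced by adding nothing more than an explicit end id to the client helper sketched under Reproduction (the value 2 is the usual Llama EOS id and is purely illustrative):

# Continuing the client sketch above: an explicit end id lets generation stop
# early, which is what hits the acceptedTokensLen assertion on my setup.
print(send_request([build_input("end_id", 2, np.int32)]))  # 2: illustrative EOS id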
2. Using streaming
I0415 02:59:52.989804 11244 stream_infer_handler.cc:155] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step WRITTEN
I0415 02:59:52.989807 11244 infer_handler.h:1305] Returning from ModelStreamInferHandler, 0, ISSUED
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: newSize <= getCapacity() (/projects/TensorRT-LLM/cpp/tensorrt_llm/runtime/bufferView.h:83)
1 0x7f083839ba60 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 121
2 0x7f07dca3730e virtual thunk to tensorrt_llm::runtime::TensorView::reshape(nvinfer1::Dims32 const&) + 366
3 0x7f07dca382a3 virtual thunk to tensorrt_llm::runtime::TensorView::resize(unsigned long) + 147
4 0x7f07dcabd2e1 tensorrt_llm::batch_manager::GptManager::returnCompletedRequests() + 1297
5 0x7f07dcac4f11 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 449
6 0x7f084adf1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f084adf1253]
7 0x7f084ab80ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f084ab80ac3]
8 0x7f084ac11a04 clone + 68
what(): [TensorRT-LLM][ERROR] Assertion failed: newSize <= getCapacity() (/projects/TensorRT-LLM/cpp/tensorrt_llm/runtime/bufferView.h:83)
1 0x7f0ce6459a60 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 121
2 0x7f0c84a3730e virtual thunk to tensorrt_llm::runtime::TensorView::reshape(nvinfer1::Dims32 const&) + 366
3 0x7f0c84a382a3 virtual thunk to tensorrt_llm::runtime::TensorView::resize(unsigned long) + 147
4 0x7f0c84abd2e1 tensorrt_llm::batch_manager::GptManager::returnCompletedRequests() + 1297
5 0x7f0c84ac4f11 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 449
6 0x7f0cfa7f1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f0cfa7f1253]
7 0x7f0cfa580ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f0cfa580ac3]
8 0x7f0cfa611a04 clone + 68
Signal (6) received.
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
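The streaming crash comes from the same client helper with token-by-token streaming turned on:

# Continuing the client sketch above: stream tokens back as they are generated.
for chunk in send_request(stream=True):
    print(chunk)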
3. Using decoding mode with top_k or top_p
Received an error from server:
in ensemble 'ensemble', Failed to process the request(s) for model instance 'postprocessing_0_0', message: TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
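And the sampling failure is hit by adding the top_k / top_p tensors to the same request (tensor names and values are again assumptions about the default ensemble config):

# Continuing the client sketch above: enable top-k / top-p sampling.
print(send_request([
    build_input("top_k", 40, np.int32),
    build_input("top_p", 0.95, np.float32),
]))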
additional notes
When I use the script https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/run.py, the model returns correct results.