Description
System Info
- CPU architecture: x86
- GPU: 8x NVIDIA H100 80GB
- TensorRT-LLM version: v0.8.0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
python convert_checkpoint.py --model_dir /mnt/nfsfast/codellama-70b \
--output_dir /mnt/nfsfast/models-0.8.0/codellama-70b-base \
--dtype bfloat16 \
--rotary_base 1000000 \
--vocab_size 32016 \
--tp_size 8 \
--use_parallel_embedding
trtllm-build --checkpoint_dir /mnt/nfsfast/models-0.8.0/codellama-70b-base \
--output_dir /mnt/nfsfast/models-0.8.0/trt_engines/codellama-70b/8-gpu/ \
--gemm_plugin bfloat16 \
--max_input_len 13848 \
--max_output_len 2536 \
--max_batch_size 32 \
--workers 8 \
--remove_input_padding enable \
--gpt_attention_plugin bfloat16 \
--context_fmha enable \
--paged_kv_cache enable \
--multi_block_mode enable \
--use_custom_all_reduce enable \
--bert_attention_plugin disable \
--enable_xqa enable
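To rule Triton out, the engine can first be exercised directly with TensorRT-LLM's example runner. This is a minimal sanity-check sketch, not part of the original report; the run.py location and the tokenizer directory are assumptions and may need adjusting:

```bash
# Hedged sanity check: run the 8-way TP engine directly via the example
# runner before involving Triton. The script path and tokenizer directory
# below are assumptions.
mpirun -n 8 --allow-run-as-root \
  python3 examples/run.py \
    --engine_dir /mnt/nfsfast/models-0.8.0/trt_engines/codellama-70b/8-gpu/ \
    --tokenizer_dir /mnt/nfsfast/codellama-70b \
    --max_output_len 128 \
    --input_text "def fibonacci(n):"
```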
I then serve the model with Triton and send a few inference requests:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 mpirun -n 8 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=...
Inference appears to work at first, especially with only 1-2 requests in flight, but after a few inferences, or when we send 4 requests concurrently, it fails.
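A minimal way to reproduce the concurrent load is sketched below; the model name ("ensemble") and HTTP port 8000 are assumptions based on the default tensorrtllm_backend setup and may differ in your model repository:

```bash
# Fire 4 generate requests at once; the crash reportedly appears at this
# concurrency. Model name and port are assumptions, adjust as needed.
for i in 1 2 3 4; do
  curl -s http://localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "def quicksort(arr):", "max_tokens": 256}' &
done
wait
```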
Expected behavior
It runs successfully and does not crash for batch sizes up to 32.
Actual behavior
It crashes with an XQA error after a few inferences.
Additional notes
None.