Code Llama 70b triton crashes with XQA #1256

@phind-justin

Description

System Info

x86
8x h100 80g
v0.8.0

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

python convert_checkpoint.py --model_dir /mnt/nfsfast/codellama-70b \
                            --output_dir /mnt/nfsfast/models-0.8.0/codellama-70b-base \
                            --dtype bfloat16 \
                            --rotary_base 1000000 \
                            --vocab_size 32016 \
                            --tp_size 8 \
                            --use_parallel_embedding


trtllm-build --checkpoint_dir /mnt/nfsfast/models-0.8.0/codellama-70b \
             --output_dir /mnt/nfsfast/models-0.8.0/trt_engines/codellama-70b/8-gpu/ \
             --gemm_plugin bfloat16 \
             --max_input_len 13848 \
             --max_output_len 2536 \
             --max_batch_size 32 \
             --workers 8 \
             --remove_input_padding enable \
             --gpt_attention_plugin bfloat16 \
             --context_fmha enable \
             --paged_kv_cache enable \
             --multi_block_mode enable \
             --use_custom_all_reduce enable \
             --bert_attention_plugin disable \
             --enable_xqa enable
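As a sanity check after building, the flags relevant to this crash (XQA, multi-block mode, paged KV cache) can be read back out of the `config.json` that `trtllm-build` writes into `--output_dir`. This is a minimal sketch; the exact key layout varies by TensorRT-LLM version, so the key names below are assumptions, and the sample dict is a hand-written stand-in rather than a real engine config.

```python
import json

def attention_flags(config: dict) -> dict:
    """Pull the attention-related plugin flags out of an engine config dict.

    Assumes a v0.8-style layout with a top-level "plugin_config" section;
    adjust the key names for your TensorRT-LLM version.
    """
    plugin_cfg = config.get("plugin_config", {})
    return {k: plugin_cfg.get(k)
            for k in ("enable_xqa", "multi_block_mode", "paged_kv_cache")}

# Hand-written stand-in for the engine's config.json:
sample = json.loads(
    '{"plugin_config": {"enable_xqa": true,'
    ' "multi_block_mode": true, "paged_kv_cache": true}}'
)
print(attention_flags(sample))
# {'enable_xqa': True, 'multi_block_mode': True, 'paged_kv_cache': True}
```

In practice you would `json.load` the real `config.json` from the engine directory instead of the `sample` dict.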

I then use Triton to serve the model and send a few inference requests:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 mpirun -n 8 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=...

It works for the first few inference requests, especially when only 1-2 arrive at a time. But after a few requests, or if we send 4 concurrently, it crashes.
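The concurrent requests that trigger the crash can be reproduced with a small client script. This is a sketch under assumptions: it targets the standard `tensorrtllm_backend` ensemble endpoint (`/v2/models/ensemble/generate`) with its usual `text_input`/`max_tokens` fields, but the model name, port, and payload schema depend on the model repository, so treat those as hypothetical.

```python
import json
import concurrent.futures
import urllib.error
import urllib.request

# Assumed endpoint for the default tensorrtllm_backend ensemble; adjust the
# model name and port to match your Triton model repository.
TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"

def make_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Build a generate-endpoint request body (assumed field names)."""
    return {"text_input": prompt, "max_tokens": max_tokens}

def send_one(prompt: str) -> str:
    """POST one generation request and return the raw response body."""
    req = urllib.request.Request(
        TRITON_URL,
        data=json.dumps(make_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    # Fire 4 requests at once; the crash reportedly appears at this concurrency.
    prompts = [f"def fib_{i}(n):" for i in range(4)]
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
            for result in pool.map(send_one, prompts):
                print(result[:200])
    except (urllib.error.URLError, OSError) as exc:
        print(f"request failed: {exc}")
```

Raising `max_workers` (and the number of prompts) toward the engine's `--max_batch_size 32` should make the failure easier to hit if it is concurrency-dependent.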

Expected behavior

It runs successfully and does not crash for batch sizes up to 32.

Actual behavior

It crashes with an XQA error after a few inferences. Log attached:

codellama_70b_xqa.log

Additional notes

.

Labels

bug (Something isn't working), triaged (Issue has been triaged by maintainers)
