Description
System Info
- CPU architecture: x86
- GPU: 8x NVIDIA H100 80GB
- TensorRT-LLM version: v0.8.0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
python convert_checkpoint.py --model_dir /mnt/nfsfast/codellama-70b \
--output_dir /mnt/nfsfast/models-0.8.0/codellama-70b-base \
--dtype bfloat16 \
--rotary_base 1000000 \
--vocab_size 32016 \
--tp_size 8 \
--use_parallel_embedding
trtllm-build --checkpoint_dir /mnt/nfsfast/models-0.8.0/codellama-70b-base \
--output_dir /mnt/nfsfast/models-0.8.0/trt_engines/codellama-70b/8-gpu/ \
--gemm_plugin bfloat16 \
--max_input_len 13848 \
--max_output_len 2536 \
--max_batch_size 32 \
--workers 8 \
--remove_input_padding enable \
--gpt_attention_plugin bfloat16 \
--context_fmha enable \
--paged_kv_cache enable \
--multi_block_mode enable \
--use_custom_all_reduce enable \
--bert_attention_plugin disable \
--enable_xqa enable
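To rule Triton out, the engine can first be exercised directly with TensorRT-LLM's example runner. This is a minimal sanity-check sketch, not part of the original report; the run.py location and the tokenizer directory are assumptions and may need adjusting:

```bash
# Hedged sanity check: run the 8-way TP engine directly via the example
# runner before involving Triton. The script path and tokenizer directory
# below are assumptions.
mpirun -n 8 --allow-run-as-root \
  python3 examples/run.py \
    --engine_dir /mnt/nfsfast/models-0.8.0/trt_engines/codellama-70b/8-gpu/ \
    --tokenizer_dir /mnt/nfsfast/codellama-70b \
    --max_output_len 128 \
    --input_text "def fibonacci(n):"
```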
I then serve the model with Triton and send a few inference requests:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 mpirun -n 8 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=...
Inference appears to work at first, especially with only 1-2 requests in flight, but after a few inferences, or when we send 4 requests concurrently, it fails.
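A minimal way to reproduce the concurrent load is sketched below; the model name ("ensemble") and HTTP port 8000 are assumptions based on the default tensorrtllm_backend setup and may differ in your model repository:

```bash
# Fire 4 generate requests at once; the crash reportedly appears at this
# concurrency. Model name and port are assumptions, adjust as needed.
for i in 1 2 3 4; do
  curl -s http://localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "def quicksort(arr):", "max_tokens": 256}' &
done
wait
```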
Expected behavior
It runs successfully and does not crash for batch sizes up to 32.
Actual behavior
It crashes with an XQA error after a few inferences.
Additional notes
None.