Labels: KV-Cache Management (kv-cache management for efficient LLM inference), bug (Something isn't working)
Description
Edit: This issue originally contained what I thought was a solution to this bug (see the edit drop-down in the top right), but it turned out that I mistakenly had host_cache_size commented out when testing the changes that "fixed" the issue. So now I'm unsure what is causing it.
System Info
4xB200 using nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0
Reproduction
This is my config/command:
```bash
cat >/root/trtllm-config.yml<<EOF
stream_interval: 2
cuda_graph_config: null
enable_attention_dp: false
kv_cache_config:
  enable_partial_reuse: false
  enable_block_reuse: true
  dtype: fp8
  free_gpu_memory_fraction: 0.85
  host_cache_size: $(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits --id=0 | awk '{print int($1 * 1024 * 1024 * 0.8)}')
EOF

PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync CUDA_VISIBLE_DEVICES='0,1,2,3' trtllm-serve /root/data/nvidia___DeepSeek-R1-0528-FP4 --backend pytorch --tp_size 4 --trust_remote_code --extra_llm_api_options /root/trtllm-config.yml
```
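For reference, the host_cache_size line above works out to roughly 0.8 × the total memory of device 0, expressed in bytes (nvidia-smi reports memory.total in MiB, and the awk expression converts MiB to bytes and scales by 0.8). A quick sanity check of the value that ends up in the YAML (a sketch, not part of the original setup):

```bash
# Sanity-check what the config's host_cache_size expression evaluates to.
# nvidia-smi reports memory.total in MiB (with --format=...,nounits), so the
# awk expression below converts MiB to bytes and takes 80% of it.
total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits --id=0)
host_cache_bytes=$(awk -v m="$total_mib" 'BEGIN { printf "%d", m * 1024 * 1024 * 0.8 }')
echo "host_cache_size = ${host_cache_bytes} bytes (~$((host_cache_bytes / 1024 / 1024 / 1024)) GiB)"
```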
Then I hammer the server with about 10k requests from a chat dataset at ~300 concurrency and run a benchmark. Accuracy comes out much lower, and manually inspecting the failed answers shows that prefix caching is reusing the wrong cached context (i.e., the model is answering the wrong questions).
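The load has roughly the shape below; this is only a hedged sketch of the stress phase (the port, served model name, prompt text, and request count are illustrative assumptions, not the exact harness used here), after which the accuracy benchmark is re-run:

```bash
# Rough sketch of the stress phase: ~10k chat completions at ~300 concurrency
# against the OpenAI-compatible endpoint that trtllm-serve exposes.
# Port 8000, the model name, and the prompt text are assumptions for illustration.
seq 1 10000 | xargs -P 300 -I{} curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "DeepSeek-R1-0528-FP4", "messages": [{"role": "user", "content": "question {} from the chat dataset"}], "max_tokens": 256}' \
  -o /dev/null
# After the stress run, re-run the benchmark and compare answers against the
# expected ones; mismatches correspond to wrong cached prefixes being reused.
```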
If I remove host_cache_size, then there are no issues with the benchmark even after 100k requests.
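For comparison, this is the configuration that shows no regression: identical to the one above, with only the host_cache_size line removed:

```bash
cat >/root/trtllm-config.yml<<EOF
stream_interval: 2
cuda_graph_config: null
enable_attention_dp: false
kv_cache_config:
  enable_partial_reuse: false
  enable_block_reuse: true
  dtype: fp8
  free_gpu_memory_fraction: 0.85
EOF
```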