
[Bug]: Incorrect KV cache retrieval when host/secondary cache is used #8274

@josephrocca

Description


Edit: This issue originally contained what I thought was a fix for this bug (see the edit drop-down in the top right), but it turned out I had mistakenly commented out host_cache_size while testing the changes that appeared to "fix" the issue. So I'm now unsure what is causing it.


System Info

4xB200 using nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0

Reproduction

This is my config/command:

cat >/root/trtllm-config.yml<<EOF
stream_interval: 2
cuda_graph_config: null
enable_attention_dp: false
kv_cache_config:
  enable_partial_reuse: false
  enable_block_reuse: true
  dtype: fp8
  free_gpu_memory_fraction: 0.85
  host_cache_size: $(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits --id=0 | awk '{print int($1 * 1024 * 1024 * 0.8)}')
EOF
PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync CUDA_VISIBLE_DEVICES='0,1,2,3' trtllm-serve /root/data/nvidia___DeepSeek-R1-0528-FP4 --backend pytorch --tp_size 4 --trust_remote_code --extra_llm_api_options /root/trtllm-config.yml
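For reference, the awk expression in the config above sets host_cache_size (in bytes) to 80% of GPU 0's total memory, which nvidia-smi reports in MiB. A minimal Python sketch of the same arithmetic, using a hypothetical card reporting 184320 MiB:

```python
# Mirrors: nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits --id=0 \
#          | awk '{print int($1 * 1024 * 1024 * 0.8)}'
# nvidia-smi reports memory.total in MiB; the awk expression converts MiB to bytes
# and takes 80% of it as the host (CPU-side) KV cache size.

def host_cache_size_bytes(total_mib: int, fraction: float = 0.8) -> int:
    """80% of total GPU memory, converted from MiB to bytes (truncated like awk's int())."""
    return int(total_mib * 1024 * 1024 * fraction)

# Hypothetical card reporting 184320 MiB of total memory:
print(host_cache_size_bytes(184320))  # 154618822656 bytes = 144 GiB
```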

Then I hammer the server with ~10k requests from a chat dataset at ~300 concurrency and run a benchmark afterwards. Accuracy is much lower, and manually inspecting the failed answers shows that the wrong context is being retrieved from the prefix cache (i.e. the model is answering the wrong questions).

If I remove host_cache_size, the benchmark shows no issues even after 100k requests.

Metadata

Labels

KV-Cache Management (kv-cache management for efficient LLM inference), bug (Something isn't working)
