Labels: KV-Cache Management (kv-cache management for efficient LLM inference), bug (Something isn't working)
Description
Edit: This issue originally contained what I thought was a solution to this bug (see the edit drop-down in the top right), but it turned out that I mistakenly had host_cache_size commented out when testing the changes that "fixed" the issue. So now I'm unsure what is causing it.
System Info
4xB200 using nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0
Reproduction
This is my config/command:
```bash
cat >/root/trtllm-config.yml<<EOF
stream_interval: 2
cuda_graph_config: null
enable_attention_dp: false
kv_cache_config:
  enable_partial_reuse: false
  enable_block_reuse: true
  dtype: fp8
  free_gpu_memory_fraction: 0.85
  host_cache_size: $(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits --id=0 | awk '{print int($1 * 1024 * 1024 * 0.8)}')
EOF

PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync CUDA_VISIBLE_DEVICES='0,1,2,3' trtllm-serve /root/data/nvidia___DeepSeek-R1-0528-FP4 --backend pytorch --tp_size 4 --trust_remote_code --extra_llm_api_options /root/trtllm-config.yml
```
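For reference, the host_cache_size line above works out to roughly 0.8 × the total memory of device 0, expressed in bytes (nvidia-smi reports memory.total in MiB, and the awk expression converts MiB to bytes and scales by 0.8). A quick sanity check of the value that ends up in the YAML (a sketch, not part of the original setup):

```bash
# Sanity-check what the config's host_cache_size expression evaluates to.
# nvidia-smi reports memory.total in MiB (with --format=...,nounits), so the
# awk expression below converts MiB to bytes and takes 80% of it.
total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits --id=0)
host_cache_bytes=$(awk -v m="$total_mib" 'BEGIN { printf "%d", m * 1024 * 1024 * 0.8 }')
echo "host_cache_size = ${host_cache_bytes} bytes (~$((host_cache_bytes / 1024 / 1024 / 1024)) GiB)"
```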
Then I hammer the server with about 10k requests from a chat dataset at ~300 concurrency and run a benchmark. Accuracy comes out much lower, and manually inspecting the failed answers shows that prefix caching is reusing the wrong cached context (i.e., the model is answering the wrong questions).
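The load has roughly the shape below; this is only a hedged sketch of the stress phase (the port, served model name, prompt text, and request count are illustrative assumptions, not the exact harness used here), after which the accuracy benchmark is re-run:

```bash
# Rough sketch of the stress phase: ~10k chat completions at ~300 concurrency
# against the OpenAI-compatible endpoint that trtllm-serve exposes.
# Port 8000, the model name, and the prompt text are assumptions for illustration.
seq 1 10000 | xargs -P 300 -I{} curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "DeepSeek-R1-0528-FP4", "messages": [{"role": "user", "content": "question {} from the chat dataset"}], "max_tokens": 256}' \
  -o /dev/null
# After the stress run, re-run the benchmark and compare answers against the
# expected ones; mismatches correspond to wrong cached prefixes being reused.
```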
If I remove host_cache_size, then there are no issues with the benchmark even after 100k requests.
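For comparison, this is the configuration that shows no regression: identical to the one above, with only the host_cache_size line removed:

```bash
cat >/root/trtllm-config.yml<<EOF
stream_interval: 2
cuda_graph_config: null
enable_attention_dp: false
kv_cache_config:
  enable_partial_reuse: false
  enable_block_reuse: true
  dtype: fp8
  free_gpu_memory_fraction: 0.85
EOF
```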