Description
Motivation.
Disaggregated prefilling/decoding is expected to improve LLM inference performance, for example on long-document workloads. #5557 proposes a good paradigm.
In addition, the Transfer Engine of Mooncake, a KVCache-centric disaggregated architecture for LLM serving, has been open-sourced.
Compared with NCCL, Mooncake Transfer Engine has the following features:
- a unified programming interface for DRAM-to-DRAM (both local and remote), DRAM-to-GPU VRAM (both local and remote), and DRAM-to-remote NVMe data transfers
- support for the TCP, RDMA, and NVMe-oF protocols
- topology-aware path selection (see our English doc, transfer_engine.md), aggregating bandwidth from multiple NICs
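
To make the last point concrete, here is a minimal sketch of what topology-aware path selection with multi-NIC bandwidth aggregation could look like. It is not Transfer Engine's API or its actual algorithm: the `Nic` type, the `select_nics` and `split_transfer` helpers, the device names, and the NUMA-locality heuristic are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nic:
    """Hypothetical description of a NIC; not a Mooncake type."""
    name: str
    numa_node: int
    bandwidth_gbps: float


def select_nics(nics, buffer_numa_node):
    """Prefer NICs on the same NUMA node as the buffer.

    Falls back to every NIC when none is local (assumed heuristic).
    """
    local = [n for n in nics if n.numa_node == buffer_numa_node]
    return local if local else list(nics)


def split_transfer(total_bytes, nics):
    """Split a transfer across NICs in proportion to their bandwidth."""
    if not nics:
        raise ValueError("at least one NIC is required")
    total_bw = sum(n.bandwidth_gbps for n in nics)
    plan = {}
    assigned = 0
    for nic in nics[:-1]:
        share = int(total_bytes * nic.bandwidth_gbps / total_bw)
        plan[nic.name] = share
        assigned += share
    # The last NIC takes the remainder so the slices sum to total_bytes.
    plan[nics[-1].name] = total_bytes - assigned
    return plan


# Example topology (made-up device names and speeds).
nics = [
    Nic("mlx5_0", numa_node=0, bandwidth_gbps=200.0),
    Nic("mlx5_1", numa_node=0, bandwidth_gbps=200.0),
    Nic("mlx5_2", numa_node=1, bandwidth_gbps=200.0),
]
chosen = select_nics(nics, buffer_numa_node=0)
plan = split_transfer(1_000_000, chosen)
```

In this sketch a buffer on NUMA node 0 is sliced across the two NICs local to that node, so a single logical transfer uses the bandwidth of both instead of being pinned to one device.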
Proposed Change.
The plan is to integrate vLLM with Mooncake. As a first step, we have implemented a prototype that replaces NCCL with Transfer Engine in the data plane. In the future, we plan to develop Mooncake Store to fully support disaggregated prefilling (M prefill & N decode) and make it ready for production. Mooncake's architecture is here.
Feel free to try our prototype and comment on our design!
Feedback Period.
Several weeks
CC List.
@ShangmingCai @stmatengss @james0zan
Any Other Things.
No response