[RFC]: Implement disaggregated prefilling using Mooncake

### Motivation.

Disaggregated prefilling/decoding is expected to achieve better performance in LLM inference, for example on long-document workloads. [#5557](https://github.com/vllm-project/vllm/issues/5557) proposes a good paradigm for it.

In addition, the Transfer Engine of [Mooncake](https://github.com/kvcache-ai/mooncake), a KVCache-centric disaggregated architecture for LLM serving, has been open-sourced.

Compared with NCCL, Mooncake Transfer Engine has the following features:
- a unified programming interface for DRAM-to-DRAM transfers (both local and remote), DRAM-to-GPU VRAM transfers (both local and remote), and DRAM-to-remote NVMe transfers
- support for the TCP, RDMA, and NVMe-oF protocols
- topology-aware path selection that aggregates bandwidth from multiple NICs (see our English doc, transfer_engine.md)
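
To make the last point concrete, here is a minimal sketch of what topology-aware path selection with bandwidth aggregation could look like. It is not Mooncake's implementation or API: the `Nic` type, the `select_nics` and `split_transfer` helpers, the NUMA-based preference rule, and the device names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nic:
    """Hypothetical description of one NIC (not a Mooncake type)."""
    name: str
    numa_node: int
    gbps: int


def select_nics(nics, buffer_numa_node):
    """Prefer NICs on the same NUMA node as the buffer; else use all NICs."""
    local = [n for n in nics if n.numa_node == buffer_numa_node]
    return local or list(nics)


def split_transfer(total_bytes, nics):
    """Split one transfer into (nic, offset, length) slices,
    sized in proportion to each NIC's link bandwidth."""
    total_gbps = sum(n.gbps for n in nics)
    slices, offset = [], 0
    for i, nic in enumerate(nics):
        if i == len(nics) - 1:
            # The last NIC takes the remainder so the slices cover every byte.
            length = total_bytes - offset
        else:
            length = total_bytes * nic.gbps // total_gbps
        slices.append((nic.name, offset, length))
        offset += length
    return slices


# Illustrative topology: two NICs on NUMA node 0, one slower NIC on node 1.
nics = [Nic("nic0", 0, 200), Nic("nic1", 0, 200), Nic("nic2", 1, 100)]
plan = split_transfer(1000, select_nics(nics, buffer_numa_node=0))
```

In this sketch a buffer on NUMA node 0 is striped evenly over the two local NICs, while a buffer with no local NIC falls back to all three, weighted by bandwidth.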

### Proposed Change.

The plan is to integrate vLLM with Mooncake. As a first step, we have implemented a prototype that replaces NCCL with Transfer Engine in the data plane. In the future, we plan to develop Mooncake Store to fully support disaggregated prefilling (M prefill & N decode instances) and make it production-ready. Mooncake's architecture is described [here](https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/architecture.md).
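
As a rough illustration of what "replacing the data plane" means, the sketch below hides the KV cache transfer behind a small interface so that the backend can be swapped without touching prefill or decode logic. Every name here (`KVTransport`, `InMemoryTransport`, `prefill`, `decode`) is hypothetical; none of it is the vLLM or Mooncake API, and the in-memory backend merely stands in for a real NCCL- or Transfer-Engine-based one.

```python
from abc import ABC, abstractmethod


class KVTransport(ABC):
    """Hypothetical data-plane interface between prefill and decode instances."""

    @abstractmethod
    def put(self, request_id: str, kv_bytes: bytes) -> None:
        """Publish the KV cache produced by prefill for one request."""

    @abstractmethod
    def get(self, request_id: str) -> bytes:
        """Fetch (and consume) the KV cache for one request on the decode side."""


class InMemoryTransport(KVTransport):
    """Stand-in backend; a real one would move tensors over the network."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def put(self, request_id: str, kv_bytes: bytes) -> None:
        self._store[request_id] = kv_bytes

    def get(self, request_id: str) -> bytes:
        return self._store.pop(request_id)


def prefill(transport: KVTransport, request_id: str, prompt: str) -> None:
    # Placeholder for the real prefill pass: the "KV cache" is just the prompt bytes.
    transport.put(request_id, prompt.encode("utf-8"))


def decode(transport: KVTransport, request_id: str) -> int:
    # Placeholder for the real decode loop: report how many KV bytes arrived.
    return len(transport.get(request_id))


transport = InMemoryTransport()
prefill(transport, "req-0", "hello")
received = decode(transport, "req-0")
```

Because prefill and decode depend only on the interface, any of M prefill instances can hand a request to any of N decode instances as long as both sides share a transport keyed by request ID.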

Feel free to try our prototype and comment on our design!

### Feedback Period.

Several weeks



### CC List.

@ShangmingCai @stmatengss @james0zan

### Any Other Things.

_No response_

### Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[RFC]: Implement disaggregated prefilling using Mooncake #10727
