---
sidebar_position: 2
sidebar_label: Inference Simulator
---
# vLLM Simulator
To help with development and testing we have developed a lightweight vLLM simulator. It does not truly
run inference, but it does emulate responses to the HTTP REST endpoints of vLLM.
Currently it supports a partial OpenAI-compatible API:
- /v1/chat/completions
- /v1/completions
- /v1/models

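For example, once the simulator is running (see the running instructions below), a chat completion can be requested with a plain `curl` call. This is only a sketch: the port and model name assume the Docker example later on this page.

```bash
# Assumes the simulator is listening on port 8000 and was started with
# --model "Qwen/Qwen2.5-1.5B-Instruct" (see the Running sections below).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```
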
In addition, it supports a subset of vLLM's Prometheus metrics, exposed via the /metrics HTTP REST endpoint. Currently the following metrics are supported:
- vllm:lora_requests_info

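As a quick check, the metrics endpoint can be scraped directly; a minimal sketch, assuming the simulator listens on port 8000:

```bash
# Assumes the simulator is listening on port 8000.
curl -s http://localhost:8000/metrics | grep vllm:lora_requests_info
```
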
The simulated inference has no connection to the model and LoRA adapters specified in the command line parameters. The /v1/models endpoint returns simulated results based on those same command line parameters.

The simulator supports two modes of operation:
- `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message with role=`user` is used.
- `random` mode: the response is chosen at random from a set of pre-defined sentences.

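To illustrate the difference, here is a sketch that starts the simulator in `echo` mode (using the flags documented under command line parameters below) and sends a completion request; the model name is a placeholder.

```bash
# In one terminal: start the simulator in echo mode (placeholder model name).
./bin/llm-d-inference-sim --model my_model --port 8000 --mode echo

# In another terminal: the returned text should repeat the prompt;
# with --mode random it would instead be one of the pre-defined sentences.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model", "prompt": "The quick brown fox"}'
```
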
Timing of the response is defined by two parameters: `time-to-first-token` and `inter-token-latency`.

For a request with `stream=true`: `time-to-first-token` defines the delay before the first token is returned, and `inter-token-latency` defines the delay between subsequent tokens in the stream.

For a request with `stream=false`: the response is returned after a delay of `<time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1))`.

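For example, with `time-to-first-token=100`, `inter-token-latency=20`, and a response of 5 output tokens, a non-streaming response would be returned after approximately `100 + (20 * (5 - 1)) = 180` milliseconds.
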
The simulator can be run standalone or in a Pod for testing under tools such as Kind.

## Limitations
API responses contain a subset of the fields provided by the OpenAI API.

<details>
  <summary>Click to show the structure of requests/responses</summary>

- `/v1/chat/completions`
  - **request**
    - stream
    - model
    - messages
      - role
      - content
  - **response**
    - id
    - created
    - model
    - choices
      - index
      - finish_reason
      - message
- `/v1/completions`
  - **request**
    - stream
    - model
    - prompt
    - max_tokens (for future usage)
  - **response**
    - id
    - created
    - model
    - choices
      - text
- `/v1/models`
  - **response**
    - object (list)
    - data
      - id
      - object (model)
      - created
      - owned_by
      - root
      - parent
</details>
<br/>
For more details see the [vLLM documentation](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#openai-completions-api-with-vllm).

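As an illustration of these fields, a streaming chat request can be issued as shown in the sketch below (the model name is a placeholder and the port assumes the examples below). With `stream: true` the choices are returned as a stream of chunks, paced by the timing parameters described above.

```bash
# Placeholder model name; assumes the simulator is listening on port 8000.
# -N disables curl's output buffering so streamed chunks appear as they arrive.
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my_model",
        "stream": true,
        "messages": [{"role": "user", "content": "Tell me something"}]
      }'
```
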
## Command line parameters
- `port`: the port the simulator listens on, mandatory
- `model`: the currently 'loaded' model, mandatory
- `lora`: a list of available LoRA adapters, separated by commas, optional, by default empty
- `mode`: the simulator mode, optional, by default `random`
  - `echo`: returns the same text that was sent in the request
  - `random`: returns a sentence chosen at random from a set of pre-defined sentences
- `time-to-first-token`: the time to the first token (in milliseconds), optional, by default zero
- `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, by default zero
- `max-loras`: maximum number of LoRAs in a single batch, optional, default is one
- `max-cpu-loras`: maximum number of LoRAs to store in CPU memory, optional, must be greater than or equal to `max-loras`, default is `max-loras`
- `max-running-requests`: maximum number of inference requests that can be processed at the same time

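Putting these parameters together, an invocation might look like the following sketch; the model and LoRA adapter names are only examples, and the numeric values are arbitrary:

```bash
./bin/llm-d-inference-sim \
  --port 8000 \
  --model "Qwen/Qwen2.5-1.5B-Instruct" \
  --lora "tweet-summary-0,tweet-summary-1" \
  --mode random \
  --time-to-first-token 100 \
  --inter-token-latency 20 \
  --max-loras 2 \
  --max-cpu-loras 4 \
  --max-running-requests 8
```
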
## Working with the Docker image

### Building
To build a Docker image of the vLLM Simulator, run:
```bash
make build-llm-d-inference-sim-image
```

### Running
To run the vLLM Simulator image under Docker, run:
```bash
docker run --rm --publish 8000:8000 ai-aware-router/llm-d-inference-sim:0.0.1 /ai-aware-router/llm-d-inference-sim --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora "tweet-summary-0,tweet-summary-1"
```
**Note:** The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model.

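Once the container is up, a quick way to verify that the configured model and LoRA adapters are reported is to query the models endpoint (the output shape follows the `/v1/models` structure listed above):

```bash
curl -s http://localhost:8000/v1/models
```
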
## Standalone testing

### Building
To build the vLLM simulator, run:
```bash
make build-llm-d-inference-sim
```

### Running
To run the simulator in a standalone test environment, run:
```bash
./bin/llm-d-inference-sim --model my_model --port 8000
```