
Commit 2c95e06

Dev kpr -- Updating the Quickstart instructions and fixing links (#15)
1 parent 29eb71d commit 2c95e06

File tree

13 files changed: +323 -218 lines changed


blog/2025-05-07_welcome_llmd.md

Lines changed: 0 additions & 14 deletions
This file was deleted.

blog/2025-05-19.md

Lines changed: 0 additions & 13 deletions
This file was deleted.

blog/authors.yml

Lines changed: 22 additions & 18 deletions
@@ -1,22 +1,26 @@
-Huey:
-  name: Huw
-  title: The Nephew in Red
-
-Dewey:
-  name: Dewydd
-  title: The one in Blue
+redhat:
+  name: RedHat
+  url: https://redhat.com
+  image_url: https://www.redhat.com/rhdc/managed-files/Logo-Red_Hat-Hat_icon-Standard-RGB.svg


-Louie:
-  name: Lewellyn
-  title: That one in green
+robshaw:
+  name: Robert Shaw
+  title: Director of Engineering, Red Hat
+  url: https://github.com/robertgshaw2-redhat
+  image_url: https://avatars.githubusercontent.com/u/114415538?v=4
+

-kahuna:
-  name: Big kahuna
-  title: The one in charge
+smarterclayton:
+  name: Clayton Coleman
+  title: Distinguished Engineer, Google
+  url: https://github.com/smarterclayton
+  image_url: https://avatars.githubusercontent.com/u/1163175?v=4
+

-redhat-author:
-  name: RedHat
-  title: One of the sponsors
-  url: https://redhat.com
-  image_url: https://www.redhat.com/rhdc/managed-files/Logo-Red_Hat-Hat_icon-Standard-RGB.svg
+chcost:
+  name: Carlos Costa
+  title: Distinguished Engineer, IBM
+  url: https://github.com/chcost
+  image_url: https://avatars.githubusercontent.com/u/26551701?v=4
+

blog/tags.yml

Lines changed: 8 additions & 2 deletions
@@ -19,7 +19,7 @@ llm-d:
   description: llm-d tag description

 news:
-  label: News Releases!
+  label: News Releases
   permalink: /news-releases
   description: Used for "official" news releases in the blog

@@ -34,6 +34,12 @@ hola:
   description: Hola tag description

 blog:
-  label: just a blog
+  label: blog posts
   permalink: /blog
   description: everyday blog posts
+
+
+announce:
+  label: Announcements
+  permalink: /announce
+  description: Announcements that aren't news releases

docs/architecture/00_architecture.md

Lines changed: 3 additions & 2 deletions
@@ -3,7 +3,7 @@ sidebar_position: 0
 label: llm-d Architecture
 ---
 # Overview of llm-d architecture
-`llm-d` is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.
+`llm-d` is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve large language models at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

 With `llm-d`, users can operationalize GenAI deployments with a modular solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in [Inference Gateway (IGW)](https://github.com/kubernetes-sigs/gateway-api-inference-extension).

@@ -14,7 +14,7 @@ Built by leaders in the Kubernetes and vLLM projects, `llm-d` is a community-dri
 `llm-d` adopts a layered architecture on top of industry-standard open technologies: vLLM, Kubernetes, and Inference Gateway.


-![llm-d Architecture](../assets/images/llm-d-arch.svg)
+![llm-d Architecture](../assets/images/llm-d-arch-simplified.svg)



@@ -31,6 +31,7 @@ Key features of `llm-d` include:
 - **Variant Autoscaling over Hardware, Workload, and Traffic** (🚧): We plan to implement a traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derive a load function that takes into account different request shapes and QoS, and (c) assesses recent traffic mix (QPS, QoS, and shapes)
 Using the recent traffic mix to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency. [See our Northstar design](https://docs.google.com/document/d/1inTneLEZTv3rDEBB9KLOB9K6oMq8c3jkogARJqdt_58/edit?tab=t.0)

+For more, see the [project proposal](https://github.com/llm-d/llm-d/blob/dev/docs/proposals/llm-d.md)

 ## Getting Started

Lines changed: 120 additions & 0 deletions (new file)

---
sidebar_position: 2
sidebar_label: Inference Simulator
---
# vLLM Simulator
To help with development and testing, we have developed a lightweight vLLM simulator. It does not truly run inference, but it does emulate responses to the HTTP REST endpoints of vLLM.
Currently it supports a partial OpenAI-compatible API:
- /v1/chat/completions
- /v1/completions
- /v1/models

In addition, it supports a subset of vLLM's Prometheus metrics. These metrics are exposed via the /metrics HTTP REST endpoint. Currently supported are the following metrics:
- vllm:lora_requests_info

The simulated inference has no connection with the model and LoRA adapters specified in the command line parameters. The /v1/models endpoint returns simulated results based on those same command line parameters.
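For a quick smoke test, the read-only endpoints can be exercised with plain `curl`; this sketch assumes the simulator is already running and listening on `localhost:8000`, as in the run examples further down:

```bash
# List the simulated model and LoRA adapters (reflects the --model and --lora flags).
curl http://localhost:8000/v1/models

# Scrape the currently supported Prometheus metrics.
curl http://localhost:8000/metrics
```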
The simulator supports two modes of operation:
- `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message for the role=`user` is used.
- `random` mode: the response is randomly chosen from a set of pre-defined sentences.

Timing of the response is defined by two parameters: `time-to-first-token` and `inter-token-latency`.

For a request with `stream=true`: `time-to-first-token` defines the delay before the first token is returned, and `inter-token-latency` defines the delay between subsequent tokens in the stream.

For a request with `stream=false`: the response is returned after a delay of `<time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1))`.
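For example (illustrative numbers only): with `time-to-first-token=1000`, `inter-token-latency=100`, and a response of 10 output tokens, a non-streaming request returns after roughly `1000 + 100 * (10 - 1) = 1900` milliseconds.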
It can be run standalone or in a Pod, for example for testing under tools such as Kind.

## Limitations
API responses contain a subset of the fields provided by the OpenAI API.

<details>
<summary>Click to show the structure of requests/responses</summary>

- `/v1/chat/completions`
  - **request**
    - stream
    - model
    - messages
      - role
      - content
  - **response**
    - id
    - created
    - model
    - choices
      - index
      - finish_reason
      - message
- `/v1/completions`
  - **request**
    - stream
    - model
    - prompt
    - max_tokens (for future usage)
  - **response**
    - id
    - created
    - model
    - choices
      - text
- `/v1/models`
  - **response**
    - object (list)
    - data
      - id
      - object (model)
      - created
      - owned_by
      - root
      - parent
</details>
<br/>
For more details see the [vLLM documentation](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#openai-completions-api-with-vllm)
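As a concrete illustration, a `/v1/chat/completions` request using only the fields listed above might look like the following (assuming the simulator was started with `--model "Qwen/Qwen2.5-1.5B-Instruct"` and listens on `localhost:8000`):

```bash
# In echo mode the simulated reply repeats the last user message;
# in random mode it returns one of the pre-defined sentences.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "stream": false,
        "messages": [{"role": "user", "content": "Hello, simulator!"}]
      }'
```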
## Command line parameters
- `port`: the port the simulator listens on, mandatory
- `model`: the currently 'loaded' model, mandatory
- `lora`: a list of available LoRA adapters, separated by commas, optional, by default empty
- `mode`: the simulator mode, optional, by default `random`
  - `echo`: returns the same text that was sent in the request
  - `random`: returns a sentence chosen at random from a set of pre-defined sentences
- `time-to-first-token`: the time to the first token (in milliseconds), optional, by default zero
- `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, by default zero
- `max-loras`: maximum number of LoRAs in a single batch, optional, default is one
- `max-cpu-loras`: maximum number of LoRAs to store in CPU memory, optional, must be >= `max-loras`, default is `max-loras`
- `max-running-requests`: maximum number of inference requests that can be processed at the same time
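Putting several of these together, a hypothetical invocation might look like the sketch below; it assumes each parameter is passed as a `--<name> <value>` flag, as in the run examples that follow:

```bash
# Echo mode with simulated latencies: 1s to the first token, 100ms per additional token.
./bin/llm-d-inference-sim \
  --port 8000 \
  --model "Qwen/Qwen2.5-1.5B-Instruct" \
  --lora "tweet-summary-0,tweet-summary-1" \
  --mode echo \
  --time-to-first-token 1000 \
  --inter-token-latency 100
```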
## Working with the Docker image

### Building
To build a Docker image of the vLLM Simulator, run:
```bash
make build-llm-d-inference-sim-image
```

### Running
To run the vLLM Simulator image under Docker, run:
```bash
docker run --rm --publish 8000:8000 ai-aware-router/llm-d-inference-sim:0.0.1 /ai-aware-router/llm-d-inference-sim --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora "tweet-summary-0,tweet-summary-1"
```
**Note:** The above command exposes the simulator on port 8000 and serves the Qwen/Qwen2.5-1.5B-Instruct model.
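Once the container is up, it can be queried like any OpenAI-compatible server; for example, a non-streaming `/v1/completions` request (the prompt is arbitrary, and the reply is simulated rather than real inference):

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "prompt": "Hello", "stream": false}'
```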
## Standalone testing

### Building
To build the vLLM simulator, run:
```bash
make build-llm-d-inference-sim
```

### Running
To run the simulator in a standalone test environment, run:
```bash
./bin/llm-d-inference-sim --model my_model --port 8000
```

docs/assets/images/llm-d-arch-simplified.svg

Lines changed: 1 addition & 0 deletions
