---
sidebar_position: 2
sidebar_label: Inference Simulator
---
# vLLM Simulator
To help with development and testing we have developed a lightweight vLLM simulator. It does not truly
run inference, but it does emulate responses to the HTTP REST endpoints of vLLM.
Currently it supports a partial OpenAI-compatible API:
- /v1/chat/completions
- /v1/completions
- /v1/models

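For example, once the simulator is running (see the running instructions below), a chat completion can be requested with a plain `curl` call. This is only a sketch: the port and model name assume the Docker example later on this page.

```bash
# Assumes the simulator is listening on port 8000 and was started with
# --model "Qwen/Qwen2.5-1.5B-Instruct" (see the Running sections below).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```
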
In addition, it supports a subset of vLLM's Prometheus metrics, exposed via the /metrics HTTP REST endpoint. Currently the following metrics are supported:
- vllm:lora_requests_info

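As a quick check, the metrics endpoint can be scraped directly; a minimal sketch, assuming the simulator listens on port 8000:

```bash
# Assumes the simulator is listening on port 8000.
curl -s http://localhost:8000/metrics | grep vllm:lora_requests_info
```
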
The simulated inference has no connection to the model and LoRA adapters specified in the command line parameters. The /v1/models endpoint returns simulated results based on those same command line parameters.

The simulator supports two modes of operation:
- `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message with role=`user` is used.
- `random` mode: the response is chosen at random from a set of pre-defined sentences.

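To illustrate the difference, here is a sketch that starts the simulator in `echo` mode (using the flags documented under command line parameters below) and sends a completion request; the model name is a placeholder.

```bash
# In one terminal: start the simulator in echo mode (placeholder model name).
./bin/llm-d-inference-sim --model my_model --port 8000 --mode echo

# In another terminal: the returned text should repeat the prompt;
# with --mode random it would instead be one of the pre-defined sentences.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model", "prompt": "The quick brown fox"}'
```
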
Timing of the response is defined by two parameters: `time-to-first-token` and `inter-token-latency`.

For a request with `stream=true`: `time-to-first-token` defines the delay before the first token is returned, and `inter-token-latency` defines the delay between subsequent tokens in the stream.

For a request with `stream=false`: the response is returned after a delay of `<time-to-first-token> + (<inter-token-latency> * (<number_of_output_tokens> - 1))`.

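For example, with `time-to-first-token=100`, `inter-token-latency=20`, and a response of 5 output tokens, a non-streaming response would be returned after approximately `100 + (20 * (5 - 1)) = 180` milliseconds.
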
The simulator can be run standalone or in a Pod for testing under tools such as Kind.

## Limitations
API responses contain a subset of the fields provided by the OpenAI API.

<details>
  <summary>Click to show the structure of requests/responses</summary>

- `/v1/chat/completions`
  - **request**
    - stream
    - model
    - messages
      - role
      - content
  - **response**
    - id
    - created
    - model
    - choices
      - index
      - finish_reason
      - message
- `/v1/completions`
  - **request**
    - stream
    - model
    - prompt
    - max_tokens (for future usage)
  - **response**
    - id
    - created
    - model
    - choices
      - text
- `/v1/models`
  - **response**
    - object (list)
    - data
      - id
      - object (model)
      - created
      - owned_by
      - root
      - parent
</details>
<br/>
For more details see the [vLLM documentation](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#openai-completions-api-with-vllm).

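As an illustration of these fields, a streaming chat request can be issued as shown in the sketch below (the model name is a placeholder and the port assumes the examples below). With `stream: true` the choices are returned as a stream of chunks, paced by the timing parameters described above.

```bash
# Placeholder model name; assumes the simulator is listening on port 8000.
# -N disables curl's output buffering so streamed chunks appear as they arrive.
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my_model",
        "stream": true,
        "messages": [{"role": "user", "content": "Tell me something"}]
      }'
```
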
## Command line parameters
- `port`: the port the simulator listens on, mandatory
- `model`: the currently 'loaded' model, mandatory
- `lora`: a list of available LoRA adapters, separated by commas, optional, by default empty
- `mode`: the simulator mode, optional, by default `random`
  - `echo`: returns the same text that was sent in the request
  - `random`: returns a sentence chosen at random from a set of pre-defined sentences
- `time-to-first-token`: the time to the first token (in milliseconds), optional, by default zero
- `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, by default zero
- `max-loras`: maximum number of LoRAs in a single batch, optional, default is one
- `max-cpu-loras`: maximum number of LoRAs to store in CPU memory, optional, must be greater than or equal to `max-loras`, default is `max-loras`
- `max-running-requests`: maximum number of inference requests that can be processed at the same time

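Putting these parameters together, an invocation might look like the following sketch; the model and LoRA adapter names are only examples, and the numeric values are arbitrary:

```bash
./bin/llm-d-inference-sim \
  --port 8000 \
  --model "Qwen/Qwen2.5-1.5B-Instruct" \
  --lora "tweet-summary-0,tweet-summary-1" \
  --mode random \
  --time-to-first-token 100 \
  --inter-token-latency 20 \
  --max-loras 2 \
  --max-cpu-loras 4 \
  --max-running-requests 8
```
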
## Working with the Docker image

### Building
To build a Docker image of the vLLM Simulator, run:
```bash
make build-llm-d-inference-sim-image
```

### Running
To run the vLLM Simulator image under Docker, run:
```bash
docker run --rm --publish 8000:8000 ai-aware-router/llm-d-inference-sim:0.0.1 /ai-aware-router/llm-d-inference-sim --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora "tweet-summary-0,tweet-summary-1"
```
**Note:** The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model.

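Once the container is up, a quick way to verify that the configured model and LoRA adapters are reported is to query the models endpoint (the output shape follows the `/v1/models` structure listed above):

```bash
curl -s http://localhost:8000/v1/models
```
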
## Standalone testing

### Building
To build the vLLM simulator, run:
```bash
make build-llm-d-inference-sim
```

### Running
To run the simulator in a standalone test environment, run:
```bash
./bin/llm-d-inference-sim --model my_model --port 8000
```