server : Support multimodal completion and embeddings prompts in JSON format
- Use server_tokens in more places in server and util.cpp
- Convert most functions that used llama_tokens to server_tokens
- Modify input tokenizer to handle JSON objects as subprompts
- Break out MTMD prompt parsing into utility function
- Support JSON objects with `multimodal_data` arrays for MTMD prompts alongside the other existing prompt types
- Add capability to model endpoint to indicate if client can send multimodal data
- Add tests
Changed file: `tools/server/README.md` (11 additions, 6 deletions)
@@ -226,6 +226,10 @@ services:

### Multimodal support

Multimodal support was added in [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) and is currently an experimental feature.

+ It is currently available in the following endpoints:
+ - The OAI-compatible chat endpoint.
+ - The non-OAI-compatible completions endpoint.
+ - The non-OAI-compatible embeddings endpoint.

For more details, please refer to the [multimodal documentation](../../docs/multimodal.md)
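The endpoint list above pairs with the PR's other change: the models endpoint now indicates whether the client may send multimodal data. A minimal client-side sketch of that gate is shown below; note that the response shape (a `data` list with a per-model `capabilities` array) and the model id are hypothetical assumptions for illustration, not the confirmed schema — check the actual `/v1/models` output.

```python
def supports_multimodal(models_response: dict, model_id: str) -> bool:
    """Return True if the given model advertises the multimodal capability.

    NOTE: the response shape assumed here (top-level "data" list, per-model
    "capabilities" array) is a hypothetical sketch, not the confirmed schema.
    """
    for model in models_response.get("data", []):
        if model.get("id") == model_id:
            return "multimodal" in model.get("capabilities", [])
    return False

# Hypothetical /v1/models response, for illustration only.
example = {"data": [{"id": "example-model", "capabilities": ["completion", "multimodal"]}]}
print(supports_multimodal(example, "example-model"))  # True
```

A client would call this once at startup and refuse to attach `multimodal_data` to any request when it returns `False`.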
@@ -400,12 +404,15 @@ These input shapes and data types are allowed for `prompt`:

- Single string: `"string"`
- Single sequence of tokens: `[12, 34, 56]`
- Mixed tokens and strings: `[12, 34, "string", 56, 78]`
+ - A JSON object which optionally contains multimodal data: `{ "prompt_string": "string", "multimodal_data": ["base64"] }`

Multiple prompts are also supported. In this case, the completion result will be an array.

- Only strings: `["string1", "string2"]`
- Strings and sequences of tokens: `["string1", [12, 34, 56]]`

+ Note on `multimodal_data` in JSON object prompts: this should be an array of strings containing base64-encoded multimodal data such as images and audio. The string prompt element must contain an identical number of MTMD media markers, which act as placeholders for the data provided in this parameter; the multimodal data files are substituted in order. The marker string (e.g. `<__media__>`) can be found by calling `mtmd_default_marker()`, defined in [the MTMD C API](https://github.com/ggml-org/llama.cpp/blob/5fd160bbd9d70b94b5b11b0001fd7f477005e4a0/tools/mtmd/mtmd.h#L87). A client *must not* send this field unless the server has the multimodal capability, so clients should check `/models` or `/v1/models` for the `multimodal` capability before sending a multimodal request.
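The JSON-object prompt shape above can be assembled client-side as follows. The field names (`prompt_string`, `multimodal_data`) and the default marker come from this README; the validation helper itself is a hypothetical sketch, not part of the server API.

```python
import base64

# Default marker string; the authoritative value comes from
# mtmd_default_marker() in the MTMD C API.
MEDIA_MARKER = "<__media__>"

def build_multimodal_prompt(prompt_string: str, media_files: list[bytes]) -> dict:
    """Build the JSON-object prompt described above.

    The number of media markers in the prompt must equal the number of
    entries in multimodal_data; files are substituted in marker order.
    """
    n_markers = prompt_string.count(MEDIA_MARKER)
    if n_markers != len(media_files):
        raise ValueError(
            f"prompt has {n_markers} media markers but {len(media_files)} media entries"
        )
    return {
        "prompt_string": prompt_string,
        "multimodal_data": [base64.b64encode(f).decode("ascii") for f in media_files],
    }

payload = {
    "prompt": build_multimodal_prompt(f"Describe this image: {MEDIA_MARKER}", [b"<raw image bytes>"]),
    "n_predict": 32,
}
```

The resulting `payload` dict would then be POSTed as JSON to the completions endpoint.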
`temperature`: Adjust the randomness of the generated text. Default: `0.8`
@@ -477,8 +484,6 @@ These words will not be included in the completion, so make sure to add them to

`t_max_predict_ms`: Set a time limit in milliseconds for the prediction (a.k.a. text-generation) phase. The timeout will trigger if the generation takes more than the specified time (measured since the first token was generated) and if a new-line character has already been generated. Useful for FIM applications. Default: `0`, which is disabled.

- `image_data`: An array of objects to hold base64-encoded image `data` and its `id`s to be reference in `prompt`. You can determine the place of the image in the prompt as in the following: `USER:[img-12]Describe the image in detail.\nASSISTANT:`. In this case, `[img-12]` will be replaced by the embeddings of the image with id `12` in the following `image_data` array: `{..., "image_data": [{"data": "<BASE64_STRING>", "id": 12}]}`. Use `image_data` only with multimodal models, e.g., LLaVA.

`id_slot`: Assign the completion task to a specific slot. If `-1`, the task will be assigned to an idle slot. Default: `-1`

`cache_prompt`: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation) enabling this option can cause nondeterministic results. Default: `true`
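The savings from `cache_prompt` come from the shared token prefix between consecutive requests: only tokens after the first mismatch need to be re-processed. A toy illustration of that prefix matching (not the server's actual implementation):

```python
def common_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Length of the shared token prefix; tokens past it must be re-evaluated."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5]      # tokens already in the KV cache
incoming = [1, 2, 3, 9, 9, 9] # tokens of the new request
reused = common_prefix_len(cached, incoming)
print(reused)                  # 3 tokens reused from the KV cache
print(len(incoming) - reused)  # 3 tokens still need prompt processing
```

This is why requests that share a long system prompt benefit most from the option.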
@@ -638,12 +643,12 @@ Returns a JSON object with a field `prompt` containing a string of the input messages

The same as [the embedding example](../embedding) does.

+ This endpoint also supports multimodal embeddings. See the documentation for [completions prompts](../completions) for details on how to send a multimodal prompt.

*Options:*

`content`: Set the text to process.

- `image_data`: An array of objects to hold base64-encoded image `data` and its `id`s to be reference in `content`. You can determine the place of the image in the content as in the following: `Image: [img-21].\nCaption: This is a picture of a house`. In this case, `[img-21]` will be replaced by the embeddings of the image with id `21` in the following `image_data` array: `{..., "image_data": [{"data": "<BASE64_STRING>", "id": 21}]}`. Use `image_data` only with multimodal models, e.g., LLaVA.

`embd_normalize`: Normalization for pooled embeddings. Can be one of the following values:
"prompt": { JSON_PROMPT_STRING_KEY: "I believe the meaning of life is <__media__>", JSON_MULTIMODAL_KEY: "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNk+A8AAQUBAScY42YAAAAASUVORK5CYII=" },
252
+
"seed": 42,
253
+
"temperature": 1.0,
254
+
"cache_prompt": False,
255
+
})
256
+
# MTMD is disabled on this model, so this should fail.