
## About

These guides are meant to help you get started quickly with llama-swap, using ready-made configuration snippets.

> [!TIP]
> Looking for help with the Configuration page? It was written to be LLM-friendly. Try copy/pasting the example config into an LLM first to see if it can answer your question.
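
For orientation, here is a minimal sketch of what a llama-swap `config.yaml` can look like. The model name, file path, and `ttl` value are placeholder assumptions; adapt them to your setup and see the guides below for verified, model-specific configurations.

```yaml
# Minimal llama-swap config sketch (illustrative values, not a verified guide)
models:
  "qwen3-30b-a3b":
    # llama-swap substitutes ${PORT} with the port it assigns to this server
    cmd: llama-server --port ${PORT} -m /models/Qwen3-30B-A3B.gguf
    ttl: 300  # optional: unload the model after 300s of inactivity
```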

## Model Guides

| Company | Model | VRAM Requirement | Server | Notes | Link |
|---------|-------|------------------|--------|-------|------|
| BGE | reranker v2 m3 | 343MB | llama.cpp | v1/rerank API with llama-server | link |
| Google | Gemma 3 27B | 24GB to 27GB | llama.cpp | 100K context on single and dual 24GB GPUs | link |
| Meta | llama-3.3-70B | 55GB | llama.cpp | 13 to 20 tok/sec with 2x3090 and P40 for speculative decoding | link |
| Meta | llama4-scout | 68.62GB | llama.cpp | Fully loading Scout with 62K context onto 3x24GB GPUs | link |
| Mistral | Small 3.1 | 24GB | llama.cpp | Text and vision support, 32K context | link |
| Nomic-AI | nomic-embed-text v1.5 | 280MB | llama.cpp | v1/embeddings with llama-server | link |
| OpenAI | whisper-large-v3-turbo | 1.4GB | whisper.cpp | v1/audio/transcriptions speech-to-text with whisper.cpp | link |
| Qwen | qwen3-30b-a3b | 24GB | llama.cpp | 113 tok/s on a 3090 | link |
| Qwen | QwQ, Coder 32B | 24GB to 48GB | llama.cpp | Local copilot with Aider, QwQ, and Qwen2.5 Coder 32B | link |
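
Once a model from the table above is configured, llama-swap serves the OpenAI-compatible API and starts whichever backend matches the requested model name. A rough usage sketch, assuming llama-swap is listening on `localhost:8080` and your config has a model entry named `qwen3-30b-a3b`:

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-30b-a3b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```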

## Community Contributed Guides

> [!NOTE]
> These guides are contributed by community members and have not been verified by the llama-swap maintainers.

| Contributor | Guide |
|-------------|-------|
| @WesleyFister | docker-compose example with built-in config.yaml |
| @ramblingcoder | Docker-in-Docker setup with llama-swap |

## Use Case Guides
