
Set up the llama.cpp server for Linux

Code completion server

Used for
- code completion

LLM type
- FIM (fill in the middle)

Instructions

  1. Download the release files for your OS from the llama.cpp releases page (or build from source).
  2. Download the LLM model and run the llama.cpp server (combined in one command):

CPU only

llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF --port 8012 -ub 512 -b 512 --ctx-size 0 --cache-reuse 256

With an Nvidia GPU and CUDA drivers installed

  • more than 16GB VRAM
    llama-server --fim-qwen-7b-default
  • less than 16GB VRAM
    llama-server --fim-qwen-3b-default
  • less than 8GB VRAM
    llama-server --fim-qwen-1.5b-default
    If the model file is not available locally (the first time), it will be downloaded (this can take some time); after that the llama.cpp server starts. You can verify the server as shown below.
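
To check that the completion server is running, you can query it directly. This is a minimal sketch: /health and /infill are endpoints exposed by the llama.cpp server, but the exact /infill request fields may differ between versions; adjust the port if you changed it.

curl http://localhost:8012/health

curl http://localhost:8012/infill -H "Content-Type: application/json" -d '{"input_prefix": "def add(a, b):\n    return ", "input_suffix": "\n"}'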

Chat server

Used for
- Chat with AI
- Chat with AI with project context
- Edit with AI
- Generate commit message

LLM type
- Chat Models

Instructions
Same as for the code completion server, but use a chat model and slightly different parameters.

CPU only
llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011

With an Nvidia GPU and CUDA drivers installed

  • more than 16GB VRAM
    llama-server -hf ggml-org/Qwen2.5-Coder-7B-Instruct-Q8_0-GGUF --port 8011
  • less than 16GB VRAM
    llama-server -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF --port 8011
  • less than 8GB VRAM
    llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011
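
To verify the chat server, you can send a test request to its OpenAI-compatible endpoint (a minimal example; /v1/chat/completions is provided by the llama.cpp server, and the model name can be omitted since the server already has the model loaded):

curl http://localhost:8011/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Write a hello world program in Python"}]}'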

Embeddings server

Used for
- Chat with AI with project context

LLM type
- Embedding

Instructions
Same as for the code completion server, but use an embeddings model and slightly different parameters.
llama-server -hf ggml-org/Nomic-Embed-Text-V2-GGUF --port 8010 -ub 2048 -b 2048 --ctx-size 2048 --embeddings
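
To verify the embeddings server, you can send a test request to its OpenAI-compatible embeddings endpoint (a minimal example, assuming the default /v1/embeddings route enabled by the --embeddings flag; the response contains an embedding vector for the input text):

curl http://localhost:8010/v1/embeddings -H "Content-Type: application/json" -d '{"input": "Hello world"}'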
