
Setup llama.cpp server for Linux

  1. Download the release files for your OS from the llama.cpp releases page (or build from source).
  2. Add the bin folder to your PATH so that llama-server is globally available (see the example below).
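
One possible way to do this, assuming a bash shell and that the release was extracted to ~/llama.cpp (the path is only a placeholder; use the folder you actually extracted to):

`echo 'export PATH="$PATH:$HOME/llama.cpp/bin"' >> ~/.bashrc && source ~/.bashrc`  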

The configurations below are kept for reference, but there is now an easier way: add a model from the menu and select it.

Code completion server

Used for
- code completion

LLM type
- FIM (fill in the middle)

Instructions

CPU only

`llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF --port 8012 -ub 512 -b 512 --ctx-size 0 --cache-reuse 256`  

With Nvidia GPUs and CUDA drivers installed

  • more than 16GB VRAM
`llama-server --fim-qwen-7b-default -ngl 99`  
  • less than 16GB VRAM
`llama-server --fim-qwen-3b-default -ngl 99`  
  • less than 8GB VRAM
`llama-server --fim-qwen-1.5b-default -ngl 99`  

If the model file is not available locally (the first time), it will be downloaded (this could take some time), and afterwards the llama.cpp server will start.
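
Once the server is running, a quick sanity check (assuming the default host and the port 8012 used above) is to query the llama-server health endpoint, which returns a small JSON status when the model is loaded:

`curl http://127.0.0.1:8012/health`  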

Chat server

Used for
- Chat with AI
- Chat with AI with project context
- Edit with AI
- Generate commit message

LLM type
- Chat Models

Instructions
Same as for the code completion server, but use a chat model and slightly different parameters.

CPU-only:

`llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011 -np 2`  

With Nvidia GPUs and CUDA drivers installed

  • more than 16GB VRAM
`llama-server -hf ggml-org/Qwen2.5-Coder-7B-Instruct-Q8_0-GGUF --port 8011 -np 2`  
  • less than 16GB VRAM
`llama-server -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF --port 8011 -np 2`  
  • less than 8GB VRAM
`llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011 -np 2`  
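
To confirm the chat server answers requests, you can call its OpenAI-compatible chat completions endpoint (assuming the default host and the port 8011 used in the commands above):

`curl http://127.0.0.1:8011/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}]}'`  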

Embeddings server

Used for
- Chat with AI with project context

LLM type
- Embedding

Instructions
Same as for the code completion server, but use an embeddings model and slightly different parameters.

`llama-server -hf ggml-org/Nomic-Embed-Text-V2-GGUF --port 8010 -ub 2048 -b 2048 --ctx-size 2048 --embeddings`  
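
A quick check that the embeddings server works (assuming the default host and the port 8010 used above) is an OpenAI-compatible embeddings request:

`curl http://127.0.0.1:8010/v1/embeddings -H "Content-Type: application/json" -d '{"input": "hello world"}'`  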