Linux
- Download the release files for your OS from the llama.cpp releases page (or build from source).
- Add the bin folder to PATH so that llama-server is globally available (see the example below).
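A minimal sketch of adding the binaries to PATH, assuming the release archive was extracted to `~/llama.cpp` (adjust the path to wherever you actually extracted it):

```bash
# Assumption: the llama.cpp release was extracted to ~/llama.cpp
echo 'export PATH="$HOME/llama.cpp/bin:$PATH"' >> ~/.bashrc
# Reload the shell configuration so the new PATH takes effect
source ~/.bashrc
# Verify that the binary is now found
which llama-server
```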
The configurations below are kept for reference, but it is now easier to simply add a model from the menu and select it.
Used for
- code completion
LLM type
- FIM (fill in the middle)
Instructions
CPU only:
`llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF --port 8012 -ub 512 -b 512 --ctx-size 0 --cache-reuse 256`
With Nvidia GPUs and CUDA drivers installed:
- more than 16GB VRAM
`llama-server --fim-qwen-7b-default -ngl 99`
- less than 16GB VRAM
`llama-server --fim-qwen-3b-default -ngl 99`
- less than 8GB VRAM
`llama-server --fim-qwen-1.5b-default -ngl 99`
If the model file is not available locally (the first time), it will be downloaded (this can take some time), and after that the llama.cpp server will be started.
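To confirm the completion server is up, you can query its health endpoint (assuming the default port 8012 from the commands above):

```bash
# Returns {"status":"ok"} once the model is loaded and the server is ready
curl http://localhost:8012/health
```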
Used for
- Chat with AI
- Chat with AI with project context
- Edit with AI
- Generate commit message
LLM type
- Chat Models
Instructions
Same as the code completion server, but with a chat model and slightly different parameters.
CPU only:
`llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011 -np 2`
With Nvidia GPUs and CUDA drivers installed:
- more than 16GB VRAM
`llama-server -hf ggml-org/Qwen2.5-Coder-7B-Instruct-Q8_0-GGUF --port 8011 -np 2`
- less than 16GB VRAM
`llama-server -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF --port 8011 -np 2`
- less than 8GB VRAM
`llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011 -np 2`
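Once the chat server is running, it exposes an OpenAI-compatible API; a quick smoke test with curl (assuming port 8011 as in the commands above) might look like this:

```bash
# Send a single chat message to the local llama.cpp server
curl http://localhost:8011/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Write a hello world program in Python."}
        ]
      }'
```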
Used for
- Chat with AI with project context
LLM type
- Embedding
Instructions
Same as the code completion server, but with an embeddings model and slightly different parameters.
`llama-server -hf ggml-org/Nomic-Embed-Text-V2-GGUF --port 8010 -ub 2048 -b 2048 --ctx-size 2048 --embeddings`
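To verify the embeddings server, you can request an embedding via the OpenAI-compatible endpoint (assuming port 8010 as above; the exact response shape depends on the llama.cpp version):

```bash
# Request an embedding vector for a short text
curl http://localhost:8010/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world"}'
```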