Linux
Used for
- code completion
LLM type
- FIM (fill in the middle)
Instructions
- Download the release files for your OS from llama.cpp releases, or build from source (see the sketch after this list).
- Download the LLM model and run the llama.cpp server (both combined in one command below).
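If you prefer to build from source instead of downloading a release, the following is a minimal sketch of the standard llama.cpp CMake build (the CUDA flag is only needed for Nvidia GPUs):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build        # add -DGGML_CUDA=ON to enable Nvidia GPU support
cmake --build build --config Release -j
# the resulting binary is build/bin/llama-server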
CPU only:
llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF --port 8012 -ub 512 -b 512 --ctx-size 0 --cache-reuse 256
With an Nvidia GPU and CUDA drivers installed:
- more than 16GB VRAM
llama-server --fim-qwen-7b-default
- less than 16GB VRAM
llama-server --fim-qwen-3b-default
- less than 8GB VRAM
llama-server --fim-qwen-1.5b-default
If the model file is not available locally (first run), it is downloaded first (this may take some time); after that, the llama.cpp server starts.
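Once the server is running, you can sanity-check it from another terminal. The /health and /infill endpoints below are part of the upstream llama.cpp server; the port matches the commands above, and the example prompt is only an illustration:

# should return {"status":"ok"} once the model has finished loading
curl http://127.0.0.1:8012/health

# minimal FIM request against the /infill endpoint
curl http://127.0.0.1:8012/infill -d '{"input_prefix": "def add(a, b):\n    return ", "input_suffix": "\n", "n_predict": 16}'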
Used for
- Chat with AI
- Chat with AI with project context
- Edit with AI
- Generate commit message
LLM type
- Chat Models
Instructions
Same as the code completion server, but using a chat model and slightly different parameters.
CPU only:
llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011
With an Nvidia GPU and CUDA drivers installed:
- more than 16GB VRAM
llama-server -hf ggml-org/Qwen2.5-Coder-7B-Instruct-Q8_0-GGUF --port 8011
- less than 16GB VRAM
llama-server -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF --port 8011
- less than 8GB VRAM
llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Instruct-Q8_0-GGUF --port 8011
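To verify the chat server, you can send a minimal request to its OpenAI-compatible chat endpoint (a sketch assuming the default localhost binding and port 8011 used above; the server only serves the model it was started with, so no model field is needed):

curl http://127.0.0.1:8011/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "Write a hello world program in C"}]
}'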
Used for
- Chat with AI with project context
LLM type
- Embedding
Instructions
Same as the code completion server, but using an embeddings model and slightly different parameters.
llama-server -hf ggml-org/Nomic-Embed-Text-V2-GGUF --port 8010 -ub 2048 -b 2048 --ctx-size 2048 --embeddings
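To check the embeddings server, you can request an embedding through the OpenAI-compatible endpoint (a sketch assuming the default localhost binding and port 8010 used above):

curl http://127.0.0.1:8010/v1/embeddings -H "Content-Type: application/json" -d '{
  "input": "hello world"
}'

The response contains the embedding vector under data[0].embedding, in the usual OpenAI response format.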