To get this to work, first you have to get an external AMD GPU working on Pi OS. The most up-to-date instructions are currently on my website: Get an AMD Radeon 6000/7000-series GPU running on Pi 5.
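Before compiling anything, it can help to confirm the card is actually visible to the Vulkan loader. This is just a quick sanity check, not part of the original steps (it assumes the vulkan-tools package, which provides vulkaninfo):
# Optional sanity check: confirm the AMD card shows up on PCIe and to Vulkan
sudo apt install -y vulkan-tools
lspci | grep -i vga      # should list the AMD card
vulkaninfo --summary     # should show the Radeon as a physical device using the RADV driver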
Once your AMD graphics card is working (and can output video), install dependencies and compile llama.cpp
with the Vulkan backend:
# Install Vulkan SDK, glslc, and cmake
sudo apt install -y libvulkan-dev glslc cmake
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with Vulkan
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
# Test the output binary (with "-ngl 33" to offload all layers to GPU)
./build/bin/llama-cli -m "PATH_TO_MODEL" -p "Hi you how are you" -n 50 -e -ngl 33 -t 4
# You should see in the output, ggml_vulkan detected your GPU. For example:
# ggml_vulkan: Found 1 Vulkan devices:
# ggml_vulkan: 0 = AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 64
Then you can download a model (e.g. off HuggingFace) and run it:
# Download llama3.2:3b
cd models && wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Run it.
cd ../
./build/bin/llama-cli -m "models/Llama-3.2-3B-Instruct-Q4_K_M.gguf" -p "Why is the blue sky blue?" -n 50 -e -ngl 33 -t 4
Results from the models I tested:
| Device | CPU/GPU | Model | Speed | Power (Peak) |
|---|---|---|---|---|
| Pi 5 - 8GB | CPU | llama3.2:3b | 4.61 Tokens/s | 13.9 W |
| Pi 5 - 8GB | CPU | llama3.1:8b | 1.99 Tokens/s | 13.2 W |
| Pi 5 - 8GB | CPU | llama2:13b | DNF | DNF |
| Pi 5 - 8GB / AMD RX 6500 XT 8GB | GPU | llama3.2:3b | 39.82 Tokens/s | 88 W |
| Pi 5 - 8GB / AMD RX 6500 XT 8GB | GPU | llama3.1:8b | 22.42 Tokens/s | 95.7 W |
| Pi 5 - 8GB / AMD RX 6500 XT 8GB | GPU | llama2:13b | 2.03 Tokens/s | 48.3 W |
| Pi 5 - 8GB / AMD RX 6700 XT 12GB | GPU | llama3.2:3b | 49.01 Tokens/s | 94 W |
| Pi 5 - 8GB / AMD RX 6700 XT 12GB | GPU | llama3.1:8b | 39.70 Tokens/s | 135 W |
| Pi 5 - 8GB / AMD RX 6700 XT 12GB | GPU | llama2:13b | 3.98 Tokens/s | 95 W |
| Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama3.2:3b | 48.47 Tokens/s | 156 W |
| Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama3.1:8b | 32.60 Tokens/s | 174 W |
| Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama2:13b | 2.42 Tokens/s | 106 W |
| Pi 5 - 8GB / AMD Radeon Pro W7700 16GB | GPU | llama3.2:3b | 56.14 Tokens/s | 145 W |
| Pi 5 - 8GB / AMD Radeon Pro W7700 16GB | GPU | llama3.1:8b | 39.87 Tokens/s | 52 W |
| Pi 5 - 8GB / AMD Radeon Pro W7700 16GB | GPU | llama2:13b | 4.38 Tokens/s | 108 W |
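For more repeatable speed numbers, llama.cpp also ships a llama-bench tool. The table above doesn't record the exact measurement command, so this is just a sketch of one way to collect Tokens/s figures (the model path is the one downloaded earlier; the layer count is a placeholder):
# Sketch: benchmark prompt processing and generation speed with all layers offloaded to the GPU
./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -ngl 99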
Note: Ollama doesn't currently support Vulkan, and some parts of llama.cpp still assume x86 rather than Arm or RISC-V.
Note 2: With larger models, you may run into an error like vk::Device::allocateMemory: ErrorOutOfDeviceMemory (see the llama.cpp bug report "Vulkan Device memory allocation failed"). If so, try capping the maximum Vulkan allocation size to 1 or 2 GB:
export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=2147483647 # 2GB buffer
export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=1073741824 # 1GB buffer
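The variable just needs to be set in the environment of the process that runs the model; for example, with the same model as above:
# Example: cap Vulkan allocations at 1 GB for a single run
GGML_VK_FORCE_MAX_ALLOCATION_SIZE=1073741824 ./build/bin/llama-cli -m "models/Llama-3.2-3B-Instruct-Q4_K_M.gguf" -p "Why is the blue sky blue?" -n 50 -e -ngl 33 -t 4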
Note 3: Power consumption was measured at the wall (total system power draw) using a ThirdReality Zigbee Smart Outlet through Home Assistant. I don't have a way of measuring total energy consumed per test (e.g. in Joules), but that would be nice at some point.
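As a rough stopgap, energy can be estimated from average power and run time; the numbers below are only an example of the arithmetic, not a measured result:
# Rough estimate: energy (J) = average power (W) x run time (s)
# e.g. a run averaging 95 W for 60 seconds:
echo "95 * 60" | bc   # ~5700 J, or about 1.6 Wh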