RPC distribution of models possible in VRAM and RAM mix? Or only VRAM support? #8720
Replies: 1 comment
This is very old, but I simply created 2 RPC servers, one with a CUDA GPU (…
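A minimal sketch of such a mixed setup (one CUDA-backed and one CPU-only rpc-server), assuming the build and launch commands from examples/rpc/README.md; the ports and the CPU-only second server are assumptions here:

```sh
# Two rpc-server instances built from differently configured trees:
# one CUDA-enabled build (-DGGML_CUDA=ON -DGGML_RPC=ON) serving VRAM ...
bin/rpc-server -p 50052
# ... and one CPU-only build (-DGGML_RPC=ON) on another machine serving system RAM.
bin/rpc-server -p 50052
# The client then lists both endpoints, e.g. --rpc gpu-host:50052,cpu-host:50052
```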
I came across the neat RPC distribution feature of llama.cpp and wanted to give it a shot: https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md
Can somebody confirm that this should by now also work when offloading part of the model to a GPU and part of it to a CPU (RAM)-based instance?
I tried the following with the model "mistral-7b-instruct-v0.1.Q4_K_M.gguf" (which I found pre-downloaded in the models folder of my llama.cpp instance).
On a smartphone (4 GB RAM) I installed llama.cpp via termux, built the CPU variant (I hope my assumption is correct that "cmake .. -DGGML_RPC=ON" is what is needed on the phone), and started it via "bin/rpc-server -p 50052". (I also tried the same variant on a laptop of mine, which likewise has no CUDA support.)
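For reference, a sketch of what that CPU-only build and server start might look like, following examples/rpc/README.md (the build directory name is arbitrary; the same CMake invocation should apply under termux):

```sh
# CPU-only build with the RPC backend enabled
mkdir build-rpc-cpu && cd build-rpc-cpu
cmake .. -DGGML_RPC=ON
cmake --build . --config Release

# Start the RPC server; it must be reachable from the client machine on this port
bin/rpc-server -p 50052
```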
On my desktop machine (RTX 3060 with 12 GB VRAM) I ran another instance of the llama.cpp binary, built with CUDA support.
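If that desktop instance is also an rpc-server (rather than the client itself), the CUDA-side build from the README would be along these lines:

```sh
# CUDA-enabled build with the RPC backend, per examples/rpc/README.md
mkdir build-rpc-cuda && cd build-rpc-cuda
cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build . --config Release

# This rpc-server exposes the RTX 3060 (VRAM) as an RPC backend on localhost
bin/rpc-server -p 50052
```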
Then I started the client, addressing both of them (one as localhost and one as a remote host, so to speak, by specifying their IP addresses).
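The client invocation would then look roughly like the following; the phone's IP address and the model path are placeholders, and the client binary must itself come from a build with -DGGML_RPC=ON:

```sh
# Offload all layers (-ngl 99) and let llama.cpp split them across the two
# RPC backends: the local CUDA rpc-server and the CPU rpc-server on the phone.
bin/llama-cli -m models/mistral-7b-instruct-v0.1.Q4_K_M.gguf \
    -p "Hello, my name is" -ngl 99 \
    --rpc 127.0.0.1:50052,192.168.1.42:50052
```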
I was able to offload and execute the Mistral model (although very slowly, as expected).
When trying other models that I had downloaded separately into the models folder, e.g. Mixtral 8x7B, also in the Q4_K_M variant, I was unable to perform the offloading.
The program got stuck in the "loading tensors" stage, and the connections to the two RPC servers were terminated when it reached that point.
I would appreciate it if somebody could tell me what to look out for when running RPC distribution, beyond the explanations in the README above.
Does the feature only work with VRAM? Can it not be mixed with RAM, or not run on RAM alone?
Does one have to be careful when selecting other models to execute?
Besides the Mistral model, all the models bundled with ollama that I tried ran into issues and did not work either.
Thanks in advance for any feedback. Highly appreciated.
I am not 100% sure whether this question has already been answered somewhere else; a quick search through the issues and the forum, however, didn't fully clear it up for me.