RPC distribution of models possible in VRAM and RAM mix? Or only VRAM support? #8720
Replies: 1 comment
This is very old, but I simply created 2 RPC servers, one with a CUDA GPU (…
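A minimal sketch of such a mixed setup (one CUDA-backed and one CPU-only rpc-server), assuming the build and launch commands from examples/rpc/README.md; the ports and the CPU-only second server are assumptions here:

```sh
# Two rpc-server instances built from differently configured trees:
# one CUDA-enabled build (-DGGML_CUDA=ON -DGGML_RPC=ON) serving VRAM ...
bin/rpc-server -p 50052
# ... and one CPU-only build (-DGGML_RPC=ON) on another machine serving system RAM.
bin/rpc-server -p 50052
# The client then lists both endpoints, e.g. --rpc gpu-host:50052,cpu-host:50052
```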
I came across the neat RPC distribution feature of llama.cpp and wanted to give it a shot: https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md
Can somebody confirm that this should by now also work when offloading part of the model to a GPU and part of it to a CPU (RAM)-based instance?
I tried the following with the model "mistral-7b-instruct-v0.1.Q4_K_M.gguf" (which I found pre-downloaded in the models folder of my llama.cpp instance).
On a smartphone (4 GB RAM) I installed llama.cpp via termux, built the CPU variant (I hope my assumption is correct that "cmake .. -DGGML_RPC=ON" is what is needed on the phone), and started it via "bin/rpc-server -p 50052". (I also tried the same variant on a laptop of mine, which likewise has no CUDA support.)
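For reference, a sketch of what that CPU-only build and server start might look like, following examples/rpc/README.md (the build directory name is arbitrary; the same CMake invocation should apply under termux):

```sh
# CPU-only build with the RPC backend enabled
mkdir build-rpc-cpu && cd build-rpc-cpu
cmake .. -DGGML_RPC=ON
cmake --build . --config Release

# Start the RPC server; it must be reachable from the client machine on this port
bin/rpc-server -p 50052
```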
On my desktop machine (RTX 3060 with 12 GB VRAM) I ran another instance of the llama.cpp binary, built with CUDA support.
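If that desktop instance is also an rpc-server (rather than the client itself), the CUDA-side build from the README would be along these lines:

```sh
# CUDA-enabled build with the RPC backend, per examples/rpc/README.md
mkdir build-rpc-cuda && cd build-rpc-cuda
cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build . --config Release

# This rpc-server exposes the RTX 3060 (VRAM) as an RPC backend on localhost
bin/rpc-server -p 50052
```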
Then I started the client, addressing both of them (one as localhost and one as a remote host, so to speak, by specifying their IP addresses).
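The client invocation would then look roughly like the following; the phone's IP address and the model path are placeholders, and the client binary must itself come from a build with -DGGML_RPC=ON:

```sh
# Offload all layers (-ngl 99) and let llama.cpp split them across the two
# RPC backends: the local CUDA rpc-server and the CPU rpc-server on the phone.
bin/llama-cli -m models/mistral-7b-instruct-v0.1.Q4_K_M.gguf \
    -p "Hello, my name is" -ngl 99 \
    --rpc 127.0.0.1:50052,192.168.1.42:50052
```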
I was able to offload and execute the Mistral model (although very slowly, as expected).
When trying other models that I had downloaded separately into the models folder, e.g. Mixtral 8x7B, also in the Q4_K_M variant, I was unable to perform the offloading.
The program got stuck in the "loading tensors" stage, and the connections to the two RPC servers were terminated when it reached that point.
I would appreciate it if somebody could tell me what to look out for when running RPC distribution, beyond the explanations in the README above.
Does the feature only work with VRAM? Can it not be mixed with RAM, or not run on RAM alone?
Does one have to be careful when selecting other models to execute?
Besides the Mistral model, all the models bundled with ollama that I tried ran into issues and did not work either.
Thanks in advance for any feedback. Highly appreciated.
I am not 100% sure whether this question has already been answered somewhere else; a quick search through the issues and the forum, however, didn't fully clear it up for me.