Replies: 1 comment
-
Did you make some progress figuring out the proper chat template for vLLM and DeepSeek?
-
I have tested both llama.cpp and vLLM with DeepSeek, and there seem to be a couple of big differences that I do not properly understand, or I may simply have configured something in a way that causes them.
In the vLLM case, for example, I am not sure whether I am using the correct Jinja file as the template parameter (--chat-template ./template_chatglm.jinja), and whether that could be the reason the output is a mix of Chinese and English text. Also, is there an option that would let the web UI launch immediately, before the whole model is loaded into the GPU's VRAM?
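For reference, here is a minimal sketch of how I have been checking the template question, assuming a Hugging Face checkpoint such as deepseek-ai/deepseek-llm-7b-chat (a placeholder; substitute whatever model you are actually serving). It prints whether the model ships its own chat template in tokenizer_config.json and renders a sample prompt with it, so the result can be compared against what template_chatglm.jinja produces; a ChatGLM-style template is unlikely to match the format a DeepSeek chat model was trained on, which could plausibly explain the mixed-language output.

```python
# Sketch: inspect the chat template bundled with the model.
# Assumption: model_id is a placeholder; replace it with the model you serve in vLLM.
from transformers import AutoTokenizer

model_id = "deepseek-ai/deepseek-llm-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# If this is True, the model already defines a chat template in its
# tokenizer_config.json, and vLLM should normally pick it up without --chat-template.
print("Built-in chat template present:", tokenizer.chat_template is not None)

messages = [
    {"role": "user", "content": "Hello, who are you?"},
]

# Render the prompt exactly as the model's own template expects it; compare this
# string with the prompt produced by template_chatglm.jinja to spot format mismatches.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```

If the built-in template is present, one thing worth trying is dropping --chat-template entirely and letting vLLM use the template that ships with the model, then checking whether the Chinese/English mixing goes away.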