Description
Your current environment
The output of python collect_env.py
==============================
System Info
==============================
OS : CentOS Stream 9 (x86_64)
GCC version : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-9)
Clang version : Could not collect
CMake version : Could not collect
Libc version : glibc-2.34
==============================
PyTorch Info
==============================
PyTorch version : 2.7.1+cpu
Is debug build : False
CUDA used to build PyTorch : None
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.11
==============================
CUDA / GPU Info
==============================
Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.1.2
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pyzmq==27.0.2
[pip3] torch==2.7.1+cpu
[pip3] torchaudio==2.7.1+cpu
[pip3] torchvision==0.22.1+cpu
[pip3] transformers==4.55.4
[pip3] triton==3.3.1
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.10.1.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
==============================
Environment Variables
==============================
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
For offline inference, model loading from a local path is broken when HF_HUB_OFFLINE is set to 1.
I am trying to load a model from a local path for offline inference using code similar to the snippet below. /local/path/to/model
is the path to a directory containing a HuggingFace model snapshot.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="/local/path/to/model")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
When I try to run this code after setting HF_HUB_OFFLINE=1, a LocalEntryNotFoundError exception is raised with this error message:
"Cannot find an appropriate cached snapshot folder for the specified revision on the local disk and outgoing traffic has been disabled. To enable repo look-ups and downloads online, pass 'local_files_only=False' as input."
This workflow used to work without any issues until this PR was merged recently: #22526.
Upon further debugging, we found that #22526 and #21680 might be introducing conflicting changes. With #21680, vLLM tries to log non-default arguments, which works fine on its own. But together with #22526, it introduces an issue during the logging of non-default args:
- After specifying /local/path/to/model, the lookup of the local model works fine.
- Once the local model lookup is done, vLLM tries to log the non-default args.
- During logging of non-default args, vLLM creates a default instance of EngineArgs() when non-default args are present: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/utils.py#L316
- In this default instance, model is set to "Qwen/Qwen3-0.6B": https://github.com/vllm-project/vllm/blob/main/vllm/config/__init__.py#L275
- With [Fix] fix offline env use local mode path (#22526), during __post_init__() of EngineArgs() vLLM tries to load "Qwen/Qwen3-0.6B" from disk here: https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L463
- When vLLM tries to load the model, it complains that "Qwen/Qwen3-0.6B" is not found in the local path. I do not have "Qwen/Qwen3-0.6B" in my local model cache, nor do I need it, since I'm loading an entirely different model for inference.
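The interaction can be boiled down to a short sketch (my simplification, not the actual vLLM logging code; it assumes vLLM 0.10.1.1, HF_HUB_OFFLINE=1, and no "Qwen/Qwen3-0.6B" in the local HF cache):

import os
os.environ["HF_HUB_OFFLINE"] = "1"

from vllm.engine.arg_utils import EngineArgs

# The non-default-args logging path constructs a default EngineArgs(), whose
# __post_init__ tries to resolve the default model "Qwen/Qwen3-0.6B" from the
# local HF cache and fails with LocalEntryNotFoundError.
default_args = EngineArgs()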
During logging of non-default args, vLLM should avoid loading any model; failing to resolve the default model should not prevent the non-default args from being logged. This issue is affecting our offline workflows with HF_HUB_OFFLINE=1. Please help take a look at this issue.
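One possible direction (just a sketch of the idea, not a proposed patch to the actual vllm/entrypoints/utils.py code) would be to compare against the declared dataclass defaults instead of instantiating a second, default EngineArgs(), so that __post_init__ and the model lookup are never triggered during logging:

from dataclasses import MISSING, fields

def non_default_args(args) -> dict:
    """Return {field_name: value} for every dataclass field whose current
    value differs from its declared default, without constructing a default
    instance of the dataclass (and thus without running __post_init__)."""
    diff = {}
    for f in fields(args):
        if f.default is not MISSING:
            default = f.default
        elif f.default_factory is not MISSING:
            default = f.default_factory()
        else:
            continue  # required field with no default; nothing to compare
        value = getattr(args, f.name)
        if value != default:
            diff[f.name] = value
    return diff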
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.