Support for NemotronH Nano VLM with an optimized vision model (vLLM native) #23753
base: main
Conversation
Signed-off-by: Daniel Afrimi <[email protected]>
Code Review

This pull request adds support for the NemotronH Nano VLM model, including a native vLLM implementation of the C-RADIO vision encoder. The changes are well-structured, introducing new model files for nano_vlm and radio, along with corresponding tests. My review has identified a critical bug in the new test file that would prevent it from running correctly, a potential TypeError in the model initialization due to unsafe dictionary access, and the use of print statements for debugging, which should be replaced with proper logging. Addressing these points will improve the robustness and maintainability of the new model support.
pixel_values = [
    img_processor(
        images,
        return_tensors='pt').pixel_values.to(torch_dtype)[:, :, :, :640]
    for images in images
]
The list comprehension for pixel_values seems to have a bug. The img_processor is called with a single PIL image, which returns a 3D tensor for pixel_values. However, a 4D slice [:, :, :, :640] is then applied to this 3D tensor, which will cause a runtime error.

To fix this, you can process each image as a list containing a single image so that the pixel_values tensor is 4D. Also, using a more descriptive loop variable name would improve readability, since for images in images shadows the outer variable.
Suggested change:

pixel_values = [
    img_processor(
        [img],
        return_tensors='pt').pixel_values.to(torch_dtype)[:, :, :, :640]
    for img in images
]
model_name = hf_config.args.get("model")
hidden_size, num_layers, num_heads, intermediate_size = vit_dims.get(
    model_name)
The call to vit_dims.get(model_name) can return None if model_name is not found in the vit_dims dictionary. This would cause a TypeError when attempting to unpack the result into hidden_size, num_layers, num_heads, intermediate_size. It's safer to check that the model name exists in the dictionary before retrieving the value.
Suggested change:

model_name = hf_config.args.get("model")
if model_name not in vit_dims:
    raise ValueError(
        f"Unsupported ViT model type for Radio: {model_name}. "
        f"Supported types are: {list(vit_dims.keys())}")
hidden_size, num_layers, num_heads, intermediate_size = vit_dims[
    model_name]
print("in intervit cls token init num_tokens: ", num_tokens) | ||
print("in intervit cls token init num_registers: ", | ||
self.num_registers) |
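These debug print statements should be replaced with vLLM's standard logging, as noted in the review summary. A minimal sketch using vLLM's init_logger (the message wording is illustrative; num_tokens and self.num_registers come from the snippet above):

from vllm.logger import init_logger

logger = init_logger(__name__)

# Replaces the debug prints in the InternViT cls-token initialization.
logger.debug("InternViT cls token init: num_tokens=%d, num_registers=%d",
             num_tokens, self.num_registers)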
This PR builds on #23644.

In addition to supporting the new VL model, it introduces the vision encoder implementation (C-RADIO) using vLLM's native layers. To reduce code duplication, the implementation leverages InternVisionModel blocks.
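For illustration only, a minimal sketch of what reusing InternVisionModel blocks for the C-RADIO encoder could look like. The RadioVisionModel class name and the RADIO-specific pieces mentioned in the comments are assumptions, not the PR's actual code; only InternVisionModel itself is an existing vLLM component.

import torch
import torch.nn as nn

from vllm.model_executor.models.intern_vit import InternVisionModel


class RadioVisionModel(nn.Module):
    """Hypothetical C-RADIO encoder that delegates its transformer
    stack to vLLM's existing InternViT implementation."""

    def __init__(self, config, quant_config=None):
        super().__init__()
        # The shared ViT blocks come from InternVisionModel; RADIO-specific
        # parts (e.g. register tokens, input conditioning) would be layered
        # around this shared stack rather than reimplemented.
        self.vision_model = InternVisionModel(config,
                                              quant_config=quant_config)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        return self.vision_model(pixel_values=pixel_values)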