[Frontend] Gemma3n audio transcriptions/translations endpoint #23735
Conversation
Signed-off-by: NickLucche <[email protected]>
Code Review
This pull request enables Gemma3n for the audio transcription and translation endpoints, which is a great addition. The changes include a soft API modification to add a `to_language` parameter, which will be useful for future enhancements. The tests have been updated to cover Gemma3n, including parameterization over different models, which is good practice. I've found one issue regarding input validation for the new model implementation that should be addressed.
```python
if task_type == "transcribe" and full_lang_name:
    prompt += f" into {full_lang_name}"
elif task_type == "translate":
    if full_lang_name:
```
We should validate that both languages are valid when doing translation
I'm assuming languages are validated beforehand, here: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/speech_to_text.py#L91.
Do you have some extra checks in mind?
I see, in that case perhaps we should pass the `full_lang_name` directly into the method?
I also think that we should have a separate function for each task to reduce branching
This PR enables Gemma3n for use with the audio-specific endpoints (transcriptions/translations).
I've also added a "soft" interface change: a `to_language` parameter in the API, as I found it helps somewhat with translation. The rationale is that I'd like to keep these changes lightweight for now, since we're only slightly steering away from the original OpenAI Whisper-only spec, and instead see where the broader audio community wants it to be.
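To make the shape of the change concrete, a request to the translations endpoint carrying the new field might look like the payload below. This is a hypothetical sketch: the field name `to_language` follows this PR, while the model name and the `"it"` target value are placeholders, and a real request would also attach the audio file as multipart form data.

```python
# Hypothetical request payload for the /v1/audio/translations endpoint.
# `language`/`model` follow the OpenAI-style audio API; `to_language`
# is the extra, non-OpenAI field introduced by this PR.
payload = {
    "model": "google/gemma-3n-E2B-it",  # placeholder model name
    "language": "en",        # source language of the audio
    "to_language": "it",     # target language (new in this PR)
}
```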
No chunking for now, as I believe a long-audio capability assessment is in order for this model.
A list of additional minor changes:
I also plan to follow up with revamped benchmark+evaluation scripts to better cover these models.