This repository was archived by the owner on Jun 3, 2025. It is now read-only.

dsikka (Contributor) commented Jul 25, 2023

Note: SparseZoo currently has no captioning models.
We have an open issue with open_clip that tracks some of the problems raised with CoCa models.

Summary

clip_caption

  • Implement the CLIPCaptioning and CLIPDecoder pipelines, which produce a caption for a given image. They build on the CLIPVisual and CLIPText pipelines previously implemented for zero-shot classification, with some modifications to make those pipelines more generic.
  • The captioning pipeline adds a _generate function, adapted from open_clip, which applies beam search to build the caption: https://github.com/neuralmagic/open_clip/blob/onnx-edit/src/open_clip/coca_model.py
  • One caveat: in open_clip's implementation the input sequence length is dynamic, whereas this implementation uses padded sequences.
  • The exported ONNX models all originate from open_clip.
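
For readers unfamiliar with the approach, the following is a minimal, generic sketch of beam search over fixed-length padded inputs. It is not the code in this PR or in open_clip. `PAD_ID`, `EOS_ID`, `pad_to_length`, `beam_search`, and `toy_step` are hypothetical names introduced for illustration, and the toy scoring table stands in for a real decoder model. It also omits refinements such as length normalization.

```python
import math
from typing import Callable, List, Sequence, Tuple

PAD_ID = 0  # hypothetical padding token id
EOS_ID = 1  # hypothetical end-of-sequence token id


def pad_to_length(tokens: Sequence[int], length: int, pad_id: int = PAD_ID) -> List[int]:
    """Right-pad a token sequence so it always has the same fixed length."""
    if len(tokens) > length:
        raise ValueError("sequence is longer than the fixed input length")
    return list(tokens) + [pad_id] * (length - len(tokens))


def beam_search(
    step_fn: Callable[[List[int], int], List[float]],
    start_tokens: Sequence[int],
    max_len: int,
    beam_width: int = 3,
) -> List[int]:
    """Return the highest-scoring token sequence found by beam search.

    step_fn(padded_tokens, num_real_tokens) must return one log-probability
    per vocabulary entry for the next token.
    """
    beams: List[Tuple[float, List[int]]] = [(0.0, list(start_tokens))]
    finished: List[Tuple[float, List[int]]] = []
    while beams:
        candidates: List[Tuple[float, List[int]]] = []
        for score, tokens in beams:
            # The model always sees a fixed-length input; the real length is
            # passed separately so padding can be ignored.
            log_probs = step_fn(pad_to_length(tokens, max_len), len(tokens))
            for token_id, log_prob in enumerate(log_probs):
                candidates.append((score + log_prob, tokens + [token_id]))
        # Keep only the best `beam_width` expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == EOS_ID or len(tokens) >= max_len:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
    return max(finished, key=lambda c: c[0])[1]


def toy_step(padded_tokens: List[int], num_real_tokens: int) -> List[float]:
    """Toy stand-in for a decoder: scores depend only on the last real token."""
    last = padded_tokens[num_real_tokens - 1]
    table = {
        2: [0.0, 0.1, 0.1, 0.8],  # after token 2, token 3 is most likely
        3: [0.0, 0.7, 0.2, 0.1],  # after token 3, EOS is most likely
    }
    return [math.log(p) if p > 0 else -math.inf for p in table[last]]


print(beam_search(toy_step, [2], max_len=6, beam_width=2))  # [2, 3, 1]
```

Passing the real length alongside the padded input is one way to keep a fixed input shape while still letting the decoder ignore padding; a real model would more typically use an attention mask for this.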

Testing

  • Added captioning tests to the existing CLIP tests
  • Also ran the following script to generate captions for various images:
from deepsparse import BasePipeline
from deepsparse.clip import CLIPCaptionInput, CLIPVisualInput

# Directory holding the three ONNX models exported from open_clip
root = "caption_models"
model_path_visual = f"{root}/clip_visual.onnx"
model_path_text = f"{root}/clip_text.onnx"
model_path_decoder = f"{root}/clip_text_decoder.onnx"

kwargs = {
    "visual_model_path": model_path_visual,
    "text_model_path": model_path_text,
    "decoder_model_path": model_path_decoder,
}
# Build the captioning pipeline from the visual, text, and decoder models
pipeline = BasePipeline.create(task="clip_caption", **kwargs)

# Wrap the image path in the pipeline's input schema and run inference
pipeline_input = CLIPCaptionInput(image=CLIPVisualInput(images="mountain.jpg"))
output = pipeline(pipeline_input)
print(output)

Example images and their generated captions:

mountain
Caption: a view of mountains in the background .

thailand
Caption: an adult elephant and a baby elephant .

mug
Caption: a cup of coffee .

dsikka changed the base branch from main to clip_zshot July 25, 2023 20:49
dsikka marked this pull request as ready for review July 27, 2023 20:59
dsikka force-pushed the captioning branch 4 times, most recently from 743bd95 to 402bfc6 July 31, 2023 23:54
dsikka requested review from bfineran and dbogunowicz August 1, 2023 00:53
dsikka requested a review from Satrat August 1, 2023 00:53
dsikka assigned dsikka and unassigned rahul-tuli Aug 1, 2023
dsikka requested a review from rahul-tuli August 1, 2023 00:54
dbogunowicz previously approved these changes Aug 1, 2023
bfineran previously approved these changes Aug 1, 2023
Base automatically changed from clip_zshot to main August 2, 2023 18:03
bfineran dismissed stale reviews from dbogunowicz and themself August 2, 2023 18:03

The base branch was changed.

dsikka merged commit ffeb98f into main Aug 7, 2023
dsikka deleted the captioning branch August 7, 2023 15:13