The short answer is no.

The KV caching mechanism uses forward hooks installed on the module objects, which will cause race conditions when used from multiple threads. The `--threads` option provides lower-level control over how CPU operations are parallelized, but it is less relevant if you're running on a GPU.
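To illustrate the hazard, here is a minimal, torch-free sketch of the hook pattern (the `FakeModule` class and hook names are invented for illustration; Whisper's real hooks accumulate key/value tensors). The point is that a hook which accumulates state keyed by module performs a check-then-act on shared data, which is not atomic when several threads call `forward` on the same module objects:

```python
import threading

class FakeModule:
    """Torch-free stand-in for an nn.Module supporting forward hooks (illustrative)."""
    def __init__(self):
        self._hooks = []

    def register_forward_hook(self, fn):
        self._hooks.append(fn)

    def forward(self, x):
        out = x * 2  # placeholder for the real computation
        for hook in self._hooks:
            hook(self, x, out)
        return out

# A KV-cache-style hook: stores accumulated output keyed by module.
# Because `cache` is shared, two threads decoding different audio through
# the SAME module objects would read and overwrite each other's entries.
cache = {}

def caching_hook(module, inputs, output):
    if module not in cache:
        cache[module] = output
    else:
        # in the real cache this concatenates along the time axis
        cache[module] = cache[module] + output

m = FakeModule()
m.register_forward_hook(caching_hook)
m.forward(3)
m.forward(4)
print(cache[m])  # 14 from a single thread; interleaved threads can corrupt this
```

From one thread the accumulation is deterministic, but the `if module not in cache` check followed by the write is exactly the kind of interleaving-sensitive sequence that breaks once multiple threads share the hooks.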

Even if you disabled KV caching, multi-threaded usage would generally be inefficient because of the GIL. Multiprocessing will buy you more, for example with Multi-Instance GPU, and multithreading may become more viable if you use the PyTorch C++ API as in your second reference.

If you're integrating Whisper with a serving layer, it may support automatic batching, e.g. TensorFlow Serving, …
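The dynamic batching such serving layers perform can be sketched with a plain queue: requests are collected until a batch fills or a short timeout elapses, then the model runs once on the whole batch. Everything here (function names, the placeholder model) is invented for illustration, not any serving layer's actual API:

```python
import queue
import threading

def serve_batched(req_q, model, batch_size=4, timeout=0.1):
    """Drain requests into batches and run the model once per batch.

    Each request is (payload, reply_q); None is a shutdown sentinel.
    """
    while True:
        item = req_q.get()
        if item is None:
            return
        batch = [item]
        while len(batch) < batch_size:
            try:
                nxt = req_q.get(timeout=timeout)
            except queue.Empty:
                break  # timeout: serve a partial batch
            if nxt is None:
                req_q.put(None)  # re-queue sentinel for the outer loop
                break
            batch.append(nxt)
        payloads = [p for p, _ in batch]
        outputs = model(payloads)  # one forward pass for the whole batch
        for (_, reply_q), out in zip(batch, outputs):
            reply_q.put(out)

if __name__ == "__main__":
    fake_model = lambda xs: [x.upper() for x in xs]  # stands in for batched inference
    req_q = queue.Queue()
    worker = threading.Thread(target=serve_batched, args=(req_q, fake_model))
    worker.start()
    replies = [queue.Queue() for _ in range(3)]
    for text, rq in zip(["a", "b", "c"], replies):
        req_q.put((text, rq))
    print([rq.get() for rq in replies])  # ['A', 'B', 'C']
    req_q.put(None)
    worker.join()
```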

Answer selected by jongwook