-
I saw we can use multiple threads to invoke the APIs, ref. But that's not quite my question. I also see that PyTorch inference is thread-safe, ref. However, I'd like to double-confirm with you.
Replies: 2 comments
-
A short answer is no. The KV caching mechanism uses forward hooks which are installed in the module objects and will cause race issues when used by multiple threads. The `--threads` option provides more low-level control over how CPU operations are parallelized, but it's less relevant if you're using a GPU. Even if you disabled KV caching, multi-threaded usage would generally be inefficient because of the GIL. Multiprocessing will buy you more, as with Multi-Instance GPU, and multithreading may be more viable if you use the PyTorch C++ API as in your second reference. If you're integrating Whisper with a serving layer, it may support automatic batching, e.g. in TensorFlow Serving, which usually makes more efficient use of GPU resources.
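To illustrate the multiprocessing approach: a minimal sketch, one process per audio file, where each worker would load its own model copy so no forward-hook state is shared across processes. The model loading and transcription here are stubbed (the real calls, shown in comments, assume the `openai-whisper` package); the parallelization pattern is the point.

```python
from multiprocessing import Pool


def transcribe(path):
    # Real use would be (assumes openai-whisper is installed):
    #   import whisper
    #   model = whisper.load_model("base")  # loaded per process, not shared
    #   return model.transcribe(path)["text"]
    # Stubbed here so the sketch is self-contained and runnable anywhere:
    return f"transcript of {path}"


if __name__ == "__main__":
    files = ["a.wav", "b.wav", "c.wav"]
    # Each worker process gets its own interpreter (no GIL contention)
    # and its own model instance (no shared KV-cache hooks).
    with Pool(processes=3) as pool:
        results = pool.map(transcribe, files)
    print(results)
```

Loading the model once per process (e.g. via a Pool initializer) rather than once per call would be the practical choice, since `load_model` is expensive; the sketch keeps it inline for brevity.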
-
What are the command-line switches to enable Whisper to take advantage of multiple GPUs?