-
#15602 is relevant
-
Consider also that Q, K, and V need to have the same type for batched GEMM.
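As a rough sketch of the preconditions this implies, assuming `wq`/`wk`/`wv` are the per-layer ggml weight tensors (the helper name is made up):

```cpp
#include "ggml.h"

// Sketch only: preconditions under which the three projections could be
// fused into one GEMM. wq/wk/wv are assumed to be the per-layer weights.
static bool can_fuse_qkv(const struct ggml_tensor * wq,
                         const struct ggml_tensor * wk,
                         const struct ggml_tensor * wv) {
    // a batched/fused GEMM needs a single weight type
    if (wq->type != wk->type || wk->type != wv->type) {
        return false;
    }
    // same input width; for the simple no-GQA case, same output width too
    if (wq->ne[0] != wk->ne[0] || wq->ne[0] != wv->ne[0] ||
        wq->ne[1] != wk->ne[1] || wq->ne[1] != wv->ne[1]) {
        return false;
    }
    // the weights must sit back-to-back in memory to be viewed as one tensor
    const char * q = (const char *) wq->data;
    const char * k = (const char *) wk->data;
    const char * v = (const char *) wv->data;
    return k == q + ggml_nbytes(wq) && v == k + ggml_nbytes(wk);
}
```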
-
An additional opportunity for fusion would be to make the K and V mat-muls write their results directly into the KV cache, but then you would need to somehow define output pointers and strides per channel.
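To make "output pointers and strides per channel" concrete, a purely hypothetical sketch of the kind of descriptor such a fused op would need; nothing like this exists in ggml today:

```cpp
// Hypothetical, for illustration only: per-output description a fused
// QKV op would need in order to write K and V straight into the KV cache.
struct qkv_out_desc {
    void * dst;       // destination buffer (e.g. the K or V cache for this layer)
    size_t nb_token;  // byte stride between consecutive tokens in dst
    size_t nb_head;   // byte stride between heads/channels, matching the cache layout
    size_t offset;    // byte offset of the current write position in the cache
};
```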
-
While loading a model you don't know if there are going to be LoRAs, and it would break mmap. It may be better to do this fusion in the backends.
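A sketch of what doing it in the backends could look like; this is an assumption, not existing code, and it only handles the simplest case where the three mat-mul nodes are adjacent in the graph:

```cpp
#include "ggml.h"

// Sketch: detect three consecutive mat-mul nodes that share the same input
// activations and have same-type weights; a backend could lower this pattern
// to a single (or batched) GEMM without touching how the model is loaded.
static bool is_fusable_qkv(struct ggml_cgraph * gf, int i) {
    if (i + 2 >= ggml_graph_n_nodes(gf)) {
        return false;
    }
    struct ggml_tensor * q = ggml_graph_node(gf, i + 0);
    struct ggml_tensor * k = ggml_graph_node(gf, i + 1);
    struct ggml_tensor * v = ggml_graph_node(gf, i + 2);

    if (q->op != GGML_OP_MUL_MAT || k->op != GGML_OP_MUL_MAT || v->op != GGML_OP_MUL_MAT) {
        return false;
    }
    // for GGML_OP_MUL_MAT, src[0] is the weight and src[1] the activations
    if (q->src[1] != k->src[1] || q->src[1] != v->src[1]) {
        return false;
    }
    return q->src[0]->type == k->src[0]->type &&
           q->src[0]->type == v->src[0]->type;
}
```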
-
In llama-graph.cpp, the QKV projection is typically built as three separate mat-muls of the same input activations against the Q, K, and V weights.
If there are no LoRA adapters and no GQA (i.e. Q, K, and V all have the same dims), this can instead be a single GEMM between the activations A and a fused QKV weight. I think the only requirement would be that the Q, K, V weights are allocated contiguously. From llama_model::load_tensors, is it possible to allocate the three weight tensors back-to-back and expose them as one fused tensor (see the sketch below)?
We could get a rough performance benefit in the cases where there is no GQA with this simple change.
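Roughly, the change being proposed looks like this. The fused tensor name `wqkv` is an assumption, and in llama.cpp the per-tensor mat-muls actually go through `build_lora_mm`, which reduces to a plain mul_mat when no adapter is loaded; the split-by-view pattern mirrors what llama.cpp already does for architectures that ship a single fused QKV weight.

```cpp
// today: three separate mat-muls against the same activations `cur`
ggml_tensor * Qcur = ggml_mul_mat(ctx0, model.layers[il].wq, cur);
ggml_tensor * Kcur = ggml_mul_mat(ctx0, model.layers[il].wk, cur);
ggml_tensor * Vcur = ggml_mul_mat(ctx0, model.layers[il].wv, cur);
```

```cpp
// proposed (no LoRA, no GQA): one GEMM against a contiguous fused weight of
// shape [n_embd, 3*n_embd], then three views into the result
ggml_tensor * qkv = ggml_mul_mat(ctx0, model.layers[il].wqkv, cur); // [3*n_embd, n_tokens]

const size_t es = ggml_element_size(qkv);

ggml_tensor * Qcur = ggml_view_2d(ctx0, qkv, n_embd, n_tokens, qkv->nb[1], 0*n_embd*es);
ggml_tensor * Kcur = ggml_view_2d(ctx0, qkv, n_embd, n_tokens, qkv->nb[1], 1*n_embd*es);
ggml_tensor * Vcur = ggml_view_2d(ctx0, qkv, n_embd, n_tokens, qkv->nb[1], 2*n_embd*es);
```

The resulting Q, K, V are strided views into one tensor, which is the same split the fused-QKV architectures already use, so the downstream attention code should not need to change.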