Question about attention computation #944

Hi, thank you for the amazing demo and docs! I have a question about this [section](https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/zero_inference/model-support.md#5-compute-attention-scores) of the ZeRO-Inference documentation. It states: `"Thus, our current implementation computes attention scores on CPU"`. Is there a detailed latency or throughput comparison between GPU attention and CPU attention that supports this decision? I am also curious about the implementation details of the CPU attention computation. Thank you!
