Inconsistent 'query_pre_attn_scalar' Setting Between 9B and 27B Models

While looking at a recent commit, I noticed that the `query_pre_attn_scalar` parameter is configured inconsistently between the 9B and 27B models in this repository.

Specifically:

- In the 9B model, `query_pre_attn_scalar` is not set explicitly and appears to fall back to a default derived from `head_dim` (256), rather than 224, which is what `hidden_size` divided by the number of attention heads would give.
- In the 27B model, `query_pre_attn_scalar` is explicitly set to 144, which equals `hidden_size` divided by the number of attention heads.
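To make the numbers above concrete, here is a minimal sketch of the arithmetic. It assumes that queries are scaled by `query_pre_attn_scalar ** -0.5` before the attention dot product, and that the 9B and 27B configs use `hidden_size` 3584 with 16 heads and 4608 with 32 heads respectively; those shapes are my reading of the published configs rather than something stated in this repository. The helper names `hidden_per_head` and `query_scale` are hypothetical and exist only for illustration.

```python
# Illustrative only: helper names are hypothetical, and the config shapes
# below are assumptions used to reproduce the 224 and 144 figures.

def hidden_per_head(hidden_size: int, num_attention_heads: int) -> int:
    """Return hidden_size divided evenly across the attention heads."""
    return hidden_size // num_attention_heads


def query_scale(query_pre_attn_scalar: float) -> float:
    """Multiplier applied to queries, assuming scale = scalar ** -0.5."""
    return query_pre_attn_scalar ** -0.5


# Assumed shapes for the two models.
per_head_9b = hidden_per_head(hidden_size=3584, num_attention_heads=16)
per_head_27b = hidden_per_head(hidden_size=4608, num_attention_heads=32)

# 9B: the scalar appears to default to head_dim (256), not per_head_9b.
scale_9b_actual = query_scale(256)
scale_9b_alternative = query_scale(per_head_9b)

# 27B: the scalar is explicitly set to the per-head hidden size.
scale_27b = query_scale(per_head_27b)

print(f"9B  hidden/heads = {per_head_9b}, scale with 256 = {scale_9b_actual:.5f}, "
      f"scale with {per_head_9b} = {scale_9b_alternative:.5f}")
print(f"27B hidden/heads = {per_head_27b}, scale = {scale_27b:.5f}")
```

Under these assumptions the two models follow different conventions: the 9B scale comes from `head_dim`, while the 27B scale comes from the per-head share of `hidden_size`.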

Could you please provide some insight into the reasoning behind this difference? Is there a specific rationale for not setting `query_pre_attn_scalar` in the 9B model while explicitly setting it in the 27B model?
