
Regression in Gemma-3n model quantization support #1765

@shubhra

Description

Summary

  • Re-quantizing Gemma-3n-E2B-it to W4A16 with the same recipe that previously worked now yields models with random / very poor accuracy on OpenLLM benchmarks. The regression is suspected to affect both E2B and E4B, but has only been reproduced on E2B so far.
  • The regression localizes to self_attn.o_proj.weight_packed for layers ≥ 20: in the bad model these packed weights collapse to a constant 0x88 byte pattern, while the other packed tensors (q/k/v, MLP) stay statistically close to the good model. A detection sketch follows below.
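
One quick check that does not require running evals is to scan the saved checkpoint for packed weights whose raw bytes are constant (u8 std == 0). The sketch below is not the original diff script; it assumes the checkpoint is stored as safetensors shards, and the helper name and directory path are placeholders.

import glob
import torch
from safetensors import safe_open

def find_collapsed_packed_weights(model_dir: str):
    """Return (tensor_name, byte) for every *.weight_packed tensor whose bytes are all identical."""
    collapsed = []
    for shard in sorted(glob.glob(f"{model_dir}/*.safetensors")):
        with safe_open(shard, framework="pt") as f:
            for name in f.keys():
                if not name.endswith(".weight_packed"):
                    continue
                u8 = f.get_tensor(name).view(torch.uint8)  # reinterpret packed int32 as raw bytes
                if u8.float().std() == 0:                  # every byte identical, e.g. 0x88
                    collapsed.append((name, int(u8.flatten()[0])))
    return collapsed

# Placeholder path; per the finding above, the bad model should flag
# self_attn.o_proj.weight_packed for layers >= 20.
for name, byte in find_collapsed_packed_weights("gemma-3n-E2B-it-quantized.w4a16"):
    print(f"{name}: constant byte 0x{byte:02x}")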

Suspected cause

  • Recent changes to the Gemma-3n model definition in Transformers and/or to its autowrap module mapping.

Diff of the two models obtained with the same recipe:
A - older run, good model
B - newer run with the same recipe; this model produces random accuracies

model.language_model.layers.20.self_attn.o_proj.weight_packed
  A: model-00001-of-00003.safetensors
     shape=(2048, 256) dtype=torch.int32 bytes=2097152
     u8_mean=136.000 u8_std=0.000 sum32=285212672 l2=1.969e+05
     hex[:16]=88888888888888888888888888888888
  B: model-00001-of-00003.safetensors
     shape=(2048, 256) dtype=torch.int32 bytes=2097152
     u8_mean=135.991 u8_std=43.678 sum32=285193004 l2=2.068e+05
     hex[:16]=46885797b66863cca9685cb3637b3a77

(Other tensors like q_proj, mlp.up/gate/down_proj are statistically similar across A/B.)
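
For reference, a sketch of how the per-tensor statistics above can be computed (not necessarily the exact script used; it assumes the stats are taken over the raw bytes of the packed int32 tensor, which matches the numbers shown):

import torch
from safetensors import safe_open

def packed_weight_stats(shard_path: str, tensor_name: str):
    with safe_open(shard_path, framework="pt") as f:
        t = f.get_tensor(tensor_name)
    u8 = t.view(torch.uint8)          # raw bytes of the int32 packing
    u8d = u8.to(torch.float64)        # double precision to keep sums exact
    return {
        "shape": tuple(t.shape),
        "dtype": t.dtype,
        "bytes": u8.numel(),
        "u8_mean": u8d.mean().item(),
        "u8_std": u8d.std().item(),
        "sum32": int(u8.to(torch.int64).sum().item()),  # byte sum; matches the sum32 values above
        "l2": u8d.norm().item(),                        # L2 norm of the bytes; matches l2 above
        "hex[:16]": u8.flatten()[:16].numpy().tobytes().hex(),
    }

print(packed_weight_stats(
    "model-00001-of-00003.safetensors",
    "model.language_model.layers.20.self_attn.o_proj.weight_packed",
))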

Model A: https://huggingface.co/RedHatAI/gemma-3n-E2B-it-quantized.w4a16
Model B: local model, can share if needed.
